WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[default0]:using world size: 384, data-parallel-size: 8, tensor-model-parallel size: 4, pipeline-model-parallel size: 12 
[default0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:PretrainedFromHF
[default0]:accumulate and all-reduce gradients in fp32 for bfloat16 data type.
[default0]:using torch.bfloat16 for parameters ...
[default0]:------------------------ arguments ------------------------
[default0]:  abort_on_unmet_fused_kernel_constraints ......... True
[default0]:  accumulate_allreduce_grads_in_fp32 .............. True
[default0]:  adam_beta1 ...................................... 0.9
[default0]:  adam_beta2 ...................................... 0.95
[default0]:  adam_eps ........................................ 1e-08
[default0]:  adlr_autoresume ................................. False
[default0]:  adlr_autoresume_interval ........................ 1000
[default0]:  apply_query_key_layer_scaling ................... True
[default0]:  apply_residual_connection_post_layernorm ........ False
[default0]:  attention_dropout ............................... 0.1
[default0]:  attention_softmax_in_fp32 ....................... False
[default0]:  bert_binary_head ................................ True
[default0]:  bert_load ....................................... None
[default0]:  bf16 ............................................ True
[default0]:  bias_dropout_fusion ............................. True
[default0]:  bias_gelu_fusion ................................ True
[default0]:  biencoder_projection_dim ........................ 0
[default0]:  biencoder_shared_query_context_model ............ False
[default0]:  block_data_path ................................. None
[default0]:  checkpoint_activations .......................... True
[default0]:  checkpoint_in_cpu ............................... False
[default0]:  checkpoint_num_layers ........................... 1
[default0]:  clip_grad ....................................... 1.0
[default0]:  codecarbon_dir .................................. None
[default0]:  consumed_train_samples .......................... 0
[default0]:  consumed_train_tokens ........................... 0
[default0]:  consumed_valid_samples .......................... 0
[default0]:  contigious_checkpointing ........................ False
[default0]:  cpu_optimizer ................................... False
[default0]:  cpu_torch_adam .................................. False
[default0]:  curriculum_learning ............................. False
[default0]:  data_impl ....................................... mmap
[default0]:  data_parallel_size .............................. 8
[default0]:  data_path ....................................... None
[default0]:  dataloader_type ................................. single
[default0]:  DDP_impl ........................................ local
[default0]:  decoder_seq_length .............................. None
[default0]:  deepscale ....................................... False
[default0]:  deepscale_config ................................ None
[default0]:  deepspeed ....................................... True
[default0]:  deepspeed_activation_checkpointing .............. True
[default0]:  deepspeed_config ................................ ./ds_config.176449.json
[default0]:  deepspeed_mpi ................................... False
[default0]:  distribute_checkpointed_activations ............. False
[default0]:  distributed_backend ............................. nccl
[default0]:  embed_layernorm ................................. True
[default0]:  embedding_path .................................. None
[default0]:  encoder_seq_length .............................. 2048
[default0]:  eod_mask_loss ................................... False
[default0]:  eval_interval ................................... 1000
[default0]:  eval_iters ...................................... 10
[default0]:  eval_only ....................................... None
[default0]:  evidence_data_path .............................. None
[default0]:  exit_duration_in_mins ........................... 1190
[default0]:  exit_interval ................................... None
[default0]:  ffn_hidden_size ................................. 57344
[default0]:  finetune ........................................ False
[default0]:  fp16 ............................................ False
[default0]:  fp16_lm_cross_entropy ........................... False
[default0]:  fp32_residual_connection ........................ False
[default0]:  gigaflos_no_embeds .............................. 0
[default0]:  global_batch_size ............................... 2048
[default0]:  glu_activation .................................. None
[default0]:  hidden_dropout .................................. 0.1
[default0]:  hidden_size ..................................... 14336
[default0]:  hysteresis ...................................... 2
[default0]:  ict_head_size ................................... None
[default0]:  ict_load ........................................ None
[default0]:  img_dim ......................................... 224
[default0]:  indexer_batch_size .............................. 128
[default0]:  indexer_log_interval ............................ 1000
[default0]:  init_method_std ................................. 0.0048
[default0]:  init_method_xavier_uniform ...................... False
[default0]:  initial_loss_scale .............................. 4294967296
[default0]:  kill_switch_path ................................ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/kill-switch-tr11-176B-exp1
[default0]:  kv_channels ..................................... 128
[default0]:  layernorm_epsilon ............................... 1e-05
[default0]:  lazy_mpu_init ................................... None
[default0]:  load ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:  local_rank ...................................... None
[default0]:  log_batch_size_to_tensorboard ................... True
[default0]:  log_interval .................................... 1
[default0]:  log_learning_rate_to_tensorboard ................ True
[default0]:  log_level ....................................... None
[default0]:  log_level_replica ............................... None
[default0]:  log_loss_scale_to_tensorboard ................... True
[default0]:  log_num_zeros_in_grad ........................... False
[default0]:  log_params_norm ................................. False
[default0]:  log_path ........................................ None
[default0]:  log_timers_to_tensorboard ....................... True
[default0]:  log_validation_ppl_to_tensorboard ............... True
[default0]:  loss_on_targets_only ............................ False
[default0]:  loss_scale ...................................... None
[default0]:  loss_scale_window ............................... 1000
[default0]:  lr .............................................. 6e-05
[default0]:  lr_decay_iters .................................. None
[default0]:  lr_decay_samples ................................ 200000000
[default0]:  lr_decay_style .................................. cosine
[default0]:  lr_decay_tokens ................................. None
[default0]:  lr_warmup_fraction .............................. None
[default0]:  lr_warmup_iters ................................. 0
[default0]:  lr_warmup_samples ............................... 183105
[default0]:  make_vocab_size_divisible_by .................... 128
[default0]:  mask_prob ....................................... 0.15
[default0]:  masked_softmax_fusion ........................... True
[default0]:  max_position_embeddings ......................... 2048
[default0]:  memory_centric_tiled_linear ..................... False
[default0]:  merge_file ...................................... None
[default0]:  micro_batch_size ................................ 2
[default0]:  min_loss_scale .................................. 1.0
[default0]:  min_lr .......................................... 6e-06
[default0]:  mmap_warmup ..................................... False
[default0]:  no_load_optim ................................... None
[default0]:  no_load_rng ..................................... None
[default0]:  no_save_optim ................................... None
[default0]:  no_save_rng ..................................... None
[default0]:  num_attention_heads ............................. 112
[default0]:  num_channels .................................... 3
[default0]:  num_classes ..................................... 1000
[default0]:  num_layers ...................................... 70
[default0]:  num_layers_per_virtual_pipeline_stage ........... None
[default0]:  num_workers ..................................... 2
[default0]:  onnx_safe ....................................... None
[default0]:  openai_gelu ..................................... False
[default0]:  optimizer ....................................... adam
[default0]:  override_lr_scheduler ........................... False
[default0]:  pad_vocab_size_to ............................... 250880
[default0]:  params_dtype .................................... torch.bfloat16
[default0]:  partition_activations ........................... False
[default0]:  patch_dim ....................................... 16
[default0]:  pipeline_model_parallel_size .................... 12
[default0]:  position_embedding_type ......................... PositionEmbeddingType.alibi
[default0]:  pp_partition_method ............................. type:transformer|embedding
[default0]:  profile_backward ................................ False
[default0]:  query_in_block_prob ............................. 0.1
[default0]:  rampup_batch_size ............................... ['16', '16', '9_765_625']
[default0]:  rank ............................................ 0
[default0]:  remote_device ................................... none
[default0]:  reset_attention_mask ............................ False
[default0]:  reset_position_ids .............................. False
[default0]:  retriever_report_topk_accuracies ................ []
[default0]:  retriever_score_scaling ......................... False
[default0]:  retriever_seq_length ............................ 256
[default0]:  reweight_loss_based_on_position_frequency ....... False
[default0]:  sample_rate ..................................... 1.0
[default0]:  save ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:  save_interval ................................... 50
[default0]:  scatter_gather_tensors_in_pipeline .............. True
[default0]:  scattered_embeddings ............................ False
[default0]:  seed ............................................ 42
[default0]:  seq_length ...................................... 2048
[default0]:  sgd_momentum .................................... 0.9
[default0]:  short_seq_prob .................................. 0.1
[default0]:  skip_train_iteration_range ...................... None
[default0]:  split ........................................... None
[default0]:  split_transformers .............................. False
[default0]:  synchronize_each_layer .......................... False
[default0]:  tensor_model_parallel_size ...................... 4
[default0]:  tensorboard_dir ................................. /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/tr11-176B-ml-logs/tensorboard
[default0]:  tensorboard_log_interval ........................ 1
[default0]:  tensorboard_queue_size .......................... 5
[default0]:  test_weighted_split_names ....................... ['test']
[default0]:  test_weighted_split_paths ....................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  test_weighted_split_paths_path .................. None
[default0]:  test_weighted_split_splits ...................... [['0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0']]
[default0]:  test_weighted_split_weights ..................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  tile_factor ..................................... 1
[default0]:  titles_data_path ................................ None
[default0]:  tokenizer_name_or_path .......................... bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k
[default0]:  tokenizer_type .................................. PretrainedFromHF
[default0]:  train_iters ..................................... None
[default0]:  train_samples ................................... 220000000
[default0]:  train_tokens .................................... None
[default0]:  train_weighted_split_names ...................... ['train']
[default0]:  train_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  train_weighted_split_paths_path ................. None
[default0]:  train_weighted_split_splits ..................... [['0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949']]
[default0]:  train_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  use_bnb_optimizer ............................... False
[default0]:  use_checkpoint_lr_scheduler ..................... False
[default0]:  use_contiguous_buffers_in_ddp ................... True
[default0]:  use_cpu_initialization .......................... None
[default0]:  use_one_sent_docs ............................... False
[default0]:  use_pin_memory .................................. False
[default0]:  valid_weighted_split_names ...................... ['valid']
[default0]:  valid_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  valid_weighted_split_paths_path ................. None
[default0]:  valid_weighted_split_splits ..................... [['0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999']]
[default0]:  valid_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  virtual_pipeline_model_parallel_size ............ None
[default0]:  vocab_extra_ids ................................. 0
[default0]:  vocab_file ...................................... None
[default0]:  weight_decay .................................... 0.1
[default0]:  world_size ...................................... 384
[default0]:  zero_allgather_bucket_size ...................... 0.0
[default0]:  zero_contigious_gradients ....................... False
[default0]:  zero_reduce_bucket_size ......................... 0.0
[default0]:  zero_reduce_scatter ............................. False
[default0]:  zero_stage ...................................... 0
[default0]:-------------------- end of arguments ---------------------
[default0]:will use batch size rampup starting from global batch size 16 to global batch size 2048 with batch size increments 16 over 9765625 samples.
[default0]:> building PretrainedFromHF tokenizer ...
[default0]: vocab file is un-used. loading tokenizer from pre-trained model
[default0]:Offline mode: forcing local_files_only=True
[default0]:Offline mode: forcing local_files_only=True
[default0]:Can't load following files from cache: ['added_tokens_file'] and cannot check if these files are necessary for the tokenizer to operate.
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/special_tokens_map.json from cache at /gpfswork/rech/six/commun/models/b0b3428eb9bea3ef62a6e9983742117e4860f4ec1af66eebce1702b8ec7cb364.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer_config.json from cache at /gpfswork/rech/six/commun/models/31fb66a88196017b3a12c4798e55bcf8a11b312b42dd9429c83f7237c0a8a807.e683c1a11fe6388761e34fd7cddbcd77f3552cefb70e9aca4a4cc72c027c8f40
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer.json from cache at /gpfswork/rech/six/commun/models/b28b4c1d8aed4c72b765cce6a9a7ce8c5460d05a5b4ea6fa5855dff6a721d171.397b0d7316cb89fa15f0bebce2bd6c5e71e92a14e95de167940173a60253b03e
[default0]: > padded vocab (size: 250680) with 200 dummy tokens (new size: 250880)
[default0]:DeepSpeed general environment info:
[default0]:torch install path ............... ['/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch']
[default0]:torch version .................... 1.11.0+cu115
[default0]:torch cuda version ............... 11.5
[default0]:nvcc version ..................... 11.4
[default0]:deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed']
[default0]:deepspeed info ................... 0.6.0+ed26ef4, ed26ef4, olruwase/bf16-updates
[default0]:deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5
[default0]:**** Git info for Megatron: git_hash=0415583 git_branch=sync-meg-lm ****
[default0]:> initializing torch distributed ...
[default7]:> setting tensorboard ...
[default0]:> initializing tensor model parallel with size 4
[default0]:> initializing pipeline model parallel with size 12
[default0]:> setting random seeds to 42 ...
[default0]:[2022-03-03 05:45:00,513] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2760 and data parallel seed: 42
[default0]:> compiling dataset index builder ...
[default0]:make: Entering directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data'
[default0]:make: Nothing to be done for 'default'.
[default0]:make: Leaving directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data'
[default0]:>>> done with dataset index builder. Compilation time: 0.106 seconds
[default0]:> compiling and loading fused kernels ...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module scaled_upper_triang_masked_softmax_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module scaled_upper_triang_masked_softmax_cuda...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module scaled_masked_softmax_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module scaled_masked_softmax_cuda...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module fused_mix_prec_layer_norm_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module fused_mix_prec_layer_norm_cuda...
[default0]:>>> done with compiling and loading fused kernels. Compilation time: 8.876 seconds
[default0]:time to initialize megatron (seconds): 85.498
[default0]:[after megatron is initialized] datetime: 2022-03-03 05:45:09 
[default0]:building GPT model ...
[default0]:[2022-03-03 05:45:09,538] [INFO] [utils.py:828:see_memory_usage] Before Building Model
[default0]:[2022-03-03 05:45:09,539] [INFO] [utils.py:829:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[default0]:[2022-03-03 05:45:09,539] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.16 GB, percent = 8.6%
[default0]:SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
[default0]:Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=0, model=1): 1, ProcessCoord(pipe=0, data=0, model=2): 2, ProcessCoord(pipe=0, data=0, model=3): 3, ProcessCoord(pipe=0, data=1, model=0): 4, ProcessCoord(pipe=0, data=1, model=1): 5, ProcessCoord(pipe=0, data=1, model=2): 6, ProcessCoord(pipe=0, data=1, model=3): 7, ProcessCoord(pipe=0, data=2, model=0): 8, ProcessCoord(pipe=0, data=2, model=1): 9, ProcessCoord(pipe=0, data=2, model=2): 10, ProcessCoord(pipe=0, data=2, model=3): 11, ProcessCoord(pipe=0, data=3, model=0): 12, ProcessCoord(pipe=0, data=3, model=1): 13, ProcessCoord(pipe=0, data=3, model=2): 14, ProcessCoord(pipe=0, data=3, model=3): 15, ProcessCoord(pipe=0, data=4, model=0): 16, ProcessCoord(pipe=0, data=4, model=1): 17, ProcessCoord(pipe=0, data=4, model=2): 18, ProcessCoord(pipe=0, data=4, model=3): 19, ProcessCoord(pipe=0, data=5, model=0): 20, ProcessCoord(pipe=0, data=5, model=1): 21, ProcessCoord(pipe=0, data=5, model=2): 22, ProcessCoord(pipe=0, data=5, model=3): 23, ProcessCoord(pipe=0, data=6, model=0): 24, ProcessCoord(pipe=0, data=6, model=1): 25, ProcessCoord(pipe=0, data=6, model=2): 26, ProcessCoord(pipe=0, data=6, model=3): 27, ProcessCoord(pipe=0, data=7, model=0): 28, ProcessCoord(pipe=0, data=7, model=1): 29, ProcessCoord(pipe=0, data=7, model=2): 30, ProcessCoord(pipe=0, data=7, model=3): 31, ProcessCoord(pipe=1, data=0, model=0): 32, ProcessCoord(pipe=1, data=0, model=1): 33, ProcessCoord(pipe=1, data=0, model=2): 34, ProcessCoord(pipe=1, data=0, model=3): 35, ProcessCoord(pipe=1, data=1, model=0): 36, ProcessCoord(pipe=1, data=1, model=1): 37, ProcessCoord(pipe=1, data=1, model=2): 38, ProcessCoord(pipe=1, data=1, model=3): 39, ProcessCoord(pipe=1, data=2, model=0): 40, ProcessCoord(pipe=1, data=2, model=1): 41, ProcessCoord(pipe=1, data=2, model=2): 42, ProcessCoord(pipe=1, data=2, model=3): 43, ProcessCoord(pipe=1, data=3, model=0): 44, ProcessCoord(pipe=1, data=3, model=1): 45, 
ProcessCoord(pipe=1, data=3, model=2): 46, ProcessCoord(pipe=1, data=3, model=3): 47, ProcessCoord(pipe=1, data=4, model=0): 48, ProcessCoord(pipe=1, data=4, model=1): 49, ProcessCoord(pipe=1, data=4, model=2): 50, ProcessCoord(pipe=1, data=4, model=3): 51, ProcessCoord(pipe=1, data=5, model=0): 52, ProcessCoord(pipe=1, data=5, model=1): 53, ProcessCoord(pipe=1, data=5, model=2): 54, ProcessCoord(pipe=1, data=5, model=3): 55, ProcessCoord(pipe=1, data=6, model=0): 56, ProcessCoord(pipe=1, data=6, model=1): 57, ProcessCoord(pipe=1, data=6, model=2): 58, ProcessCoord(pipe=1, data=6, model=3): 59, ProcessCoord(pipe=1, data=7, model=0): 60, ProcessCoord(pipe=1, data=7, model=1): 61, ProcessCoord(pipe=1, data=7, model=2): 62, ProcessCoord(pipe=1, data=7, model=3): 63, ProcessCoord(pipe=2, data=0, model=0): 64, ProcessCoord(pipe=2, data=0, model=1): 65, ProcessCoord(pipe=2, data=0, model=2): 66, ProcessCoord(pipe=2, data=0, model=3): 67, ProcessCoord(pipe=2, data=1, model=0): 68, ProcessCoord(pipe=2, data=1, model=1): 69, ProcessCoord(pipe=2, data=1, model=2): 70, ProcessCoord(pipe=2, data=1, model=3): 71, ProcessCoord(pipe=2, data=2, model=0): 72, ProcessCoord(pipe=2, data=2, model=1): 73, ProcessCoord(pipe=2, data=2, model=2): 74, ProcessCoord(pipe=2, data=2, model=3): 75, ProcessCoord(pipe=2, data=3, model=0): 76, ProcessCoord(pipe=2, data=3, model=1): 77, ProcessCoord(pipe=2, data=3, model=2): 78, ProcessCoord(pipe=2, data=3, model=3): 79, ProcessCoord(pipe=2, data=4, model=0): 80, ProcessCoord(pipe=2, data=4, model=1): 81, ProcessCoord(pipe=2, data=4, model=2): 82, ProcessCoord(pipe=2, data=4, model=3): 83, ProcessCoord(pipe=2, data=5, model=0): 84, ProcessCoord(pipe=2, data=5, model=1): 85, ProcessCoord(pipe=2, data=5, model=2): 86, ProcessCoord(pipe=2, data=5, model=3): 87, ProcessCoord(pipe=2, data=6, model=0): 88, ProcessCoord(pipe=2, data=6, model=1): 89, ProcessCoord(pipe=2, data=6, model=2): 90, ProcessCoord(pipe=2, data=6, model=3): 91, ProcessCoord(pipe=2, 
data=7, model=0): 92, ProcessCoord(pipe=2, data=7, model=1): 93, ProcessCoord(pipe=2, data=7, model=2): 94, ProcessCoord(pipe=2, data=7, model=3): 95, ProcessCoord(pipe=3, data=0, model=0): 96, ProcessCoord(pipe=3, data=0, model=1): 97, ProcessCoord(pipe=3, data=0, model=2): 98, ProcessCoord(pipe=3, data=0, model=3): 99, ProcessCoord(pipe=3, data=1, model=0): 100, ProcessCoord(pipe=3, data=1, model=1): 101, ProcessCoord(pipe=3, data=1, model=2): 102, ProcessCoord(pipe=3, data=1, model=3): 103, ProcessCoord(pipe=3, data=2, model=0): 104, ProcessCoord(pipe=3, data=2, model=1): 105, ProcessCoord(pipe=3, data=2, model=2): 106, ProcessCoord(pipe=3, data=2, model=3): 107, ProcessCoord(pipe=3, data=3, model=0): 108, ProcessCoord(pipe=3, data=3, model=1): 109, ProcessCoord(pipe=3, data=3, model=2): 110, ProcessCoord(pipe=3, data=3, model=3): 111, ProcessCoord(pipe=3, data=4, model=0): 112, ProcessCoord(pipe=3, data=4, model=1): 113, ProcessCoord(pipe=3, data=4, model=2): 114, ProcessCoord(pipe=3, data=4, model=3): 115, ProcessCoord(pipe=3, data=5, model=0): 116, ProcessCoord(pipe=3, data=5, model=1): 117, ProcessCoord(pipe=3, data=5, model=2): 118, ProcessCoord(pipe=3, data=5, model=3): 119, ProcessCoord(pipe=3, data=6, model=0): 120, ProcessCoord(pipe=3, data=6, model=1): 121, ProcessCoord(pipe=3, data=6, model=2): 122, ProcessCoord(pipe=3, data=6, model=3): 123, ProcessCoord(pipe=3, data=7, model=0): 124, ProcessCoord(pipe=3, data=7, model=1): 125, ProcessCoord(pipe=3, data=7, model=2): 126, ProcessCoord(pipe=3, data=7, model=3): 127, ProcessCoord(pipe=4, data=0, model=0): 128, ProcessCoord(pipe=4, data=0, model=1): 129, ProcessCoord(pipe=4, data=0, model=2): 130, ProcessCoord(pipe=4, data=0, model=3): 131, ProcessCoord(pipe=4, data=1, model=0): 132, ProcessCoord(pipe=4, data=1, model=1): 133, ProcessCoord(pipe=4, data=1, model=2): 134, ProcessCoord(pipe=4, data=1, model=3): 135, ProcessCoord(pipe=4, data=2, model=0): 136, ProcessCoord(pipe=4, data=2, model=1): 137, 
ProcessCoord(pipe=4, data=2, model=2): 138, ProcessCoord(pipe=4, data=2, model=3): 139, ProcessCoord(pipe=4, data=3, model=0): 140, ProcessCoord(pipe=4, data=3, model=1): 141, ProcessCoord(pipe=4, data=3, model=2): 142, ProcessCoord(pipe=4, data=3, model=3): 143, ProcessCoord(pipe=4, data=4, model=0): 144, ProcessCoord(pipe=4, data=4, model=1): 145, ProcessCoord(pipe=4, data=4, model=2): 146, ProcessCoord(pipe=4, data=4, model=3): 147, ProcessCoord(pipe=4, data=5, model=0): 148, ProcessCoord(pipe=4, data=5, model=1): 149, ProcessCoord(pipe=4, data=5, model=2): 150, ProcessCoord(pipe=4, data=5, model=3): 151, ProcessCoord(pipe=4, data=6, model=0): 152, ProcessCoord(pipe=4, data=6, model=1): 153, ProcessCoord(pipe=4, data=6, model=2): 154, ProcessCoord(pipe=4, data=6, model=3): 155, ProcessCoord(pipe=4, data=7, model=0): 156, ProcessCoord(pipe=4, data=7, model=1): 157, ProcessCoord(pipe=4, data=7, model=2): 158, ProcessCoord(pipe=4, data=7, model=3): 159, ProcessCoord(pipe=5, data=0, model=0): 160, ProcessCoord(pipe=5, data=0, model=1): 161, ProcessCoord(pipe=5, data=0, model=2): 162, ProcessCoord(pipe=5, data=0, model=3): 163, ProcessCoord(pipe=5, data=1, model=0): 164, ProcessCoord(pipe=5, data=1, model=1): 165, ProcessCoord(pipe=5, data=1, model=2): 166, ProcessCoord(pipe=5, data=1, model=3): 167, ProcessCoord(pipe=5, data=2, model=0): 168, ProcessCoord(pipe=5, data=2, model=1): 169, ProcessCoord(pipe=5, data=2, model=2): 170, ProcessCoord(pipe=5, data=2, model=3): 171, ProcessCoord(pipe=5, data=3, model=0): 172, ProcessCoord(pipe=5, data=3, model=1): 173, ProcessCoord(pipe=5, data=3, model=2): 174, ProcessCoord(pipe=5, data=3, model=3): 175, ProcessCoord(pipe=5, data=4, model=0): 176, ProcessCoord(pipe=5, data=4, model=1): 177, ProcessCoord(pipe=5, data=4, model=2): 178, ProcessCoord(pipe=5, data=4, model=3): 179, ProcessCoord(pipe=5, data=5, model=0): 180, ProcessCoord(pipe=5, data=5, model=1): 181, ProcessCoord(pipe=5, data=5, model=2): 182, 
ProcessCoord(pipe=5, data=5, model=3): 183, ProcessCoord(pipe=5, data=6, model=0): 184, ProcessCoord(pipe=5, data=6, model=1): 185, ProcessCoord(pipe=5, data=6, model=2): 186, ProcessCoord(pipe=5, data=6, model=3): 187, ProcessCoord(pipe=5, data=7, model=0): 188, ProcessCoord(pipe=5, data=7, model=1): 189, ProcessCoord(pipe=5, data=7, model=2): 190, ProcessCoord(pipe=5, data=7, model=3): 191, ProcessCoord(pipe=6, data=0, model=0): 192, ProcessCoord(pipe=6, data=0, model=1): 193, ProcessCoord(pipe=6, data=0, model=2): 194, ProcessCoord(pipe=6, data=0, model=3): 195, ProcessCoord(pipe=6, data=1, model=0): 196, ProcessCoord(pipe=6, data=1, model=1): 197, ProcessCoord(pipe=6, data=1, model=2): 198, ProcessCoord(pipe=6, data=1, model=3): 199, ProcessCoord(pipe=6, data=2, model=0): 200, ProcessCoord(pipe=6, data=2, model=1): 201, ProcessCoord(pipe=6, data=2, model=2): 202, ProcessCoord(pipe=6, data=2, model=3): 203, ProcessCoord(pipe=6, data=3, model=0): 204, ProcessCoord(pipe=6, data=3, model=1): 205, ProcessCoord(pipe=6, data=3, model=2): 206, ProcessCoord(pipe=6, data=3, model=3): 207, ProcessCoord(pipe=6, data=4, model=0): 208, ProcessCoord(pipe=6, data=4, model=1): 209, ProcessCoord(pipe=6, data=4, model=2): 210, ProcessCoord(pipe=6, data=4, model=3): 211, ProcessCoord(pipe=6, data=5, model=0): 212, ProcessCoord(pipe=6, data=5, model=1): 213, ProcessCoord(pipe=6, data=5, model=2): 214, ProcessCoord(pipe=6, data=5, model=3): 215, ProcessCoord(pipe=6, data=6, model=0): 216, ProcessCoord(pipe=6, data=6, model=1): 217, ProcessCoord(pipe=6, data=6, model=2): 218, ProcessCoord(pipe=6, data=6, model=3): 219, ProcessCoord(pipe=6, data=7, model=0): 220, ProcessCoord(pipe=6, data=7, model=1): 221, ProcessCoord(pipe=6, data=7, model=2): 222, ProcessCoord(pipe=6, data=7, model=3): 223, ProcessCoord(pipe=7, data=0, model=0): 224, ProcessCoord(pipe=7, data=0, model=1): 225, ProcessCoord(pipe=7, data=0, model=2): 226, ProcessCoord(pipe=7, data=0, model=3): 227, 
ProcessCoord(pipe=7, data=1, model=0): 228, ProcessCoord(pipe=7, data=1, model=1): 229, ProcessCoord(pipe=7, data=1, model=2): 230, ProcessCoord(pipe=7, data=1, model=3): 231, ProcessCoord(pipe=7, data=2, model=0): 232, ProcessCoord(pipe=7, data=2, model=1): 233, ProcessCoord(pipe=7, data=2, model=2): 234, ProcessCoord(pipe=7, data=2, model=3): 235, ProcessCoord(pipe=7, data=3, model=0): 236, ProcessCoord(pipe=7, data=3, model=1): 237, ProcessCoord(pipe=7, data=3, model=2): 238, ProcessCoord(pipe=7, data=3, model=3): 239, ProcessCoord(pipe=7, data=4, model=0): 240, ProcessCoord(pipe=7, data=4, model=1): 241, ProcessCoord(pipe=7, data=4, model=2): 242, ProcessCoord(pipe=7, data=4, model=3): 243, ProcessCoord(pipe=7, data=5, model=0): 244, ProcessCoord(pipe=7, data=5, model=1): 245, ProcessCoord(pipe=7, data=5, model=2): 246, ProcessCoord(pipe=7, data=5, model=3): 247, ProcessCoord(pipe=7, data=6, model=0): 248, ProcessCoord(pipe=7, data=6, model=1): 249, ProcessCoord(pipe=7, data=6, model=2): 250, ProcessCoord(pipe=7, data=6, model=3): 251, ProcessCoord(pipe=7, data=7, model=0): 252, ProcessCoord(pipe=7, data=7, model=1): 253, ProcessCoord(pipe=7, data=7, model=2): 254, ProcessCoord(pipe=7, data=7, model=3): 255, ProcessCoord(pipe=8, data=0, model=0): 256, ProcessCoord(pipe=8, data=0, model=1): 257, ProcessCoord(pipe=8, data=0, model=2): 258, ProcessCoord(pipe=8, data=0, model=3): 259, ProcessCoord(pipe=8, data=1, model=0): 260, ProcessCoord(pipe=8, data=1, model=1): 261, ProcessCoord(pipe=8, data=1, model=2): 262, ProcessCoord(pipe=8, data=1, model=3): 263, ProcessCoord(pipe=8, data=2, model=0): 264, ProcessCoord(pipe=8, data=2, model=1): 265, ProcessCoord(pipe=8, data=2, model=2): 266, ProcessCoord(pipe=8, data=2, model=3): 267, ProcessCoord(pipe=8, data=3, model=0): 268, ProcessCoord(pipe=8, data=3, model=1): 269, ProcessCoord(pipe=8, data=3, model=2): 270, ProcessCoord(pipe=8, data=3, model=3): 271, ProcessCoord(pipe=8, data=4, model=0): 272, 
ProcessCoord(pipe=8, data=4, model=1): 273, ProcessCoord(pipe=8, data=4, model=2): 274, ProcessCoord(pipe=8, data=4, model=3): 275, ProcessCoord(pipe=8, data=5, model=0): 276, ProcessCoord(pipe=8, data=5, model=1): 277, ProcessCoord(pipe=8, data=5, model=2): 278, ProcessCoord(pipe=8, data=5, model=3): 279, ProcessCoord(pipe=8, data=6, model=0): 280, ProcessCoord(pipe=8, data=6, model=1): 281, ProcessCoord(pipe=8, data=6, model=2): 282, ProcessCoord(pipe=8, data=6, model=3): 283, ProcessCoord(pipe=8, data=7, model=0): 284, ProcessCoord(pipe=8, data=7, model=1): 285, ProcessCoord(pipe=8, data=7, model=2): 286, ProcessCoord(pipe=8, data=7, model=3): 287, ProcessCoord(pipe=9, data=0, model=0): 288, ProcessCoord(pipe=9, data=0, model=1): 289, ProcessCoord(pipe=9, data=0, model=2): 290, ProcessCoord(pipe=9, data=0, model=3): 291, ProcessCoord(pipe=9, data=1, model=0): 292, ProcessCoord(pipe=9, data=1, model=1): 293, ProcessCoord(pipe=9, data=1, model=2): 294, ProcessCoord(pipe=9, data=1, model=3): 295, ProcessCoord(pipe=9, data=2, model=0): 296, ProcessCoord(pipe=9, data=2, model=1): 297, ProcessCoord(pipe=9, data=2, model=2): 298, ProcessCoord(pipe=9, data=2, model=3): 299, ProcessCoord(pipe=9, data=3, model=0): 300, ProcessCoord(pipe=9, data=3, model=1): 301, ProcessCoord(pipe=9, data=3, model=2): 302, ProcessCoord(pipe=9, data=3, model=3): 303, ProcessCoord(pipe=9, data=4, model=0): 304, ProcessCoord(pipe=9, data=4, model=1): 305, ProcessCoord(pipe=9, data=4, model=2): 306, ProcessCoord(pipe=9, data=4, model=3): 307, ProcessCoord(pipe=9, data=5, model=0): 308, ProcessCoord(pipe=9, data=5, model=1): 309, ProcessCoord(pipe=9, data=5, model=2): 310, ProcessCoord(pipe=9, data=5, model=3): 311, ProcessCoord(pipe=9, data=6, model=0): 312, ProcessCoord(pipe=9, data=6, model=1): 313, ProcessCoord(pipe=9, data=6, model=2): 314, ProcessCoord(pipe=9, data=6, model=3): 315, ProcessCoord(pipe=9, data=7, model=0): 316, ProcessCoord(pipe=9, data=7, model=1): 317, 
ProcessCoord(pipe=9, data=7, model=2): 318, ProcessCoord(pipe=9, data=7, model=3): 319, ProcessCoord(pipe=10, data=0, model=0): 320, ProcessCoord(pipe=10, data=0, model=1): 321, ProcessCoord(pipe=10, data=0, model=2): 322, ProcessCoord(pipe=10, data=0, model=3): 323, ProcessCoord(pipe=10, data=1, model=0): 324, ProcessCoord(pipe=10, data=1, model=1): 325, ProcessCoord(pipe=10, data=1, model=2): 326, ProcessCoord(pipe=10, data=1, model=3): 327, ProcessCoord(pipe=10, data=2, model=0): 328, ProcessCoord(pipe=10, data=2, model=1): 329, ProcessCoord(pipe=10, data=2, model=2): 330, ProcessCoord(pipe=10, data=2, model=3): 331, ProcessCoord(pipe=10, data=3, model=0): 332, ProcessCoord(pipe=10, data=3, model=1): 333, ProcessCoord(pipe=10, data=3, model=2): 334, ProcessCoord(pipe=10, data=3, model=3): 335, ProcessCoord(pipe=10, data=4, model=0): 336, ProcessCoord(pipe=10, data=4, model=1): 337, ProcessCoord(pipe=10, data=4, model=2): 338, ProcessCoord(pipe=10, data=4, model=3): 339, ProcessCoord(pipe=10, data=5, model=0): 340, ProcessCoord(pipe=10, data=5, model=1): 341, ProcessCoord(pipe=10, data=5, model=2): 342, ProcessCoord(pipe=10, data=5, model=3): 343, ProcessCoord(pipe=10, data=6, model=0): 344, ProcessCoord(pipe=10, data=6, model=1): 345, ProcessCoord(pipe=10, data=6, model=2): 346, ProcessCoord(pipe=10, data=6, model=3): 347, ProcessCoord(pipe=10, data=7, model=0): 348, ProcessCoord(pipe=10, data=7, model=1): 349, ProcessCoord(pipe=10, data=7, model=2): 350, ProcessCoord(pipe=10, data=7, model=3): 351, ProcessCoord(pipe=11, data=0, model=0): 352, ProcessCoord(pipe=11, data=0, model=1): 353, ProcessCoord(pipe=11, data=0, model=2): 354, ProcessCoord(pipe=11, data=0, model=3): 355, ProcessCoord(pipe=11, data=1, model=0): 356, ProcessCoord(pipe=11, data=1, model=1): 357, ProcessCoord(pipe=11, data=1, model=2): 358, ProcessCoord(pipe=11, data=1, model=3): 359, ProcessCoord(pipe=11, data=2, model=0): 360, ProcessCoord(pipe=11, data=2, model=1): 361, ProcessCoord(pipe=11, 
data=2, model=2): 362, ProcessCoord(pipe=11, data=2, model=3): 363, ProcessCoord(pipe=11, data=3, model=0): 364, ProcessCoord(pipe=11, data=3, model=1): 365, ProcessCoord(pipe=11, data=3, model=2): 366, ProcessCoord(pipe=11, data=3, model=3): 367, ProcessCoord(pipe=11, data=4, model=0): 368, ProcessCoord(pipe=11, data=4, model=1): 369, ProcessCoord(pipe=11, data=4, model=2): 370, ProcessCoord(pipe=11, data=4, model=3): 371, ProcessCoord(pipe=11, data=5, model=0): 372, ProcessCoord(pipe=11, data=5, model=1): 373, ProcessCoord(pipe=11, data=5, model=2): 374, ProcessCoord(pipe=11, data=5, model=3): 375, ProcessCoord(pipe=11, data=6, model=0): 376, ProcessCoord(pipe=11, data=6, model=1): 377, ProcessCoord(pipe=11, data=6, model=2): 378, ProcessCoord(pipe=11, data=6, model=3): 379, ProcessCoord(pipe=11, data=7, model=0): 380, ProcessCoord(pipe=11, data=7, model=1): 381, ProcessCoord(pipe=11, data=7, model=2): 382, ProcessCoord(pipe=11, data=7, model=3): 383}
[default0]:[2022-03-03 05:45:11,534] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method type:transformer|embedding
[default0]:stage=0 layers=8
[default0]:     0: _to_float16
[default0]:     1: EmbeddingPipe
[default0]:     2: <lambda>
[default0]:     3: ParallelTransformerLayerPipe
[default0]:     4: ParallelTransformerLayerPipe
[default0]:     5: ParallelTransformerLayerPipe
[default0]:     6: ParallelTransformerLayerPipe
[default0]:     7: ParallelTransformerLayerPipe
[default0]:stage=1 layers=6
[default0]:     8: ParallelTransformerLayerPipe
[default0]:     9: ParallelTransformerLayerPipe
[default0]:    10: ParallelTransformerLayerPipe
[default0]:    11: ParallelTransformerLayerPipe
[default0]:    12: ParallelTransformerLayerPipe
[default0]:    13: ParallelTransformerLayerPipe
[default0]:stage=2 layers=6
[default0]:    14: ParallelTransformerLayerPipe
[default0]:    15: ParallelTransformerLayerPipe
[default0]:    16: ParallelTransformerLayerPipe
[default0]:    17: ParallelTransformerLayerPipe
[default0]:    18: ParallelTransformerLayerPipe
[default0]:    19: ParallelTransformerLayerPipe
[default0]:stage=3 layers=6
[default0]:    20: ParallelTransformerLayerPipe
[default0]:    21: ParallelTransformerLayerPipe
[default0]:    22: ParallelTransformerLayerPipe
[default0]:    23: ParallelTransformerLayerPipe
[default0]:    24: ParallelTransformerLayerPipe
[default0]:    25: ParallelTransformerLayerPipe
[default0]:stage=4 layers=6
[default0]:    26: ParallelTransformerLayerPipe
[default0]:    27: ParallelTransformerLayerPipe
[default0]:    28: ParallelTransformerLayerPipe
[default0]:    29: ParallelTransformerLayerPipe
[default0]:    30: ParallelTransformerLayerPipe
[default0]:    31: ParallelTransformerLayerPipe
[default0]:stage=5 layers=6
[default0]:    32: ParallelTransformerLayerPipe
[default0]:    33: ParallelTransformerLayerPipe
[default0]:    34: ParallelTransformerLayerPipe
[default0]:    35: ParallelTransformerLayerPipe
[default0]:    36: ParallelTransformerLayerPipe
[default0]:    37: ParallelTransformerLayerPipe
[default0]:stage=6 layers=6
[default0]:    38: ParallelTransformerLayerPipe
[default0]:    39: ParallelTransformerLayerPipe
[default0]:    40: ParallelTransformerLayerPipe
[default0]:    41: ParallelTransformerLayerPipe
[default0]:    42: ParallelTransformerLayerPipe
[default0]:    43: ParallelTransformerLayerPipe
[default0]:stage=7 layers=6
[default0]:    44: ParallelTransformerLayerPipe
[default0]:    45: ParallelTransformerLayerPipe
[default0]:    46: ParallelTransformerLayerPipe
[default0]:    47: ParallelTransformerLayerPipe
[default0]:    48: ParallelTransformerLayerPipe
[default0]:    49: ParallelTransformerLayerPipe
[default0]:stage=8 layers=6
[default0]:    50: ParallelTransformerLayerPipe
[default0]:    51: ParallelTransformerLayerPipe
[default0]:    52: ParallelTransformerLayerPipe
[default0]:    53: ParallelTransformerLayerPipe
[default0]:    54: ParallelTransformerLayerPipe
[default0]:    55: ParallelTransformerLayerPipe
[default0]:stage=9 layers=6
[default0]:    56: ParallelTransformerLayerPipe
[default0]:    57: ParallelTransformerLayerPipe
[default0]:    58: ParallelTransformerLayerPipe
[default0]:    59: ParallelTransformerLayerPipe
[default0]:    60: ParallelTransformerLayerPipe
[default0]:    61: ParallelTransformerLayerPipe
[default0]:stage=10 layers=6
[default0]:    62: ParallelTransformerLayerPipe
[default0]:    63: ParallelTransformerLayerPipe
[default0]:    64: ParallelTransformerLayerPipe
[default0]:    65: ParallelTransformerLayerPipe
[default0]:    66: ParallelTransformerLayerPipe
[default0]:    67: ParallelTransformerLayerPipe
[default0]:stage=11 layers=9
[default0]:    68: ParallelTransformerLayerPipe
[default0]:    69: ParallelTransformerLayerPipe
[default0]:    70: ParallelTransformerLayerPipe
[default0]:    71: ParallelTransformerLayerPipe
[default0]:    72: ParallelTransformerLayerPipe
[default0]:    73: <lambda>
[default0]:    74: MixedFusedLayerNorm
[default0]:    75: EmbeddingPipe
[default0]:    76: float16_to_fp32
[default0]:  loss: CrossEntropy
[default0]:[2022-03-03 05:45:12,761] [INFO] [utils.py:828:see_memory_usage] After Building Model
[default0]:[2022-03-03 05:45:12,761] [INFO] [utils.py:829:see_memory_usage] MA 7.43 GB         Max_MA 7.43 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-03 05:45:12,762] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.6 GB, percent = 8.7%
[default0]:setting training iterations to 128728
[default0]:> learning rate decay style: cosine
[default0]:DeepSpeed is enabled.
[default0]:[2022-03-03 05:45:12,782] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.0+ed26ef4, git-hash=ed26ef4, git-branch=olruwase/bf16-updates
[default0]:[2022-03-03 05:45:14,566] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[default0]:[2022-03-03 05:45:14,567] [INFO] [engine.py:1092:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[default0]:[2022-03-03 05:45:14,567] [INFO] [engine.py:1098:_configure_optimizer] Using client Optimizer as basic optimizer
[default0]:[2022-03-03 05:45:14,567] [INFO] [engine.py:1114:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
[default0]:[2022-03-03 05:45:14,567] [INFO] [engine.py:1328:_configure_bf16_optimizer] Creating unfused BF16 optimizer
[default0]:[2022-03-03 05:45:14,602] [INFO] [utils.py:828:see_memory_usage] begin bf16_optimizer
[default0]:[2022-03-03 05:45:14,603] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB         Max_MA 7.43 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-03 05:45:14,603] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 05:45:14,624] [INFO] [utils.py:828:see_memory_usage] before initializing group 0
[default0]:[2022-03-03 05:45:14,625] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB         Max_MA 7.42 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-03 05:45:14,625] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 05:45:14,675] [INFO] [utils.py:828:see_memory_usage] after initializing group 0
[default0]:[2022-03-03 05:45:14,675] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB         Max_MA 17.01 GB         CA 20.23 GB         Max_CA 20 GB 
[default0]:[2022-03-03 05:45:14,675] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 05:45:14,696] [INFO] [utils.py:828:see_memory_usage] before initializing group 1
[default0]:[2022-03-03 05:45:14,696] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB         Max_MA 17.01 GB         CA 20.23 GB         Max_CA 20 GB 
[default0]:[2022-03-03 05:45:14,696] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 05:45:14,738] [INFO] [utils.py:828:see_memory_usage] after initializing group 1
[default0]:[2022-03-03 05:45:14,739] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB         Max_MA 24.11 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-03 05:45:14,739] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.96 GB, percent = 8.7%
[default0]:[2022-03-03 05:45:14,760] [INFO] [utils.py:828:see_memory_usage] before initializing group 2
[default0]:[2022-03-03 05:45:14,760] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB         Max_MA 24.11 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-03 05:45:14,760] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.96 GB, percent = 8.7%
[default0]:[2022-03-03 05:45:14,782] [INFO] [utils.py:828:see_memory_usage] after initializing group 2
[default0]:[2022-03-03 05:45:14,783] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB         Max_MA 24.12 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-03 05:45:14,783] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.96 GB, percent = 8.7%
[default0]:[2022-03-03 05:45:14,804] [INFO] [utils.py:828:see_memory_usage] before initialize_optimizer
[default0]:[2022-03-03 05:45:14,804] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB         Max_MA 24.12 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-03 05:45:14,805] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.96 GB, percent = 8.7%
[default0]:[2022-03-03 05:45:14,851] [INFO] [utils.py:828:see_memory_usage] end initialize_optimizer
[default0]:[2022-03-03 05:45:14,852] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB         Max_MA 27.82 GB         CA 34.21 GB         Max_CA 34 GB 
[default0]:[2022-03-03 05:45:14,852] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.96 GB, percent = 8.7%
[default0]:[2022-03-03 05:45:14,872] [INFO] [utils.py:828:see_memory_usage] end bf16_optimizer
[default0]:[2022-03-03 05:45:14,873] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB         Max_MA 27.82 GB         CA 34.21 GB         Max_CA 34 GB 
[default0]:[2022-03-03 05:45:14,873] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.96 GB, percent = 8.7%
[default0]:[2022-03-03 05:45:14,873] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[default0]:[2022-03-03 05:45:14,873] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[default0]:[2022-03-03 05:45:14,873] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x149408ee1100>
[default0]:[2022-03-03 05:45:14,873] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
[default0]:[2022-03-03 05:45:14,873] [INFO] [config.py:1057:print] DeepSpeedEngine configuration:
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   activation_checkpointing_config  {
[default0]:    "partition_activations": false, 
[default0]:    "contiguous_memory_optimization": false, 
[default0]:    "cpu_checkpointing": false, 
[default0]:    "number_checkpoints": null, 
[default0]:    "synchronize_checkpoint_boundary": false, 
[default0]:    "profile": false
[default0]:}
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   amp_enabled .................. False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   amp_params ................... False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   autotuning_config ............ {
[default0]:    "enabled": false, 
[default0]:    "start_step": null, 
[default0]:    "end_step": null, 
[default0]:    "metric_path": null, 
[default0]:    "arg_mappings": null, 
[default0]:    "metric": "throughput", 
[default0]:    "model_info": null, 
[default0]:    "results_dir": null, 
[default0]:    "exps_dir": null, 
[default0]:    "overwrite": true, 
[default0]:    "fast": true, 
[default0]:    "start_profile_step": 3, 
[default0]:    "end_profile_step": 5, 
[default0]:    "tuner_type": "gridsearch", 
[default0]:    "tuner_early_stopping": 5, 
[default0]:    "tuner_num_trials": 50, 
[default0]:    "model_info_path": null, 
[default0]:    "mp_size": 1, 
[default0]:    "max_train_batch_size": null, 
[default0]:    "min_train_batch_size": 1, 
[default0]:    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
[default0]:    "min_train_micro_batch_size_per_gpu": 1, 
[default0]:    "num_tuning_micro_batch_sizes": 3
[default0]:}
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   bfloat16_enabled ............. True
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   checkpoint_tag_validation_enabled  True
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   checkpoint_tag_validation_fail  False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   communication_data_type ...... None
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   curriculum_enabled ........... False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   curriculum_params ............ False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   dataloader_drop_last ......... False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   disable_allgather ............ False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   dump_state ................... False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   dynamic_loss_scale_args ...... None
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   eigenvalue_enabled ........... False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   eigenvalue_gas_boundary_resolution  1
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   eigenvalue_layer_name ........ bert.encoder.layer
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   eigenvalue_layer_num ......... 0
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   eigenvalue_max_iter .......... 100
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   eigenvalue_stability ......... 1e-06
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   eigenvalue_tol ............... 0.01
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   eigenvalue_verbose ........... False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   elasticity_enabled ........... False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   flops_profiler_config ........ {
[default0]:    "enabled": false, 
[default0]:    "profile_step": 1, 
[default0]:    "module_depth": -1, 
[default0]:    "top_modules": 1, 
[default0]:    "detailed": true, 
[default0]:    "output_file": null
[default0]:}
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   fp16_enabled ................. False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   fp16_master_weights_and_gradients  False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   fp16_mixed_quantize .......... False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   global_rank .................. 0
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   gradient_accumulation_steps .. 128
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   gradient_clipping ............ 1.0
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   gradient_predivide_factor .... 1.0
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   initial_dynamic_scale ........ 1
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   loss_scale ................... 1.0
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   memory_breakdown ............. False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   optimizer_legacy_fusion ...... False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   optimizer_name ............... None
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   optimizer_params ............. None
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   pld_enabled .................. False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   pld_params ................... False
[default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print]   prescale_gradients ........... False
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   quantize_change_rate ......... 0.001
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   quantize_groups .............. 1
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   quantize_offset .............. 1000
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   quantize_period .............. 1000
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   quantize_rounding ............ 0
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   quantize_start_bits .......... 16
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   quantize_target_bits ......... 8
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   quantize_training_enabled .... False
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   quantize_type ................ 0
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   quantize_verbose ............. False
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   scheduler_name ............... None
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   scheduler_params ............. None
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   sparse_attention ............. None
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   sparse_gradients_enabled ..... False
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   steps_per_print .............. 2000
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   tensorboard_enabled .......... False
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   tensorboard_job_name ......... DeepSpeedJobName
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   tensorboard_output_path ...... 
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   train_batch_size ............. 2048
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   train_micro_batch_size_per_gpu  2
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   use_quantizer_kernel ......... False
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   wall_clock_breakdown ......... False
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   world_size ................... 8
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   zero_allow_untested_optimizer  False
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   zero_config .................. {
[default0]:    "stage": 0, 
[default0]:    "contiguous_gradients": true, 
[default0]:    "reduce_scatter": true, 
[default0]:    "reduce_bucket_size": 5.000000e+08, 
[default0]:    "allgather_partitions": true, 
[default0]:    "allgather_bucket_size": 5.000000e+08, 
[default0]:    "overlap_comm": false, 
[default0]:    "load_from_fp32_weights": true, 
[default0]:    "elastic_checkpoint": false, 
[default0]:    "offload_param": null, 
[default0]:    "offload_optimizer": null, 
[default0]:    "sub_group_size": 1.000000e+09, 
[default0]:    "prefetch_bucket_size": 5.000000e+07, 
[default0]:    "param_persistence_threshold": 1.000000e+05, 
[default0]:    "max_live_parameters": 1.000000e+09, 
[default0]:    "max_reuse_distance": 1.000000e+09, 
[default0]:    "gather_16bit_weights_on_model_save": false, 
[default0]:    "ignore_unused_parameters": true, 
[default0]:    "round_robin_gradients": false, 
[default0]:    "legacy_stage1": false
[default0]:}
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   zero_enabled ................. False
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print]   zero_optimization_stage ...... 0
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1063:print]   json = {
[default0]:    "train_micro_batch_size_per_gpu": 2, 
[default0]:    "train_batch_size": 2.048000e+03, 
[default0]:    "gradient_clipping": 1.0, 
[default0]:    "zero_optimization": {
[default0]:        "stage": 0
[default0]:    }, 
[default0]:    "bf16": {
[default0]:        "enabled": true
[default0]:    }, 
[default0]:    "steps_per_print": 2.000000e+03, 
[default0]:    "wall_clock_breakdown": false
[default0]:}
[default0]:[2022-03-03 05:45:14,875] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=128 micro_batch_size=2
[default1]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=129 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=128 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=130 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=131 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=226 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=224 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=227 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=225 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=289 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=288 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=291 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=290 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=34 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=32 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=35 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=33 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=98 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=97 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=96 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=64 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=66 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=99 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=65 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=67 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=193 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=194 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=195 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=192 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=352 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,693] [INFO] [engine.py:151:__init__] RANK=322 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=320 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=323 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=321 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 05:45:16,693] [INFO] [engine.py:151:__init__] RANK=257 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=162 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=161 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=160 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=163 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=258 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=256 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=259 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=354 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=353 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=355 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=2 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=1 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=3 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default4]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:time (ms) | load-checkpoint: 8.35
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default2]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default4]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default0]:WARNING: could not find the metadata file /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints 
[default0]:    will not load any checkpoints and will start from random
[default0]:estimated model parameters: 191.162474496
[default0]:estimated model parameters without embeddings: 148.003086336
[default0]:[after model, optimizer, and learning rate scheduler are built] datetime: 2022-03-03 05:45:17 
[default0]:> building train, validation, and test datasets ...
[default0]: > datasets target sizes (minimum size):
[default0]:    train:      220000000
[default0]:    validation: 2641920
[default0]:    test:       20480
[default0]:> building train, validation, and test datasets for GPT ...
[default0]: > building dataset index ...
[default0]:/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/utils.py:280: UserWarning: Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings
[default0]:  warnings.warn("Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings")
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.101499 seconds
[default0]:    number of documents: 1276214
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 1211127) total of 1211127 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (388379) is smaller than 95.0% of number of samples per epoch (471556), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 2.342541
[default0]:    using:
[default0]:     number of documents:       1211127
[default0]:     number of epochs:          41
[default0]:     sequence length:           2048
[default0]:     total number of samples:   19333817
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.257259
[default0]: > building shuffle index with split [0, 18862261) and [18862261, 19333817) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.608458
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.013 seconds
[default0]:    total number of samples: 19333818
[default0]:    total number of epochs: 41
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.014306 seconds
[default0]:    number of documents: 2218089
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 2104966) total of 2104966 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (190457) is smaller than 95.0% of number of samples per epoch (209202), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 2.091640
[default0]:    using:
[default0]:     number of documents:       2104966
[default0]:     number of epochs:          22
[default0]:     sequence length:           2048
[default0]:     total number of samples:   4602460
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.130691
[default0]: > building shuffle index with split [0, 4393257) and [4393257, 4602460) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.105553
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.015 seconds
[default0]:    total number of samples: 4602461
[default0]:    total number of epochs: 22
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.019053 seconds
[default0]:    number of documents: 14716427
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 13965889) total of 13965889 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (774480) is smaller than 95.0% of number of samples per epoch (8932197), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 2.399262
[default0]:    using:
[default0]:     number of documents:       13965889
[default0]:     number of epochs:          4
[default0]:     sequence length:           2048
[default0]:     total number of samples:   35728791
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.862422
[default0]: > building shuffle index with split [0, 26796593) and [26796593, 35728791) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 1.125560
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.015 seconds
[default0]:    total number of samples: 35728792
[default0]:    total number of epochs: 4
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.059332 seconds
[default0]:    number of documents: 2767535
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 2626391) total of 2626391 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (322204) is smaller than 95.0% of number of samples per epoch (1004978), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 3.681606
[default0]:    using:
[default0]:     number of documents:       2626391
[default0]:     number of epochs:          28
[default0]:     sequence length:           2048
[default0]:     total number of samples:   28139392
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.438673
[default0]: > building shuffle index with split [0, 27134414) and [27134414, 28139392) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.986919
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.016 seconds
[default0]:    total number of samples: 28139393
[default0]:    total number of epochs: 28
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.013899 seconds
[default0]:    number of documents: 786245
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 746147) total of 746147 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (2279) is smaller than 95.0% of number of samples per epoch (30472), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.569235
[default0]:    using:
[default0]:     number of documents:       746147
[default0]:     number of epochs:          22
[default0]:     sequence length:           2048
[default0]:     total number of samples:   670403
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.032689
[default0]: > building shuffle index with split [0, 639930) and [639930, 670403) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.015091
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.010 seconds
[default0]:    total number of samples: 670404
[default0]:    total number of epochs: 22
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.013124 seconds
[default0]:    number of documents: 1748556
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 1659380) total of 1659380 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (118198) is smaller than 95.0% of number of samples per epoch (499143), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 4.923787
[default0]:    using:
[default0]:     number of documents:       1659380
[default0]:     number of epochs:          56
[default0]:     sequence length:           2048
[default0]:     total number of samples:   27952019
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.411989
[default0]: > building shuffle index with split [0, 27452875) and [27452875, 27952019) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.987607
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.016 seconds
[default0]:    total number of samples: 27952020
[default0]:    total number of epochs: 56
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.028527 seconds
[default0]:    number of documents: 29464287
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 27961608) total of 27961608 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (286305) is smaller than 95.0% of number of samples per epoch (348542), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 68.374838
[default0]:    using:
[default0]:     number of documents:       27961608
[default0]:     number of epochs:          42
[default0]:     sequence length:           2048
[default0]:     total number of samples:   14638799
[default0]: > elasped time to build and save sample-idx mapping (seconds): 10.501170
[default0]: > building shuffle index with split [0, 14290257) and [14290257, 14638799) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.391336
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.019 seconds
[default0]:    total number of samples: 14638800
[default0]:    total number of epochs: 42
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.006352 seconds
[default0]:    number of documents: 38304059
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 36350552) total of 36350552 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (24801) is smaller than 95.0% of number of samples per epoch (593669), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 101.234838
[default0]:    using:
[default0]:     number of documents:       36350552
[default0]:     number of epochs:          46
[default0]:     sequence length:           2048
[default0]:     total number of samples:   27308814
[default0]: > elasped time to build and save sample-idx mapping (seconds): 15.445697
[default0]: > building shuffle index with split [0, 26715144) and [26715144, 27308814) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.968035
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.022 seconds
[default0]:    total number of samples: 27308815
[default0]:    total number of epochs: 46
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.003736 seconds
[default0]:    number of documents: 729667
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 692454) total of 692454 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (294445) is smaller than 95.0% of number of samples per epoch (313064), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.496277
[default0]:    using:
[default0]:     number of documents:       692454
[default0]:     number of epochs:          22
[default0]:     sequence length:           2048
[default0]:     total number of samples:   6887420
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.105572
[default0]: > building shuffle index with split [0, 6574355) and [6574355, 6887420) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.151943
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.013 seconds
[default0]:    total number of samples: 6887421
[default0]:    total number of epochs: 22
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.017804 seconds
[default0]:    number of documents: 24265522
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 23027980) total of 23027980 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (159718) is smaller than 95.0% of number of samples per epoch (412173), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 33.578622
[default0]:    using:
[default0]:     number of documents:       23027980
[default0]:     number of epochs:          25
[default0]:     sequence length:           2048
[default0]:     total number of samples:   10304342
[default0]: > elasped time to build and save sample-idx mapping (seconds): 5.224553
[default0]: > building shuffle index with split [0, 9892169) and [9892169, 10304342) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.236011
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.016 seconds
[default0]:    total number of samples: 10304343
[default0]:    total number of epochs: 25
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.009976 seconds
[default0]:    number of documents: 9587455
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 9098495) total of 9098495 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (2061556) is smaller than 95.0% of number of samples per epoch (2892475), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 4.566206
[default0]:    using:
[default0]:     number of documents:       9098495
[default0]:     number of epochs:          10
[default0]:     sequence length:           2048
[default0]:     total number of samples:   28924754
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.934278
[default0]: > building shuffle index with split [0, 26032279) and [26032279, 28924754) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.985260
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.017 seconds
[default0]:    total number of samples: 28924755
[default0]:    total number of epochs: 10
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.021048 seconds
[default0]:    number of documents: 4335929
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 4114797) total of 4114797 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (362105) is smaller than 95.0% of number of samples per epoch (2720896), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 2.043312
[default0]:    using:
[default0]:     number of documents:       4114797
[default0]:     number of epochs:          11
[default0]:     sequence length:           2048
[default0]:     total number of samples:   29929865
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.432344
[default0]: > building shuffle index with split [0, 27208968) and [27208968, 29929865) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 1.032287
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.015 seconds
[default0]:    total number of samples: 29929866
[default0]:    total number of epochs: 11
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.006202 seconds
[default0]:    number of documents: 149731
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 142095) total of 142095 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (1829) is smaller than 95.0% of number of samples per epoch (7103), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.060044
[default0]:    using:
[default0]:     number of documents:       142095
[default0]:     number of epochs:          18
[default0]:     sequence length:           2048
[default0]:     total number of samples:   127854
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.017316
[default0]: > building shuffle index with split [0, 120751) and [120751, 127854) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.004353
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.009 seconds
[default0]:    total number of samples: 127855
[default0]:    total number of epochs: 18
[default0]:> building indices for blendable datasets ...
[default0]: > sample ratios:
[default0]:   dataset 0, input: 0.0870676, achieved: 0.0870676
[default0]:   dataset 1, input: 0.0207314, achieved: 0.0207314
[default0]:   dataset 2, input: 0.1247, achieved: 0.1247
[default0]:   dataset 3, input: 0.124182, achieved: 0.124182
[default0]:   dataset 4, input: 0.0029046, achieved: 0.0029046
[default0]:   dataset 5, input: 0.1247, achieved: 0.1247
[default0]:   dataset 6, input: 0.0659275, achieved: 0.0659275
[default0]:   dataset 7, input: 0.120941, achieved: 0.120941
[default0]:   dataset 8, input: 0.0310665, achieved: 0.0310665
[default0]:   dataset 9, input: 0.0454631, achieved: 0.0454631
[default0]:   dataset 10, input: 0.127064, achieved: 0.127064
[default0]:   dataset 11, input: 0.1247, achieved: 0.1247
[default0]:   dataset 12, input: 0.000554406, achieved: 0.000554405
[default0]:> elapsed time for building blendable dataset indices: 4.04 (sec)
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002366 seconds
[default0]:    number of documents: 1276214
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [1211127, 1274938) total of 63811 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (3428) is smaller than 95.0% of number of samples per epoch (13396), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.026619
[default0]:    using:
[default0]:     number of documents:       63811
[default0]:     number of epochs:          18
[default0]:     sequence length:           2048
[default0]:     total number of samples:   241145
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.005174
[default0]: > building shuffle index with split [0, 227748) and [227748, 241145) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.006570
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.003 seconds
[default0]:    total number of samples: 241146
[default0]:    total number of epochs: 18
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002403 seconds
[default0]:    number of documents: 2218089
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [2104966, 2215871) total of 110905 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (10348) is smaller than 95.0% of number of samples per epoch (11174), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.015049
[default0]:    using:
[default0]:     number of documents:       110905
[default0]:     number of epochs:          5
[default0]:     sequence length:           2048
[default0]:     total number of samples:   55871
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.003057
[default0]: > building shuffle index with split [0, 44697) and [44697, 55871) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002864
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 55872
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.009845 seconds
[default0]:    number of documents: 14716427
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [13965889, 14701711) total of 735822 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > only one epoch required, setting separate_last_epoch to False
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.018397
[default0]:    using:
[default0]:     number of documents:       735822
[default0]:     number of epochs:          1
[default0]:     sequence length:           2048
[default0]:     total number of samples:   1880534
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.017694
[default0]: > building shuffle index with split [0, 1880534) and [1880534, 1880534) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.034689
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.009 seconds
[default0]:    total number of samples: 1880535
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.003629 seconds
[default0]:    number of documents: 2767535
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [2626391, 2764767) total of 138376 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (89572) is smaller than 95.0% of number of samples per epoch (240148), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.008387
[default0]:    using:
[default0]:     number of documents:       138376
[default0]:     number of epochs:          2
[default0]:     sequence length:           2048
[default0]:     total number of samples:   480296
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.005186
[default0]: > building shuffle index with split [0, 240148) and [240148, 480296) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.009917
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 480297
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002055 seconds
[default0]:    number of documents: 786245
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [746147, 785459) total of 39312 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (288) is smaller than 95.0% of number of samples per epoch (1060), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.009303
[default0]:    using:
[default0]:     number of documents:       39312
[default0]:     number of epochs:          8
[default0]:     sequence length:           2048
[default0]:     total number of samples:   8486
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001861
[default0]: > building shuffle index with split [0, 7425) and [7425, 8486) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.001641
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 8487
[default0]:    total number of epochs: 8
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002356 seconds
[default0]:    number of documents: 1748556
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [1659380, 1746807) total of 87427 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > only one epoch required, setting separate_last_epoch to False
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.004679
[default0]:    using:
[default0]:     number of documents:       87427
[default0]:     number of epochs:          1
[default0]:     sequence length:           2048
[default0]:     total number of samples:   907156
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.004551
[default0]: > building shuffle index with split [0, 907156) and [907156, 907156) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.017715
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 907157
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.009854 seconds
[default0]:    number of documents: 29464287
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [27961608, 29434823) total of 1473215 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (3929) is smaller than 95.0% of number of samples per epoch (15556), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.595725
[default0]:    using:
[default0]:     number of documents:       1473215
[default0]:     number of epochs:          12
[default0]:     sequence length:           2048
[default0]:     total number of samples:   186674
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.030645
[default0]: > building shuffle index with split [0, 171117) and [171117, 186674) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.005210
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.007 seconds
[default0]:    total number of samples: 186675
[default0]:    total number of epochs: 12
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.009861 seconds
[default0]:    number of documents: 38304059
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [36350552, 38265755) total of 1915203 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (13053) is smaller than 95.0% of number of samples per epoch (25671), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.961445
[default0]:    using:
[default0]:     number of documents:       1915203
[default0]:     number of epochs:          13
[default0]:     sequence length:           2048
[default0]:     total number of samples:   333732
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.045044
[default0]: > building shuffle index with split [0, 308060) and [308060, 333732) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.008467
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.021 seconds
[default0]:    total number of samples: 333733
[default0]:    total number of epochs: 13
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001923 seconds
[default0]:    number of documents: 729667
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [692454, 728937) total of 36483 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (3876) is smaller than 95.0% of number of samples per epoch (19652), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.007104
[default0]:    using:
[default0]:     number of documents:       36483
[default0]:     number of epochs:          5
[default0]:     sequence length:           2048
[default0]:     total number of samples:   98263
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.003654
[default0]: > building shuffle index with split [0, 78610) and [78610, 98263) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.003778
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 98264
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.010075 seconds
[default0]:    number of documents: 24265522
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [23027980, 24241256) total of 1213276 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (13145) is smaller than 95.0% of number of samples per epoch (21513), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.200171
[default0]:    using:
[default0]:     number of documents:       1213276
[default0]:     number of epochs:          6
[default0]:     sequence length:           2048
[default0]:     total number of samples:   129079
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.015500
[default0]: > building shuffle index with split [0, 107566) and [107566, 129079) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.003973
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.007 seconds
[default0]:    total number of samples: 129080
[default0]:    total number of epochs: 6
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002745 seconds
[default0]:    number of documents: 9587455
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [9098495, 9577868) total of 479373 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (24678) is smaller than 95.0% of number of samples per epoch (156347), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.032711
[default0]:    using:
[default0]:     number of documents:       479373
[default0]:     number of epochs:          3
[default0]:     sequence length:           2048
[default0]:     total number of samples:   469041
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.009577
[default0]: > building shuffle index with split [0, 312694) and [312694, 469041) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.010022
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 469042
[default0]:    total number of epochs: 3
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002281 seconds
[default0]:    number of documents: 4335929
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [4114797, 4331593) total of 216796 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (131990) is smaller than 95.0% of number of samples per epoch (199104), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.010530
[default0]:    using:
[default0]:     number of documents:       216796
[default0]:     number of epochs:          2
[default0]:     sequence length:           2048
[default0]:     total number of samples:   398208
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.006054
[default0]: > building shuffle index with split [0, 199104) and [199104, 398208) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.008991
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.003 seconds
[default0]:    total number of samples: 398209
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.000586 seconds
[default0]:    number of documents: 149731
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [142095, 149581) total of 7486 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (188) is smaller than 95.0% of number of samples per epoch (257), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.003164
[default0]:    using:
[default0]:     number of documents:       7486
[default0]:     number of epochs:          6
[default0]:     sequence length:           2048
[default0]:     total number of samples:   1543
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001714
[default0]: > building shuffle index with split [0, 1285) and [1285, 1543) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.001593
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 1544
[default0]:    total number of epochs: 6
[default0]:> building indices for blendable datasets ...
[default0]: > sample ratios:
[default0]:   dataset 0, input: 0.0870676, achieved: 0.0870675
[default0]:   dataset 1, input: 0.0207314, achieved: 0.0207315
[default0]:   dataset 2, input: 0.1247, achieved: 0.1247
[default0]:   dataset 3, input: 0.124182, achieved: 0.124182
[default0]:   dataset 4, input: 0.0029046, achieved: 0.00290461
[default0]:   dataset 5, input: 0.1247, achieved: 0.1247
[default0]:   dataset 6, input: 0.0659275, achieved: 0.0659274
[default0]:   dataset 7, input: 0.120941, achieved: 0.120941
[default0]:   dataset 8, input: 0.0310665, achieved: 0.0310665
[default0]:   dataset 9, input: 0.0454631, achieved: 0.0454631
[default0]:   dataset 10, input: 0.127064, achieved: 0.127064
[default0]:   dataset 11, input: 0.1247, achieved: 0.1247
[default0]:   dataset 12, input: 0.000554406, achieved: 0.000554525
[default0]:> elapsed time for building blendable dataset indices: 0.09 (sec)
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.003739 seconds
[default0]:    number of documents: 1276214
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [1274938, 1276214) total of 1276 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > only one epoch required, setting separate_last_epoch to False
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.001685
[default0]:    using:
[default0]:     number of documents:       1276
[default0]:     number of epochs:          1
[default0]:     sequence length:           2048
[default0]:     total number of samples:   202914
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.002362
[default0]: > building shuffle index with split [0, 202914) and [202914, 202914) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.005445
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.003 seconds
[default0]:    total number of samples: 202915
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002196 seconds
[default0]:    number of documents: 2218089
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [2215871, 2218089) total of 2218 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (4) is smaller than 95.0% of number of samples per epoch (35), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.002126
[default0]:    using:
[default0]:     number of documents:       2218
[default0]:     number of epochs:          13
[default0]:     sequence length:           2048
[default0]:     total number of samples:   458
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.000543
[default0]: > building shuffle index with split [0, 423) and [423, 458) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.000888
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 459
[default0]:    total number of epochs: 13
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001928 seconds
[default0]:    number of documents: 14716427
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [14701711, 14716427) total of 14716 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > only one epoch required, setting separate_last_epoch to False
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.001813
[default0]:    using:
[default0]:     number of documents:       14716
[default0]:     number of epochs:          1
[default0]:     sequence length:           2048
[default0]:     total number of samples:   37486
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001882
[default0]: > building shuffle index with split [0, 37486) and [37486, 37486) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002234
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 37487
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001981 seconds
[default0]:    number of documents: 2767535
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [2764767, 2767535) total of 2768 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > only one epoch required, setting separate_last_epoch to False
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.002009
[default0]:    using:
[default0]:     number of documents:       2768
[default0]:     number of epochs:          1
[default0]:     sequence length:           2048
[default0]:     total number of samples:   9925
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001703
[default0]: > building shuffle index with split [0, 9925) and [9925, 9925) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002443
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 9926
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001934 seconds
[default0]:    number of documents: 786245
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [785459, 786245) total of 786 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (2) is smaller than 95.0% of number of samples per epoch (19), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.002496
[default0]:    using:
[default0]:     number of documents:       786
[default0]:     number of epochs:          4
[default0]:     sequence length:           2048
[default0]:     total number of samples:   78
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.000519
[default0]: > building shuffle index with split [0, 58) and [58, 78) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.000472
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 79
[default0]:    total number of epochs: 4
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001941 seconds
[default0]:    number of documents: 1748556
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [1746807, 1748556) total of 1749 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > only one epoch required, setting separate_last_epoch to False
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.002136
[default0]:    using:
[default0]:     number of documents:       1749
[default0]:     number of epochs:          1
[default0]:     sequence length:           2048
[default0]:     total number of samples:   34095
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.002118
[default0]: > building shuffle index with split [0, 34095) and [34095, 34095) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002671
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 34096
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002022 seconds
[default0]:    number of documents: 29464287
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [29434823, 29464287) total of 29464 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (42) is smaller than 95.0% of number of samples per epoch (328), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.005099
[default0]:    using:
[default0]:     number of documents:       29464
[default0]:     number of epochs:          5
[default0]:     sequence length:           2048
[default0]:     total number of samples:   1644
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.002219
[default0]: > building shuffle index with split [0, 1315) and [1315, 1644) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002041
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 1645
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001980 seconds
[default0]:    number of documents: 38304059
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [38265755, 38304059) total of 38304 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (268) is smaller than 95.0% of number of samples per epoch (555), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.006891
[default0]:    using:
[default0]:     number of documents:       38304
[default0]:     number of epochs:          5
[default0]:     sequence length:           2048
[default0]:     total number of samples:   2777
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001766
[default0]: > building shuffle index with split [0, 2222) and [2222, 2777) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.001649
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 2778
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001740 seconds
[default0]:    number of documents: 729667
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [728937, 729667) total of 730 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (283) is smaller than 95.0% of number of samples per epoch (357), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.001725
[default0]:    using:
[default0]:     number of documents:       730
[default0]:     number of epochs:          2
[default0]:     sequence length:           2048
[default0]:     total number of samples:   715
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001854
[default0]: > building shuffle index with split [0, 357) and [357, 715) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.000560
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 716
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001954 seconds
[default0]:    number of documents: 24265522
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [24241256, 24265522) total of 24266 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (62) is smaller than 95.0% of number of samples per epoch (437), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.003073
[default0]:    using:
[default0]:     number of documents:       24266
[default0]:     number of epochs:          3
[default0]:     sequence length:           2048
[default0]:     total number of samples:   1311
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001647
[default0]: > building shuffle index with split [0, 874) and [874, 1311) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002045
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 1312
[default0]:    total number of epochs: 3
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002023 seconds
[default0]:    number of documents: 9587455
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [9577868, 9587455) total of 9587 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (955) is smaller than 95.0% of number of samples per epoch (1661), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.001864
[default0]:    using:
[default0]:     number of documents:       9587
[default0]:     number of epochs:          2
[default0]:     sequence length:           2048
[default0]:     total number of samples:   3323
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.002170
[default0]: > building shuffle index with split [0, 1661) and [1661, 3323) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002701
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 3324
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002703 seconds
[default0]:    number of documents: 4335929
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [4331593, 4335929) total of 4336 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > only one epoch required, setting separate_last_epoch to False
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.002090
[default0]:    using:
[default0]:     number of documents:       4336
[default0]:     number of epochs:          1
[default0]:     sequence length:           2048
[default0]:     total number of samples:   3963
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001478
[default0]: > building shuffle index with split [0, 3963) and [3963, 3963) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002203
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 3964
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.000794 seconds
[default0]:    number of documents: 149731
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [149581, 149731) total of 150 documents
[default0]: > WARNING: could not find index map files, building the indices on rank 0 ...
[default0]: > last epoch number of samples (5) is smaller than 95.0% of number of samples per epoch (7), setting separate_last_epoch to True
[default0]: > elasped time to build and save doc-idx mapping (seconds): 0.000444
[default0]:    using:
[default0]:     number of documents:       150
[default0]:     number of epochs:          2
[default0]:     sequence length:           2048
[default0]:     total number of samples:   14
[default0]: > elasped time to build and save sample-idx mapping (seconds): 0.000561
[default0]: > building shuffle index with split [0, 7) and [7, 14) ...
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.000543
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 15
[default0]:    total number of epochs: 2
[default0]:> building indices for blendable datasets ...
[default0]: > sample ratios:
[default0]:   dataset 0, input: 0.0870676, achieved: 0.0870664
[default0]:   dataset 1, input: 0.0207314, achieved: 0.020733
[default0]:   dataset 2, input: 0.1247, achieved: 0.124699
[default0]:   dataset 3, input: 0.124182, achieved: 0.12418
[default0]:   dataset 4, input: 0.0029046, achieved: 0.0029059
[default0]:   dataset 5, input: 0.1247, achieved: 0.124699
[default0]:   dataset 6, input: 0.0659275, achieved: 0.0659284
[default0]:   dataset 7, input: 0.120941, achieved: 0.12094
[default0]:   dataset 8, input: 0.0310665, achieved: 0.0310676
[default0]:   dataset 9, input: 0.0454631, achieved: 0.0454632
[default0]:   dataset 10, input: 0.127064, achieved: 0.127063
[default0]:   dataset 11, input: 0.1247, achieved: 0.124699
[default0]:   dataset 12, input: 0.000554406, achieved: 0.000555736
[default0]:> elapsed time for building blendable dataset indices: 0.01 (sec)
[default0]:> finished creating GPT datasets ...
[default1]:[001-004] 177.6021B / 177.6021B
[default0]:[000-004] 177.6021B / 177.6021B
[default2]:[002-004] 177.6021B / 177.6021B
[default0]:[000-009] 177.6021B / 177.6021B
[default1]:[001-007] 177.6021B / 177.6021B
[default1]:[001-009] 177.6021B / 177.6021B
[default1]:[001-003] 177.6021B / 177.6021B
[default3]:[003-004] 177.6021B / 177.6021B
[default0]:[000-007] 177.6021B / 177.6021B
[default3]:[003-009] 177.6021B / 177.6021B
[default0]:[000-001] 177.6021B / 177.6021B
[default0]:[000-003] 177.6021B / 177.6021B
[default2]:[002-006] 177.6021B / 177.6021B
[default1]:[001-005] 177.6021B / 177.6021B
[default3]:[003-005] 177.6021B / 177.6021B
[default2]:[002-005] 177.6021B / 177.6021B
[default2]:[002-003] 177.6021B / 177.6021B
[default1]:[001-002] 177.6021B / 177.6021B
[default3]:[003-010] 177.6021B / 177.6021B
[default3]:[003-007] 177.6021B / 177.6021B
[default2]:[002-001] 177.6021B / 177.6021B
[default3]:[003-001] 177.6021B / 177.6021B
[default1]:[001-006] 177.6021B / 177.6021B
[default0]:[000-006] 177.6021B / 177.6021B
[default2]:[002-010] 177.6021B / 177.6021B
[default0]:[000-002] 177.6021B / 177.6021B
[default2]:[002-009] 177.6021B / 177.6021B
[default7]:time (ms) | model-and-optimizer-setup: 7875.21 | train/valid/test-data-iterators-setup: 280711.35
[default0]:[000-010] 177.6021B / 177.6021B
[default0]:[000-011] 191.1639B / 148.0045B
[default0]:[000-005] 177.6021B / 177.6021B
[default3]:[003-006] 177.6021B / 177.6021B
[default3]:[003-002] 177.6021B / 177.6021B
[default3]:[003-003] 177.6021B / 177.6021B
[default2]:[002-011] 191.1639B / 148.0045B
[default1]:[001-010] 177.6021B / 177.6021B
[default1]:[001-011] 191.1639B / 148.0045B
[default2]:[002-007] 177.6021B / 177.6021B
[default1]:[001-001] 177.6021B / 177.6021B
[default2]:[002-002] 177.6021B / 177.6021B
[default3]:[003-011] 191.1639B / 148.0045B
[default2]:[002-000] 191.1625B / 148.0031B
[default1]:[001-000] 191.1625B / 148.0031B
[default3]:[003-000] 191.1625B / 148.0031B
[default0]:[after dataloaders are built] datetime: 2022-03-03 05:49:58 
[default0]:done with setup ...
[default0]:training ...
[default0]:Number of parameters: [tensor rank - pipeline rank] w/ and w/o embeddings:
[default0]:[000-000] 191.1625B / 148.0031B
[default0]:[before the start of training step] datetime: 2022-03-03 05:49:58 
[default0]:[2022-03-03 05:49:58,855] [INFO] [checkpointing.py:547:forward] Activation Checkpointing Information
[default0]:[2022-03-03 05:49:58,855] [INFO] [checkpointing.py:548:forward] ----Partition Activations False, CPU CHECKPOINTING False
[default0]:[2022-03-03 05:49:58,855] [INFO] [checkpointing.py:551:forward] ----contiguous Memory Checkpointing False with 70 total layers
[default0]:[2022-03-03 05:49:58,855] [INFO] [checkpointing.py:554:forward] ----Synchronization False
[default0]:[2022-03-03 05:49:58,855] [INFO] [checkpointing.py:555:forward] ----Profiling time in checkpointing False
[default1]:[001-008] 177.6021B / 177.6021B
[default2]:[002-008] 177.6021B / 177.6021B
[default0]:[000-008] 177.6021B / 177.6021B
[default3]:[003-008] 177.6021B / 177.6021B
[default3]:[Rank 323] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default3]:[Rank 35] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default3]:[Rank 227] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default3]:[Rank 163] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default3]:[Rank 195] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default7]: iteration        1/  128728 | consumed samples:           16 | consumed tokens:        32768 | elapsed time per iteration (s): 40.31 | learning rate: 5.243E-09 | global batch size:    16 | lm loss: 6.158806E+01 | grad norm: 17.319 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.397 | TFLOPs: 3.04 |
[default3]:[Rank 67] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default3]:[Rank 99] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default3]:[Rank 355] (after 1 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48100.0 | max reserved: 48100.0
[default3]:[Rank 3] (after 1 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 47960.0 | max reserved: 47960.0
[default3]:[Rank 259] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default3]:[Rank 131] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default3]:[Rank 291] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default1]:[Rank 97] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default1]:[Rank 65] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default1]:[Rank 161] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default1]:[Rank 193] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default1]:[Rank 321] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default1]:[Rank 353] (after 1 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48100.0 | max reserved: 48100.0
[default1]:[Rank 33] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default1]:[Rank 1] (after 1 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 47960.0 | max reserved: 47960.0
[default1]:[Rank 257] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default1]:[Rank 129] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default1]:[Rank 225] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default1]:[Rank 289] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default0]:[Rank 256] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default0]:[Rank 128] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default0]:[Rank 288] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default0]:[Rank 224] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default0]:[Rank 32] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default0]:[Rank 96] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default0]:[Rank 192] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default0]:[Rank 64] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default0]:[Rank 320] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default0]:[Rank 352] (after 1 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48100.0 | max reserved: 48100.0
[default0]:[Rank 160] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default2]:[Rank 2] (after 1 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 47960.0 | max reserved: 47960.0
[default0]:[Rank 0] (after 1 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 47960.0 | max reserved: 47960.0
[default2]:[Rank 258] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default2]:[Rank 130] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default2]:[Rank 194] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default2]:[Rank 34] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default2]:[Rank 98] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default2]:[Rank 322] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default2]:[Rank 290] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default2]:[Rank 162] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default2]:[Rank 354] (after 1 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48100.0 | max reserved: 48100.0
[default2]:[Rank 226] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default2]:[Rank 66] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0
[default7]: iteration        2/  128728 | consumed samples:           32 | consumed tokens:        65536 | elapsed time per iteration (s): 14.54 | learning rate: 1.049E-08 | global batch size:    16 | lm loss: 6.161202E+01 | grad norm: 17.327 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.101 | TFLOPs: 8.43 |
[default7]: iteration        3/  128728 | consumed samples:           48 | consumed tokens:        98304 | elapsed time per iteration (s): 14.86 | learning rate: 1.573E-08 | global batch size:    16 | lm loss: 6.159873E+01 | grad norm: 17.827 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.076 | TFLOPs: 8.24 |
[default7]: iteration        4/  128728 | consumed samples:           64 | consumed tokens:       131072 | elapsed time per iteration (s): 14.80 | learning rate: 2.097E-08 | global batch size:    16 | lm loss: 6.156909E+01 | grad norm: 17.233 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.081 | TFLOPs: 8.28 |
[default7]: iteration        5/  128728 | consumed samples:           80 | consumed tokens:       163840 | elapsed time per iteration (s): 14.83 | learning rate: 2.621E-08 | global batch size:    16 | lm loss: 6.158672E+01 | grad norm: 17.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.079 | TFLOPs: 8.26 |
[default7]: iteration        6/  128728 | consumed samples:           96 | consumed tokens:       196608 | elapsed time per iteration (s): 14.79 | learning rate: 3.146E-08 | global batch size:    16 | lm loss: 6.160669E+01 | grad norm: 17.643 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.082 | TFLOPs: 8.28 |
[default7]: iteration        7/  128728 | consumed samples:          112 | consumed tokens:       229376 | elapsed time per iteration (s): 14.87 | learning rate: 3.670E-08 | global batch size:    16 | lm loss: 6.159612E+01 | grad norm: 17.265 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.076 | TFLOPs: 8.24 |
[default7]: iteration        8/  128728 | consumed samples:          128 | consumed tokens:       262144 | elapsed time per iteration (s): 14.78 | learning rate: 4.194E-08 | global batch size:    16 | lm loss: 6.157154E+01 | grad norm: 17.928 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.082 | TFLOPs: 8.29 |
[default7]: iteration        9/  128728 | consumed samples:          144 | consumed tokens:       294912 | elapsed time per iteration (s): 14.90 | learning rate: 4.719E-08 | global batch size:    16 | lm loss: 6.151357E+01 | grad norm: 18.992 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.074 | TFLOPs: 8.22 |
[default7]: iteration       10/  128728 | consumed samples:          160 | consumed tokens:       327680 | elapsed time per iteration (s): 14.84 | learning rate: 5.243E-08 | global batch size:    16 | lm loss: 6.143620E+01 | grad norm: 19.201 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.078 | TFLOPs: 8.26 |
[default7]: iteration       11/  128728 | consumed samples:          176 | consumed tokens:       360448 | elapsed time per iteration (s): 14.92 | learning rate: 5.767E-08 | global batch size:    16 | lm loss: 6.150426E+01 | grad norm: 20.036 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.072 | TFLOPs: 8.21 |
[default7]: iteration       12/  128728 | consumed samples:          192 | consumed tokens:       393216 | elapsed time per iteration (s): 14.78 | learning rate: 6.291E-08 | global batch size:    16 | lm loss: 6.130256E+01 | grad norm: 22.535 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.082 | TFLOPs: 8.29 |
[default7]: iteration       13/  128728 | consumed samples:          208 | consumed tokens:       425984 | elapsed time per iteration (s): 14.86 | learning rate: 6.816E-08 | global batch size:    16 | lm loss: 6.122111E+01 | grad norm: 23.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.077 | TFLOPs: 8.24 |
[default7]: iteration       14/  128728 | consumed samples:          224 | consumed tokens:       458752 | elapsed time per iteration (s): 14.80 | learning rate: 7.340E-08 | global batch size:    16 | lm loss: 6.115615E+01 | grad norm: 25.145 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.081 | TFLOPs: 8.28 |
[default7]: iteration       15/  128728 | consumed samples:          240 | consumed tokens:       491520 | elapsed time per iteration (s): 14.79 | learning rate: 7.864E-08 | global batch size:    16 | lm loss: 6.112857E+01 | grad norm: 24.395 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.082 | TFLOPs: 8.28 |
[default7]: iteration       16/  128728 | consumed samples:          256 | consumed tokens:       524288 | elapsed time per iteration (s): 14.80 | learning rate: 8.389E-08 | global batch size:    16 | lm loss: 5.982215E+01 | grad norm: 40.392 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.081 | TFLOPs: 8.28 |
[default7]: iteration       17/  128728 | consumed samples:          272 | consumed tokens:       557056 | elapsed time per iteration (s): 14.90 | learning rate: 8.913E-08 | global batch size:    16 | lm loss: 5.965714E+01 | grad norm: 43.187 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.074 | TFLOPs: 8.22 |
[default7]: iteration       18/  128728 | consumed samples:          288 | consumed tokens:       589824 | elapsed time per iteration (s): 14.98 | learning rate: 9.437E-08 | global batch size:    16 | lm loss: 5.951318E+01 | grad norm: 44.380 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.068 | TFLOPs: 8.18 |
[default7]: iteration       19/  128728 | consumed samples:          304 | consumed tokens:       622592 | elapsed time per iteration (s): 14.78 | learning rate: 9.961E-08 | global batch size:    16 | lm loss: 5.903408E+01 | grad norm: 48.915 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.082 | TFLOPs: 8.29 |
[default7]: iteration       20/  128728 | consumed samples:          320 | consumed tokens:       655360 | elapsed time per iteration (s): 14.87 | learning rate: 1.049E-07 | global batch size:    16 | lm loss: 5.875332E+01 | grad norm: 50.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.076 | TFLOPs: 8.24 |
[default7]: iteration       21/  128728 | consumed samples:          336 | consumed tokens:       688128 | elapsed time per iteration (s): 14.86 | learning rate: 1.101E-07 | global batch size:    16 | lm loss: 5.413025E+01 | grad norm: 85.585 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.076 | TFLOPs: 8.24 |
[default7]: iteration       22/  128728 | consumed samples:          352 | consumed tokens:       720896 | elapsed time per iteration (s): 14.92 | learning rate: 1.153E-07 | global batch size:    16 | lm loss: 5.085058E+01 | grad norm: 93.942 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.072 | TFLOPs: 8.21 |
[default7]: iteration       23/  128728 | consumed samples:          368 | consumed tokens:       753664 | elapsed time per iteration (s): 14.81 | learning rate: 1.206E-07 | global batch size:    16 | lm loss: 4.981078E+01 | grad norm: 96.976 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.080 | TFLOPs: 8.27 |
[default7]: iteration       24/  128728 | consumed samples:          384 | consumed tokens:       786432 | elapsed time per iteration (s): 14.82 | learning rate: 1.258E-07 | global batch size:    16 | lm loss: 4.871767E+01 | grad norm: 99.005 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.080 | TFLOPs: 8.27 |
[default7]: iteration       25/  128728 | consumed samples:          400 | consumed tokens:       819200 | elapsed time per iteration (s): 14.83 | learning rate: 1.311E-07 | global batch size:    16 | lm loss: 4.742308E+01 | grad norm: 101.865 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.079 | TFLOPs: 8.26 |
[default7]: iteration       26/  128728 | consumed samples:          416 | consumed tokens:       851968 | elapsed time per iteration (s): 14.79 | learning rate: 1.363E-07 | global batch size:    16 | lm loss: 4.459019E+01 | grad norm: 103.893 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.082 | TFLOPs: 8.28 |
[default7]: iteration       27/  128728 | consumed samples:          432 | consumed tokens:       884736 | elapsed time per iteration (s): 14.85 | learning rate: 1.416E-07 | global batch size:    16 | lm loss: 4.345989E+01 | grad norm: 103.374 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.077 | TFLOPs: 8.25 |
[default7]: iteration       28/  128728 | consumed samples:          448 | consumed tokens:       917504 | elapsed time per iteration (s): 14.86 | learning rate: 1.468E-07 | global batch size:    16 | lm loss: 4.248281E+01 | grad norm: 102.881 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.077 | TFLOPs: 8.25 |
[default7]: iteration       29/  128728 | consumed samples:          464 | consumed tokens:       950272 | elapsed time per iteration (s): 14.87 | learning rate: 1.520E-07 | global batch size:    16 | lm loss: 3.440926E+01 | grad norm: 90.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.076 | TFLOPs: 8.24 |
[default7]: iteration       30/  128728 | consumed samples:          480 | consumed tokens:       983040 | elapsed time per iteration (s): 14.79 | learning rate: 1.573E-07 | global batch size:    16 | lm loss: 3.089366E+01 | grad norm: 79.215 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.081 | TFLOPs: 8.28 |
[default7]: iteration       31/  128728 | consumed samples:          496 | consumed tokens:      1015808 | elapsed time per iteration (s): 14.79 | learning rate: 1.625E-07 | global batch size:    16 | lm loss: 2.933587E+01 | grad norm: 73.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.082 | TFLOPs: 8.28 |
[default7]: iteration       32/  128728 | consumed samples:          512 | consumed tokens:      1048576 | elapsed time per iteration (s): 14.83 | learning rate: 1.678E-07 | global batch size:    16 | lm loss: 2.763102E+01 | grad norm: 68.144 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.079 | TFLOPs: 8.26 |
[default7]: iteration       33/  128728 | consumed samples:          528 | consumed tokens:      1081344 | elapsed time per iteration (s): 14.77 | learning rate: 1.730E-07 | global batch size:    16 | lm loss: 2.619627E+01 | grad norm: 63.092 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.084 | TFLOPs: 8.30 |
[default7]: iteration       34/  128728 | consumed samples:          544 | consumed tokens:      1114112 | elapsed time per iteration (s): 14.77 | learning rate: 1.783E-07 | global batch size:    16 | lm loss: 2.509729E+01 | grad norm: 59.336 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.083 | TFLOPs: 8.29 |
[default7]: iteration       35/  128728 | consumed samples:          560 | consumed tokens:      1146880 | elapsed time per iteration (s): 14.86 | learning rate: 1.835E-07 | global batch size:    16 | lm loss: 2.208402E+01 | grad norm: 48.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.077 | TFLOPs: 8.25 |
[default7]: iteration       36/  128728 | consumed samples:          576 | consumed tokens:      1179648 | elapsed time per iteration (s): 14.79 | learning rate: 1.887E-07 | global batch size:    16 | lm loss: 2.048165E+01 | grad norm: 43.191 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.082 | TFLOPs: 8.28 |
[default7]: iteration       37/  128728 | consumed samples:          592 | consumed tokens:      1212416 | elapsed time per iteration (s): 14.90 | learning rate: 1.940E-07 | global batch size:    16 | lm loss: 1.919763E+01 | grad norm: 38.862 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.074 | TFLOPs: 8.22 |
[default7]: iteration       38/  128728 | consumed samples:          608 | consumed tokens:      1245184 | elapsed time per iteration (s): 14.76 | learning rate: 1.992E-07 | global batch size:    16 | lm loss: 1.835708E+01 | grad norm: 35.805 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.084 | TFLOPs: 8.30 |
[default7]: iteration       39/  128728 | consumed samples:          624 | consumed tokens:      1277952 | elapsed time per iteration (s): 14.85 | learning rate: 2.045E-07 | global batch size:    16 | lm loss: 1.753267E+01 | grad norm: 33.059 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.077 | TFLOPs: 8.25 |
[default7]: iteration       40/  128728 | consumed samples:          640 | consumed tokens:      1310720 | elapsed time per iteration (s): 14.90 | learning rate: 2.097E-07 | global batch size:    16 | lm loss: 1.669237E+01 | grad norm: 30.094 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.074 | TFLOPs: 8.22 |
[default7]: iteration       41/  128728 | consumed samples:          656 | consumed tokens:      1343488 | elapsed time per iteration (s): 14.76 | learning rate: 2.150E-07 | global batch size:    16 | lm loss: 1.602054E+01 | grad norm: 27.577 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.084 | TFLOPs: 8.30 |
[default7]: iteration       42/  128728 | consumed samples:          672 | consumed tokens:      1376256 | elapsed time per iteration (s): 14.78 | learning rate: 2.202E-07 | global batch size:    16 | lm loss: 1.524471E+01 | grad norm: 24.230 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.083 | TFLOPs: 8.29 |
[default7]: iteration       43/  128728 | consumed samples:          688 | consumed tokens:      1409024 | elapsed time per iteration (s): 14.73 | learning rate: 2.254E-07 | global batch size:    16 | lm loss: 1.467593E+01 | grad norm: 21.341 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.087 | TFLOPs: 8.32 |
[default7]: iteration       44/  128728 | consumed samples:          704 | consumed tokens:      1441792 | elapsed time per iteration (s): 14.84 | learning rate: 2.307E-07 | global batch size:    16 | lm loss: 1.369703E+01 | grad norm: 15.454 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.078 | TFLOPs: 8.26 |
[default7]: iteration       45/  128728 | consumed samples:          720 | consumed tokens:      1474560 | elapsed time per iteration (s): 14.87 | learning rate: 2.359E-07 | global batch size:    16 | lm loss: 1.321554E+01 | grad norm: 12.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.076 | TFLOPs: 8.24 |
[default7]: iteration       46/  128728 | consumed samples:          736 | consumed tokens:      1507328 | elapsed time per iteration (s): 14.77 | learning rate: 2.412E-07 | global batch size:    16 | lm loss: 1.281323E+01 | grad norm: 11.024 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.084 | TFLOPs: 8.30 |
[default7]: iteration       47/  128728 | consumed samples:          752 | consumed tokens:      1540096 | elapsed time per iteration (s): 14.91 | learning rate: 2.464E-07 | global batch size:    16 | lm loss: 1.263766E+01 | grad norm: 8.627 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.073 | TFLOPs: 8.22 |
[default7]: iteration       48/  128728 | consumed samples:          768 | consumed tokens:      1572864 | elapsed time per iteration (s): 14.98 | learning rate: 2.517E-07 | global batch size:    16 | lm loss: 1.236759E+01 | grad norm: 4.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.068 | TFLOPs: 8.18 |
[default7]: iteration       49/  128728 | consumed samples:          784 | consumed tokens:      1605632 | elapsed time per iteration (s): 14.91 | learning rate: 2.569E-07 | global batch size:    16 | lm loss: 1.218161E+01 | grad norm: 3.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.073 | TFLOPs: 8.22 |
[default7]: iteration       50/  128728 | consumed samples:          800 | consumed tokens:      1638400 | elapsed time per iteration (s): 14.78 | learning rate: 2.621E-07 | global batch size:    16 | lm loss: 1.218425E+01 | grad norm: 2.795 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.083 | TFLOPs: 8.29 |
[default0]:saving checkpoint at iteration      50 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:[2022-03-03 06:03:02,742] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/mp_rank_00_model_states.pt
[default1]:[2022-03-03 06:03:02,974] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/mp_rank_01_model_states.pt
[default7]:[2022-03-03 06:03:08,125] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt
[default1]:[2022-03-03 06:03:08,268] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 06:03:08,588] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default0]:[2022-03-03 06:03:08,609] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 06:03:08,602] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default5]:[2022-03-03 06:03:08,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default1]:[2022-03-03 06:03:08,802] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 06:03:08,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default6]:[2022-03-03 06:03:08,837] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt
[default5]:[2022-03-03 06:03:08,964] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt
[default6]:[2022-03-03 06:03:09,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt
[default7]:[2022-03-03 06:03:09,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default0]:[2022-03-03 06:03:09,190] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default1]:[2022-03-03 06:03:09,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt
[default5]:[2022-03-03 06:03:09,154] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default6]:[2022-03-03 06:03:09,177] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 06:03:09,309] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 06:03:09,343] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 06:03:09,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt
[default4]:[2022-03-03 06:03:09,531] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 06:03:09,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 06:03:09,541] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default7]:[2022-03-03 06:03:09,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt
[default4]:[2022-03-03 06:03:09,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt
[default2]:[2022-03-03 06:03:11,155] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 06:03:11,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt
[default7]:[2022-03-03 06:03:11,241] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 06:03:11,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default6]:[2022-03-03 06:03:11,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default5]:[2022-03-03 06:03:11,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt
[default3]:[2022-03-03 06:03:11,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default5]:[2022-03-03 06:03:11,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default7]:[2022-03-03 06:03:11,637] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt
[default4]:[2022-03-03 06:03:11,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 06:03:11,672] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt
[default5]:[2022-03-03 06:03:11,758] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt
[default5]:[2022-03-03 06:03:11,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 06:03:11,898] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 06:03:11,791] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default2]:[2022-03-03 06:03:11,859] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt
[default1]:[2022-03-03 06:03:11,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 06:03:11,949] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt
[default0]:[2022-03-03 06:03:12,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt
[default4]:[2022-03-03 06:03:11,950] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt
[default4]:[2022-03-03 06:03:12,036] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 06:03:12,064] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 06:03:12,118] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default6]:[2022-03-03 06:03:12,182] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default0]:[2022-03-03 06:03:12,140] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt
[default3]:[2022-03-03 06:03:12,082] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt
[default1]:[2022-03-03 06:03:12,225] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 06:03:12,297] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt
[default2]:[2022-03-03 06:03:12,311] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt
[default4]:[2022-03-03 06:03:12,293] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt
[default2]:[2022-03-03 06:03:12,376] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt
[default3]:[2022-03-03 06:03:12,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt
[default4]:[2022-03-03 06:03:12,366] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
[default1]:[2022-03-03 06:03:12,379] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 06:03:12,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt
[default4]:[2022-03-03 06:03:12,424] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt
[default6]:[2022-03-03 06:03:12,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 06:03:12,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt
[default1]:[2022-03-03 06:03:12,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt
[default3]:[2022-03-03 06:03:12,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt
[default0]:[2022-03-03 06:03:12,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt
[default2]:[2022-03-03 06:03:12,594] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt
[default2]:[2022-03-03 06:03:12,550] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 06:03:12,624] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt
[default6]:[2022-03-03 06:03:12,662] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt
[default6]:[2022-03-03 06:03:12,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 06:03:12,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt
[default6]:[2022-03-03 06:03:12,698] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt
[default2]:[2022-03-03 06:03:12,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt
[default5]:[2022-03-03 06:03:12,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt
[default6]:[2022-03-03 06:03:12,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 06:03:13,012] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default1]:[2022-03-03 06:03:12,990] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt
[default4]:[2022-03-03 06:03:13,031] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt
[default2]:[2022-03-03 06:03:13,099] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 06:03:13,101] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 06:03:13,116] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt
[default3]:[2022-03-03 06:03:13,145] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt
[default7]:[2022-03-03 06:03:13,130] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default7]:[2022-03-03 06:03:13,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt
[default3]:[2022-03-03 06:03:13,338] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 06:03:13,316] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt
[default6]:[2022-03-03 06:03:13,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 06:03:13,324] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 06:03:13,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt
[default6]:[2022-03-03 06:03:13,414] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt
[default4]:[2022-03-03 06:03:13,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 06:03:13,467] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt
[default5]:[2022-03-03 06:03:13,429] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt
[default1]:[2022-03-03 06:03:13,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt
[default2]:[2022-03-03 06:03:13,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 06:03:13,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 06:03:13,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default0]:[2022-03-03 06:03:13,444] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt
[default7]:[2022-03-03 06:03:13,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 06:03:13,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt
[default3]:[2022-03-03 06:03:13,596] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt
[default7]:[2022-03-03 06:03:13,615] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt
[default5]:[2022-03-03 06:03:13,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 06:03:13,687] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt
[default3]:[2022-03-03 06:03:13,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default6]:[2022-03-03 06:03:13,736] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default6]:[2022-03-03 06:03:13,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 06:03:13,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt
[default6]:[2022-03-03 06:03:13,785] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt
[default2]:[2022-03-03 06:03:13,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt
[default5]:[2022-03-03 06:03:13,849] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 06:03:13,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default4]:[2022-03-03 06:03:13,800] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default1]:[2022-03-03 06:03:13,884] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 06:03:13,964] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 06:03:13,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt
[default2]:[2022-03-03 06:03:14,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt
[default4]:[2022-03-03 06:03:14,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default0]:[2022-03-03 06:03:14,166] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
[default0]:[2022-03-03 06:03:14,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt
[default0]:[2022-03-03 06:03:14,609] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default1]:[2022-03-03 06:03:14,699] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt
[default0]:[2022-03-03 06:03:14,680] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default3]:[2022-03-03 06:03:14,714] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default0]:[2022-03-03 06:03:14,767] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default1]:[2022-03-03 06:03:14,757] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default3]:[2022-03-03 06:03:14,815] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default2]:[2022-03-03 06:03:14,793] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt
[default6]:[2022-03-03 06:03:14,711] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default1]:[2022-03-03 06:03:14,789] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default2]:[2022-03-03 06:03:14,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default1]:[2022-03-03 06:03:14,880] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt
[default2]:[2022-03-03 06:03:14,809] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default4]:[2022-03-03 06:03:14,821] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default4]:[2022-03-03 06:03:14,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default0]:[2022-03-03 06:03:14,850] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
[default4]:[2022-03-03 06:03:14,894] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default1]:[2022-03-03 06:03:14,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt
[default5]:[2022-03-03 06:03:14,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default0]:[2022-03-03 06:03:14,955] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt
[default3]:[2022-03-03 06:03:14,955] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt
[default5]:[2022-03-03 06:03:15,018] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 06:03:15,018] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt
[default1]:[2022-03-03 06:03:14,981] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 06:03:15,081] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 06:03:15,103] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 06:03:15,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default1]:[2022-03-03 06:03:15,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 06:03:15,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt
[default2]:[2022-03-03 06:03:15,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default7]:[2022-03-03 06:03:15,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default1]:[2022-03-03 06:03:15,294] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default7]:[2022-03-03 06:03:15,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt
[default2]:[2022-03-03 06:03:15,345] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 06:03:15,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 06:03:15,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default6]:[2022-03-03 06:03:15,413] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 06:03:15,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt
[default3]:[2022-03-03 06:03:15,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default5]:[2022-03-03 06:03:15,423] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default5]:[2022-03-03 06:03:15,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 06:03:15,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt
[default3]:[2022-03-03 06:03:15,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default2]:[2022-03-03 06:03:15,495] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default4]:[2022-03-03 06:03:15,483] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt
[default5]:[2022-03-03 06:03:15,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt
[default1]:[2022-03-03 06:03:15,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt
[default5]:[2022-03-03 06:03:15,520] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 06:03:15,555] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt
[default3]:[2022-03-03 06:03:15,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt
[default7]:[2022-03-03 06:03:15,545] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default7]:[2022-03-03 06:03:15,646] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt
[default6]:[2022-03-03 06:03:15,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 06:03:15,666] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 06:03:15,673] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 06:03:15,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt
[default0]:[2022-03-03 06:03:15,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt
[default4]:[2022-03-03 06:03:15,661] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default5]:[2022-03-03 06:03:15,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt
[default1]:[2022-03-03 06:03:15,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default3]:[2022-03-03 06:03:15,920] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt
[default3]:[2022-03-03 06:03:15,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default4]:[2022-03-03 06:03:15,849] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 06:03:15,904] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default5]:[2022-03-03 06:03:16,017] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt
[default1]:[2022-03-03 06:03:15,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt
[default6]:[2022-03-03 06:03:15,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt
[default0]:[2022-03-03 06:03:15,942] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 06:03:15,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 06:03:15,971] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt
[default2]:[2022-03-03 06:03:15,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 06:03:16,068] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt
[default4]:[2022-03-03 06:03:16,081] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 06:03:16,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt
[default7]:[2022-03-03 06:03:16,129] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt
[default4]:[2022-03-03 06:03:16,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default4]:[2022-03-03 06:03:16,121] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt
[default0]:[2022-03-03 06:03:16,169] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt
[default4]:[2022-03-03 06:03:16,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 06:03:16,241] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default0]:[2022-03-03 06:03:16,279] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default7]:[2022-03-03 06:03:16,285] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default5]:[2022-03-03 06:03:16,271] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 06:03:16,368] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt
[default3]:[2022-03-03 06:03:16,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 06:03:16,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt
[default4]:[2022-03-03 06:03:16,392] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt
[default0]:[2022-03-03 06:03:16,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt
[default1]:[2022-03-03 06:03:16,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 06:03:16,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
[default5]:[2022-03-03 06:03:16,507] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
[default1]:[2022-03-03 06:03:16,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt
[default6]:[2022-03-03 06:03:16,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default1]:[2022-03-03 06:03:16,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default6]:[2022-03-03 06:03:16,476] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 06:03:16,517] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default2]:[2022-03-03 06:03:16,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default1]:[2022-03-03 06:03:16,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 06:03:16,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt
[default1]:[2022-03-03 06:03:16,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default6]:[2022-03-03 06:03:16,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 06:03:16,824] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default0]:[2022-03-03 06:03:16,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt
[default7]:[2022-03-03 06:03:16,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt
[default3]:[2022-03-03 06:03:16,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default6]:[2022-03-03 06:03:17,015] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 06:03:16,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default5]:[2022-03-03 06:03:17,042] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default2]:[2022-03-03 06:03:17,075] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 06:03:17,137] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 06:03:17,139] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 06:03:17,106] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default3]:[2022-03-03 06:03:17,167] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default3]:[2022-03-03 06:03:17,232] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt
[default7]:[2022-03-03 06:03:17,193] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt
[default5]:[2022-03-03 06:03:17,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 06:03:17,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt
[default6]:[2022-03-03 06:03:17,357] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt
[default6]:[2022-03-03 06:03:17,357] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default5]:[2022-03-03 06:03:17,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default2]:[2022-03-03 06:03:17,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt
[default4]:[2022-03-03 06:03:17,412] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default0]:[2022-03-03 06:03:17,414] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 06:03:17,456] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default0]:[2022-03-03 06:03:17,388] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default2]:[2022-03-03 06:03:17,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt
[default0]:[2022-03-03 06:03:17,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt
[default6]:[2022-03-03 06:03:17,660] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 06:03:17,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt
[default1]:[2022-03-03 06:03:17,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt
[default7]:[2022-03-03 06:03:17,707] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 06:03:17,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt
[default6]:[2022-03-03 06:03:17,752] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 06:03:17,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default6]:[2022-03-03 06:03:17,897] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt
[default7]:[2022-03-03 06:03:17,879] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 06:03:17,896] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 06:03:17,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 06:03:17,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt
[default2]:[2022-03-03 06:03:18,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 06:03:18,037] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default2]:[2022-03-03 06:03:18,077] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt
[default1]:[2022-03-03 06:03:18,094] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt
[default0]:[2022-03-03 06:03:18,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt
[default4]:[2022-03-03 06:03:18,141] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 06:03:18,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default1]:[2022-03-03 06:03:18,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default0]:[2022-03-03 06:03:18,202] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 06:03:18,189] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default7]:[2022-03-03 06:03:18,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default3]:[2022-03-03 06:03:18,208] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt
[default1]:[2022-03-03 06:03:18,291] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 06:03:18,347] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default2]:[2022-03-03 06:03:18,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default0]:[2022-03-03 06:03:18,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default2]:[2022-03-03 06:03:18,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt
[default3]:[2022-03-03 06:03:18,423] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default5]:[2022-03-03 06:03:18,444] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default7]:[2022-03-03 06:03:18,450] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default7]:[2022-03-03 06:03:18,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default4]:[2022-03-03 06:03:18,441] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt
[default7]:[2022-03-03 06:03:18,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 06:03:18,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt
[default3]:[2022-03-03 06:03:18,560] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 06:03:18,588] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt
[default6]:[2022-03-03 06:03:18,566] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default4]:[2022-03-03 06:03:18,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default4]:[2022-03-03 06:03:18,613] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 06:03:18,578] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 06:03:18,630] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default3]:[2022-03-03 06:03:18,600] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt
[default7]:[2022-03-03 06:03:18,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 06:03:18,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt
[default7]:[2022-03-03 06:03:18,635] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt
[default1]:[2022-03-03 06:03:18,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default0]:[2022-03-03 06:03:18,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt
[default5]:[2022-03-03 06:03:18,720] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt
[default6]:[2022-03-03 06:03:18,823] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt
[default5]:[2022-03-03 06:03:18,764] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 06:03:18,799] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default1]:[2022-03-03 06:03:18,830] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 06:03:18,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt
[default3]:[2022-03-03 06:03:18,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default5]:[2022-03-03 06:03:18,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 06:03:18,949] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt
[default7]:[2022-03-03 06:03:18,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt
[default6]:[2022-03-03 06:03:18,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt
[default1]:[2022-03-03 06:03:18,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt
[default5]:[2022-03-03 06:03:19,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 06:03:19,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt
[default6]:[2022-03-03 06:03:19,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 06:03:19,234] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt
[default5]:[2022-03-03 06:03:19,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 06:03:19,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default0]:[2022-03-03 06:03:19,225] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 06:03:19,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 06:03:19,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt
[default5]:[2022-03-03 06:03:19,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 06:03:19,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default4]:[2022-03-03 06:03:19,607] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt
[default2]:[2022-03-03 06:03:19,627] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt
[default6]:[2022-03-03 06:03:19,616] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default3]:[2022-03-03 06:03:19,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default7]:[2022-03-03 06:03:19,571] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default2]:[2022-03-03 06:03:19,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default2]:[2022-03-03 06:03:19,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default4]:[2022-03-03 06:03:19,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default0]:[2022-03-03 06:03:19,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default5]:[2022-03-03 06:03:20,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default2]:[2022-03-03 06:03:20,039] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt
[default4]:[2022-03-03 06:03:20,099] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default3]:[2022-03-03 06:03:20,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 06:03:20,178] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default5]:[2022-03-03 06:03:20,285] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 06:03:20,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default2]:[2022-03-03 06:03:20,260] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt
[default0]:[2022-03-03 06:03:20,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt
[default1]:[2022-03-03 06:03:20,522] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt
[default7]:[2022-03-03 06:03:20,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 06:03:20,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt
[default6]:[2022-03-03 06:03:20,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt
[default3]:[2022-03-03 06:03:20,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt
[default1]:[2022-03-03 06:03:20,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default6]:[2022-03-03 06:03:20,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default3]:[2022-03-03 06:03:20,736] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 06:03:20,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt
[default1]:[2022-03-03 06:03:20,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt
[default2]:[2022-03-03 06:03:20,883] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt
[default3]:[2022-03-03 06:03:21,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt
[default2]:[2022-03-03 06:03:21,228] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt
[default6]:[2022-03-03 06:03:21,360] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt
[default4]:[2022-03-03 06:03:21,419] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt
[default0]:[2022-03-03 06:03:21,397] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default5]:[2022-03-03 06:03:21,435] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 06:03:21,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default2]:[2022-03-03 06:03:21,497] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default5]:[2022-03-03 06:03:21,550] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 06:03:21,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default4]:[2022-03-03 06:03:21,628] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default6]:[2022-03-03 06:03:21,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default1]:[2022-03-03 06:03:21,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default7]:[2022-03-03 06:03:21,753] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default5]:[2022-03-03 06:03:21,785] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt
[default7]:[2022-03-03 06:03:21,858] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt
[default5]:[2022-03-03 06:03:21,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default6]:[2022-03-03 06:03:21,906] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 06:03:21,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default6]:[2022-03-03 06:03:22,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt
[default6]:[2022-03-03 06:03:22,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default7]:[2022-03-03 06:03:22,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default4]:[2022-03-03 06:03:22,500] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default5]:[2022-03-03 06:03:22,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 06:03:23,066] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default0]:[2022-03-03 06:03:23,429] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default1]:[2022-03-03 06:03:23,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default3]:[2022-03-03 06:03:23,567] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default1]:[2022-03-03 06:03:23,965] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 06:03:23,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default6]:[2022-03-03 06:03:24,063] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 06:03:24,799] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default6]:[2022-03-03 06:03:26,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 06:03:26,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default7]:[2022-03-03 06:03:27,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default0]:  successfully saved checkpoint at iteration      50 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default6]:[2022-03-03 06:03:27,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default7]:time (ms) | save-checkpoint: 40699.24
[default7]: iteration       51/  128728 | consumed samples:          816 | consumed tokens:      1671168 | elapsed time per iteration (s): 55.58 | learning rate: 2.674E-07 | global batch size:    16 | lm loss: 1.196870E+01 | grad norm: 2.511 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.288 | TFLOPs: 2.20 |
[default7]: iteration       52/  128728 | consumed samples:          832 | consumed tokens:      1703936 | elapsed time per iteration (s): 14.83 | learning rate: 2.726E-07 | global batch size:    16 | lm loss: 1.192159E+01 | grad norm: 2.002 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.079 | TFLOPs: 8.26 |
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178153 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178154 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178261 closing signal SIGTERM
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160808 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178155 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182110 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225345 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178262 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194919 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225346 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178301 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202752 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160809 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206323 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178156 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182111 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194920 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225347 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178302 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202753 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199596 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178263 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160810 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206324 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199185 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194807 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178157 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182112 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194921 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225348 closing signal SIGTERM
slurmstepd: error: *** STEP 176449.0 ON jean-zay-iam01 CANCELLED AT 2022-03-03T06:04:05 ***
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199757 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178683 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198618 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178303 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205449 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203096 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219084 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38228 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181690 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40089 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79441 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202754 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198711 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205490 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204987 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178264 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151809 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197446 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197516 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197071 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197108 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197562 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198050 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160811 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206325 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199186 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194808 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178158 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182113 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194922 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225349 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199758 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178684 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198619 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178304 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205450 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203097 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219085 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38229 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181691 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40090 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79442 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202755 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198712 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205491 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204988 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178265 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151810 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197447 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205487 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58953 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197517 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197072 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197109 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197563 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198051 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202191 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204306 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206326 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199187 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194809 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178159 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194923 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225350 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199759 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178685 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198620 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178305 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205451 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203098 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219086 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38230 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181692 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40091 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79443 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202756 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205492 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178266 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151811 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205488 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58954 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197518 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197073 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197110 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197564 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198052 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202192 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204307 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206327 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199188 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194810 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178161 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194924 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225351 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199760 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178686 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198621 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178306 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205452 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181693 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40092 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79444 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202757 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199600 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205493 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204989 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58955 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197519 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197565 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198053 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160812 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202193 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204308 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206328 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199189 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194811 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194925 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225352 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199761 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178687 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198622 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178307 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205453 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40093 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79445 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202758 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199601 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205494 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178267 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151812 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197520 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197074 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198054 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160813 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202194 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204309 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206329 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199190 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194812 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194926 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199762 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178688 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198623 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178308 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205454 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203099 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219087 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38231 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181694 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40094 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79446 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202759 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199602 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205495 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178268 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197448 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205489 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197521 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197075 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197111 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198055 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160814 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202195 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204310 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206330 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199191 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194813 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182114 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199763 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178689 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198624 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205455 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219088 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181695 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40095 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79447 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198713 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199603 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205496 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204990 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205490 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58956 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197522 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197076 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197566 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198056 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160815 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204311 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199192 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194814 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199764 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178690 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198625 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205456 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219089 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181696 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79448 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205497 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205491 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197523 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197077 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197567 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198057 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202196 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204312 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182115 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203100 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219090 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38232 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181697 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40096 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204991 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197449 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205492 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58957 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197078 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197112 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197568 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202197 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204313 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209280 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203101 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219091 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198714 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204992 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58958 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197113 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191841 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197569 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202198 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244854 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209281 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205493 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58959 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209282 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209283 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209284 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209285 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204993 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209286 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197114 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209287 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58960 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38233 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205494 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151813 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191842 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244855 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204994 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197115 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197450 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197451 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203102 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197452 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203103 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197454 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191843 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176305 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198715 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191844 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151814 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176306 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198716 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191845 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38234 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198717 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191846 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198718 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226691 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226692 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191847 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226693 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226694 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226695 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202145 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226696 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226697 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226698 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192914 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38235 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192915 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197784 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151815 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201473 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191848 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197785 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192916 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202146 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201474 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182116 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201475 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244856 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202147 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201476 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182117 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192917 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201477 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176307 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197786 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192918 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201478 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192919 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202412 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202413 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201479 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176308 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192920 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244857 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151816 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201480 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197787 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244858 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176309 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192921 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203947 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244859 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202148 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202414 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176310 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197788 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203948 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202149 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176311 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244860 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197789 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197790 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244861 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197791 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176312 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203949 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202415 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202416 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203950 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202150 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202417 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203951 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202151 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202418 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203952 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202152 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188076 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203953 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202419 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203954 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188077 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188078 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188079 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188080 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188081 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188082 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188083 closing signal SIGTERM
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[default7]:> setting tensorboard ...
[default0]:using world size: 384, data-parallel-size: 8, tensor-model-parallel size: 4, pipeline-model-parallel size: 12 
[default0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:PretrainedFromHF
[default0]:accumulate and all-reduce gradients in fp32 for bfloat16 data type.
[default0]:using torch.bfloat16 for parameters ...
[default0]:------------------------ arguments ------------------------
[default0]:  abort_on_unmet_fused_kernel_constraints ......... True
[default0]:  accumulate_allreduce_grads_in_fp32 .............. True
[default0]:  adam_beta1 ...................................... 0.9
[default0]:  adam_beta2 ...................................... 0.95
[default0]:  adam_eps ........................................ 1e-08
[default0]:  adlr_autoresume ................................. False
[default0]:  adlr_autoresume_interval ........................ 1000
[default0]:  apply_query_key_layer_scaling ................... True
[default0]:  apply_residual_connection_post_layernorm ........ False
[default0]:  attention_dropout ............................... 0.1
[default0]:  attention_softmax_in_fp32 ....................... False
[default0]:  bert_binary_head ................................ True
[default0]:  bert_load ....................................... None
[default0]:  bf16 ............................................ True
[default0]:  bias_dropout_fusion ............................. True
[default0]:  bias_gelu_fusion ................................ True
[default0]:  biencoder_projection_dim ........................ 0
[default0]:  biencoder_shared_query_context_model ............ False
[default0]:  block_data_path ................................. None
[default0]:  checkpoint_activations .......................... True
[default0]:  checkpoint_in_cpu ............................... False
[default0]:  checkpoint_num_layers ........................... 1
[default0]:  clip_grad ....................................... 1.0
[default0]:  codecarbon_dir .................................. None
[default0]:  consumed_train_samples .......................... 0
[default0]:  consumed_train_tokens ........................... 0
[default0]:  consumed_valid_samples .......................... 0
[default0]:  contigious_checkpointing ........................ False
[default0]:  cpu_optimizer ................................... False
[default0]:  cpu_torch_adam .................................. False
[default0]:  curriculum_learning ............................. False
[default0]:  data_impl ....................................... mmap
[default0]:  data_parallel_size .............................. 8
[default0]:  data_path ....................................... None
[default0]:  dataloader_type ................................. single
[default0]:  DDP_impl ........................................ local
[default0]:  decoder_seq_length .............................. None
[default0]:  deepscale ....................................... False
[default0]:  deepscale_config ................................ None
[default0]:  deepspeed ....................................... True
[default0]:  deepspeed_activation_checkpointing .............. True
[default0]:  deepspeed_config ................................ ./ds_config.176547.json
[default0]:  deepspeed_mpi ................................... False
[default0]:  distribute_checkpointed_activations ............. False
[default0]:  distributed_backend ............................. nccl
[default0]:  embed_layernorm ................................. True
[default0]:  embedding_path .................................. None
[default0]:  encoder_seq_length .............................. 2048
[default0]:  eod_mask_loss ................................... False
[default0]:  eval_interval ................................... 1000
[default0]:  eval_iters ...................................... 10
[default0]:  eval_only ....................................... None
[default0]:  evidence_data_path .............................. None
[default0]:  exit_duration_in_mins ........................... 1190
[default0]:  exit_interval ................................... None
[default0]:  ffn_hidden_size ................................. 57344
[default0]:  finetune ........................................ False
[default0]:  fp16 ............................................ False
[default0]:  fp16_lm_cross_entropy ........................... False
[default0]:  fp32_residual_connection ........................ False
[default0]:  gigaflos_no_embeds .............................. 0
[default0]:  global_batch_size ............................... 2048
[default0]:  glu_activation .................................. None
[default0]:  hidden_dropout .................................. 0.1
[default0]:  hidden_size ..................................... 14336
[default0]:  hysteresis ...................................... 2
[default0]:  ict_head_size ................................... None
[default0]:  ict_load ........................................ None
[default0]:  img_dim ......................................... 224
[default0]:  indexer_batch_size .............................. 128
[default0]:  indexer_log_interval ............................ 1000
[default0]:  init_method_std ................................. 0.0048
[default0]:  init_method_xavier_uniform ...................... False
[default0]:  initial_loss_scale .............................. 4294967296
[default0]:  kill_switch_path ................................ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/kill-switch-tr11-176B-exp1
[default0]:  kv_channels ..................................... 128
[default0]:  layernorm_epsilon ............................... 1e-05
[default0]:  lazy_mpu_init ................................... None
[default0]:  load ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:  local_rank ...................................... None
[default0]:  log_batch_size_to_tensorboard ................... True
[default0]:  log_interval .................................... 1
[default0]:  log_learning_rate_to_tensorboard ................ True
[default0]:  log_level ....................................... None
[default0]:  log_level_replica ............................... None
[default0]:  log_loss_scale_to_tensorboard ................... True
[default0]:  log_num_zeros_in_grad ........................... False
[default0]:  log_params_norm ................................. False
[default0]:  log_path ........................................ None
[default0]:  log_timers_to_tensorboard ....................... True
[default0]:  log_validation_ppl_to_tensorboard ............... True
[default0]:  loss_on_targets_only ............................ False
[default0]:  loss_scale ...................................... None
[default0]:  loss_scale_window ............................... 1000
[default0]:  lr .............................................. 6e-05
[default0]:  lr_decay_iters .................................. None
[default0]:  lr_decay_samples ................................ 200000000
[default0]:  lr_decay_style .................................. cosine
[default0]:  lr_decay_tokens ................................. None
[default0]:  lr_warmup_fraction .............................. None
[default0]:  lr_warmup_iters ................................. 0
[default0]:  lr_warmup_samples ............................... 183105
[default0]:  make_vocab_size_divisible_by .................... 128
[default0]:  mask_prob ....................................... 0.15
[default0]:  masked_softmax_fusion ........................... True
[default0]:  max_position_embeddings ......................... 2048
[default0]:  memory_centric_tiled_linear ..................... False
[default0]:  merge_file ...................................... None
[default0]:  micro_batch_size ................................ 2
[default0]:  min_loss_scale .................................. 1.0
[default0]:  min_lr .......................................... 6e-06
[default0]:  mmap_warmup ..................................... False
[default0]:  no_load_optim ................................... None
[default0]:  no_load_rng ..................................... None
[default0]:  no_save_optim ................................... None
[default0]:  no_save_rng ..................................... None
[default0]:  num_attention_heads ............................. 112
[default0]:  num_channels .................................... 3
[default0]:  num_classes ..................................... 1000
[default0]:  num_layers ...................................... 70
[default0]:  num_layers_per_virtual_pipeline_stage ........... None
[default0]:  num_workers ..................................... 2
[default0]:  onnx_safe ....................................... None
[default0]:  openai_gelu ..................................... False
[default0]:  optimizer ....................................... adam
[default0]:  override_lr_scheduler ........................... False
[default0]:  pad_vocab_size_to ............................... 250880
[default0]:  params_dtype .................................... torch.bfloat16
[default0]:  partition_activations ........................... False
[default0]:  patch_dim ....................................... 16
[default0]:  pipeline_model_parallel_size .................... 12
[default0]:  position_embedding_type ......................... PositionEmbeddingType.alibi
[default0]:  pp_partition_method ............................. type:transformer|embedding
[default0]:  profile_backward ................................ False
[default0]:  query_in_block_prob ............................. 0.1
[default0]:  rampup_batch_size ............................... ['16', '16', '9_765_625']
[default0]:  rank ............................................ 0
[default0]:  remote_device ................................... none
[default0]:  reset_attention_mask ............................ False
[default0]:  reset_position_ids .............................. False
[default0]:  retriever_report_topk_accuracies ................ []
[default0]:  retriever_score_scaling ......................... False
[default0]:  retriever_seq_length ............................ 256
[default0]:  reweight_loss_based_on_position_frequency ....... False
[default0]:  sample_rate ..................................... 1.0
[default0]:  save ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:  save_interval ................................... 500
[default0]:  scatter_gather_tensors_in_pipeline .............. True
[default0]:  scattered_embeddings ............................ False
[default0]:  seed ............................................ 42
[default0]:  seq_length ...................................... 2048
[default0]:  sgd_momentum .................................... 0.9
[default0]:  short_seq_prob .................................. 0.1
[default0]:  skip_train_iteration_range ...................... None
[default0]:  split ........................................... None
[default0]:  split_transformers .............................. False
[default0]:  synchronize_each_layer .......................... False
[default0]:  tensor_model_parallel_size ...................... 4
[default0]:  tensorboard_dir ................................. /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/tr11-176B-ml-logs/tensorboard
[default0]:  tensorboard_log_interval ........................ 1
[default0]:  tensorboard_queue_size .......................... 5
[default0]:  test_weighted_split_names ....................... ['test']
[default0]:  test_weighted_split_paths ....................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  test_weighted_split_paths_path .................. None
[default0]:  test_weighted_split_splits ...................... [['0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0']]
[default0]:  test_weighted_split_weights ..................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  tile_factor ..................................... 1
[default0]:  titles_data_path ................................ None
[default0]:  tokenizer_name_or_path .......................... bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k
[default0]:  tokenizer_type .................................. PretrainedFromHF
[default0]:  train_iters ..................................... None
[default0]:  train_samples ................................... 220000000
[default0]:  train_tokens .................................... None
[default0]:  train_weighted_split_names ...................... ['train']
[default0]:  train_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  train_weighted_split_paths_path ................. None
[default0]:  train_weighted_split_splits ..................... [['0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949']]
[default0]:  train_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  use_bnb_optimizer ............................... False
[default0]:  use_checkpoint_lr_scheduler ..................... False
[default0]:  use_contiguous_buffers_in_ddp ................... True
[default0]:  use_cpu_initialization .......................... None
[default0]:  use_one_sent_docs ............................... False
[default0]:  use_pin_memory .................................. False
[default0]:  valid_weighted_split_names ...................... ['valid']
[default0]:  valid_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  valid_weighted_split_paths_path ................. None
[default0]:  valid_weighted_split_splits ..................... [['0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999']]
[default0]:  valid_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  virtual_pipeline_model_parallel_size ............ None
[default0]:  vocab_extra_ids ................................. 0
[default0]:  vocab_file ...................................... None
[default0]:  weight_decay .................................... 0.1
[default0]:  world_size ...................................... 384
[default0]:  zero_allgather_bucket_size ...................... 0.0
[default0]:  zero_contigious_gradients ....................... False
[default0]:  zero_reduce_bucket_size ......................... 0.0
[default0]:  zero_reduce_scatter ............................. False
[default0]:  zero_stage ...................................... 0
[default0]:-------------------- end of arguments ---------------------
[default0]:will use batch size rampup starting from global batch size 16 to global batch size 2048 with batch size increments 16 over 9765625 samples.
[default0]:> building PretrainedFromHF tokenizer ...
[default0]: vocab file is un-used. loading tokenizer from pre-trained model
[default0]:Offline mode: forcing local_files_only=True
[default0]:Offline mode: forcing local_files_only=True
[default0]:Can't load following files from cache: ['added_tokens_file'] and cannot check if these files are necessary for the tokenizer to operate.
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/special_tokens_map.json from cache at /gpfswork/rech/six/commun/models/b0b3428eb9bea3ef62a6e9983742117e4860f4ec1af66eebce1702b8ec7cb364.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer_config.json from cache at /gpfswork/rech/six/commun/models/31fb66a88196017b3a12c4798e55bcf8a11b312b42dd9429c83f7237c0a8a807.e683c1a11fe6388761e34fd7cddbcd77f3552cefb70e9aca4a4cc72c027c8f40
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer.json from cache at /gpfswork/rech/six/commun/models/b28b4c1d8aed4c72b765cce6a9a7ce8c5460d05a5b4ea6fa5855dff6a721d171.397b0d7316cb89fa15f0bebce2bd6c5e71e92a14e95de167940173a60253b03e
[default0]: > padded vocab (size: 250680) with 200 dummy tokens (new size: 250880)
[default0]:DeepSpeed general environment info:
[default0]:torch install path ............... ['/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch']
[default0]:torch version .................... 1.11.0+cu115
[default0]:torch cuda version ............... 11.5
[default0]:nvcc version ..................... 11.4
[default0]:deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed']
[default0]:deepspeed info ................... 0.6.0+ed26ef4, ed26ef4, olruwase/bf16-updates
[default0]:deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5
[default0]:**** Git info for Megatron: git_hash=0415583 git_branch=sync-meg-lm ****
[default0]:> initializing torch distributed ...
[default0]:> initializing tensor model parallel with size 4
[default0]:> initializing pipeline model parallel with size 12
[default0]:> setting random seeds to 42 ...
[default0]:[2022-03-03 06:05:07,541] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2760 and data parallel seed: 42
[default0]:> compiling dataset index builder ...
[default0]:make: Entering directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data'
[default0]:make: Nothing to be done for 'default'.
[default0]:make: Leaving directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data'
[default0]:>>> done with dataset index builder. Compilation time: 0.100 seconds
[default0]:> compiling and loading fused kernels ...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module scaled_upper_triang_masked_softmax_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module scaled_upper_triang_masked_softmax_cuda...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module scaled_masked_softmax_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module scaled_masked_softmax_cuda...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module fused_mix_prec_layer_norm_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module fused_mix_prec_layer_norm_cuda...
[default0]:>>> done with compiling and loading fused kernels. Compilation time: 8.454 seconds
[default0]:time to initialize megatron (seconds): 12.097
[default0]:[after megatron is initialized] datetime: 2022-03-03 06:05:16 
[default0]:building GPT model ...
[default0]:[2022-03-03 06:05:16,133] [INFO] [utils.py:828:see_memory_usage] Before Building Model
[default0]:[2022-03-03 06:05:16,134] [INFO] [utils.py:829:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[default0]:[2022-03-03 06:05:16,134] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.19 GB, percent = 8.6%
[default0]:SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
[default0]:Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=0, model=1): 1, ProcessCoord(pipe=0, data=0, model=2): 2, ProcessCoord(pipe=0, data=0, model=3): 3, ProcessCoord(pipe=0, data=1, model=0): 4, ProcessCoord(pipe=0, data=1, model=1): 5, ProcessCoord(pipe=0, data=1, model=2): 6, ProcessCoord(pipe=0, data=1, model=3): 7, ProcessCoord(pipe=0, data=2, model=0): 8, ProcessCoord(pipe=0, data=2, model=1): 9, ProcessCoord(pipe=0, data=2, model=2): 10, ProcessCoord(pipe=0, data=2, model=3): 11, ProcessCoord(pipe=0, data=3, model=0): 12, ProcessCoord(pipe=0, data=3, model=1): 13, ProcessCoord(pipe=0, data=3, model=2): 14, ProcessCoord(pipe=0, data=3, model=3): 15, ProcessCoord(pipe=0, data=4, model=0): 16, ProcessCoord(pipe=0, data=4, model=1): 17, ProcessCoord(pipe=0, data=4, model=2): 18, ProcessCoord(pipe=0, data=4, model=3): 19, ProcessCoord(pipe=0, data=5, model=0): 20, ProcessCoord(pipe=0, data=5, model=1): 21, ProcessCoord(pipe=0, data=5, model=2): 22, ProcessCoord(pipe=0, data=5, model=3): 23, ProcessCoord(pipe=0, data=6, model=0): 24, ProcessCoord(pipe=0, data=6, model=1): 25, ProcessCoord(pipe=0, data=6, model=2): 26, ProcessCoord(pipe=0, data=6, model=3): 27, ProcessCoord(pipe=0, data=7, model=0): 28, ProcessCoord(pipe=0, data=7, model=1): 29, ProcessCoord(pipe=0, data=7, model=2): 30, ProcessCoord(pipe=0, data=7, model=3): 31, ProcessCoord(pipe=1, data=0, model=0): 32, ProcessCoord(pipe=1, data=0, model=1): 33, ProcessCoord(pipe=1, data=0, model=2): 34, ProcessCoord(pipe=1, data=0, model=3): 35, ProcessCoord(pipe=1, data=1, model=0): 36, ProcessCoord(pipe=1, data=1, model=1): 37, ProcessCoord(pipe=1, data=1, model=2): 38, ProcessCoord(pipe=1, data=1, model=3): 39, ProcessCoord(pipe=1, data=2, model=0): 40, ProcessCoord(pipe=1, data=2, model=1): 41, ProcessCoord(pipe=1, data=2, model=2): 42, ProcessCoord(pipe=1, data=2, model=3): 43, ProcessCoord(pipe=1, data=3, model=0): 44, ProcessCoord(pipe=1, data=3, model=1): 45, 
ProcessCoord(pipe=1, data=3, model=2): 46, ProcessCoord(pipe=1, data=3, model=3): 47, ProcessCoord(pipe=1, data=4, model=0): 48, ProcessCoord(pipe=1, data=4, model=1): 49, ProcessCoord(pipe=1, data=4, model=2): 50, ProcessCoord(pipe=1, data=4, model=3): 51, ProcessCoord(pipe=1, data=5, model=0): 52, ProcessCoord(pipe=1, data=5, model=1): 53, ProcessCoord(pipe=1, data=5, model=2): 54, ProcessCoord(pipe=1, data=5, model=3): 55, ProcessCoord(pipe=1, data=6, model=0): 56, ProcessCoord(pipe=1, data=6, model=1): 57, ProcessCoord(pipe=1, data=6, model=2): 58, ProcessCoord(pipe=1, data=6, model=3): 59, ProcessCoord(pipe=1, data=7, model=0): 60, ProcessCoord(pipe=1, data=7, model=1): 61, ProcessCoord(pipe=1, data=7, model=2): 62, ProcessCoord(pipe=1, data=7, model=3): 63, ProcessCoord(pipe=2, data=0, model=0): 64, ProcessCoord(pipe=2, data=0, model=1): 65, ProcessCoord(pipe=2, data=0, model=2): 66, ProcessCoord(pipe=2, data=0, model=3): 67, ProcessCoord(pipe=2, data=1, model=0): 68, ProcessCoord(pipe=2, data=1, model=1): 69, ProcessCoord(pipe=2, data=1, model=2): 70, ProcessCoord(pipe=2, data=1, model=3): 71, ProcessCoord(pipe=2, data=2, model=0): 72, ProcessCoord(pipe=2, data=2, model=1): 73, ProcessCoord(pipe=2, data=2, model=2): 74, ProcessCoord(pipe=2, data=2, model=3): 75, ProcessCoord(pipe=2, data=3, model=0): 76, ProcessCoord(pipe=2, data=3, model=1): 77, ProcessCoord(pipe=2, data=3, model=2): 78, ProcessCoord(pipe=2, data=3, model=3): 79, ProcessCoord(pipe=2, data=4, model=0): 80, ProcessCoord(pipe=2, data=4, model=1): 81, ProcessCoord(pipe=2, data=4, model=2): 82, ProcessCoord(pipe=2, data=4, model=3): 83, ProcessCoord(pipe=2, data=5, model=0): 84, ProcessCoord(pipe=2, data=5, model=1): 85, ProcessCoord(pipe=2, data=5, model=2): 86, ProcessCoord(pipe=2, data=5, model=3): 87, ProcessCoord(pipe=2, data=6, model=0): 88, ProcessCoord(pipe=2, data=6, model=1): 89, ProcessCoord(pipe=2, data=6, model=2): 90, ProcessCoord(pipe=2, data=6, model=3): 91, 
ProcessCoord(pipe=2, data=7, model=0): 92, ProcessCoord(pipe=2, data=7, model=1): 93, ProcessCoord(pipe=2, data=7, model=2): 94, ProcessCoord(pipe=2, data=7, model=3): 95, ProcessCoord(pipe=3, data=0, model=0): 96, ProcessCoord(pipe=3, data=0, model=1): 97, ProcessCoord(pipe=3, data=0, model=2): 98, ProcessCoord(pipe=3, data=0, model=3): 99, ProcessCoord(pipe=3, data=1, model=0): 100, ProcessCoord(pipe=3, data=1, model=1): 101, ProcessCoord(pipe=3, data=1, model=2): 102, ProcessCoord(pipe=3, data=1, model=3): 103, ProcessCoord(pipe=3, data=2, model=0): 104, ProcessCoord(pipe=3, data=2, model=1): 105, ProcessCoord(pipe=3, data=2, model=2): 106, ProcessCoord(pipe=3, data=2, model=3): 107, ProcessCoord(pipe=3, data=3, model=0): 108, ProcessCoord(pipe=3, data=3, model=1): 109, ProcessCoord(pipe=3, data=3, model=2): 110, ProcessCoord(pipe=3, data=3, model=3): 111, ProcessCoord(pipe=3, data=4, model=0): 112, ProcessCoord(pipe=3, data=4, model=1): 113, ProcessCoord(pipe=3, data=4, model=2): 114, ProcessCoord(pipe=3, data=4, model=3): 115, ProcessCoord(pipe=3, data=5, model=0): 116, ProcessCoord(pipe=3, data=5, model=1): 117, ProcessCoord(pipe=3, data=5, model=2): 118, ProcessCoord(pipe=3, data=5, model=3): 119, ProcessCoord(pipe=3, data=6, model=0): 120, ProcessCoord(pipe=3, data=6, model=1): 121, ProcessCoord(pipe=3, data=6, model=2): 122, ProcessCoord(pipe=3, data=6, model=3): 123, ProcessCoord(pipe=3, data=7, model=0): 124, ProcessCoord(pipe=3, data=7, model=1): 125, ProcessCoord(pipe=3, data=7, model=2): 126, ProcessCoord(pipe=3, data=7, model=3): 127, ProcessCoord(pipe=4, data=0, model=0): 128, ProcessCoord(pipe=4, data=0, model=1): 129, ProcessCoord(pipe=4, data=0, model=2): 130, ProcessCoord(pipe=4, data=0, model=3): 131, ProcessCoord(pipe=4, data=1, model=0): 132, ProcessCoord(pipe=4, data=1, model=1): 133, ProcessCoord(pipe=4, data=1, model=2): 134, ProcessCoord(pipe=4, data=1, model=3): 135, ProcessCoord(pipe=4, data=2, model=0): 136, ProcessCoord(pipe=4, data=2, model=1): 137, 
ProcessCoord(pipe=4, data=2, model=2): 138, ProcessCoord(pipe=4, data=2, model=3): 139, ProcessCoord(pipe=4, data=3, model=0): 140, ProcessCoord(pipe=4, data=3, model=1): 141, ProcessCoord(pipe=4, data=3, model=2): 142, ProcessCoord(pipe=4, data=3, model=3): 143, ProcessCoord(pipe=4, data=4, model=0): 144, ProcessCoord(pipe=4, data=4, model=1): 145, ProcessCoord(pipe=4, data=4, model=2): 146, ProcessCoord(pipe=4, data=4, model=3): 147, ProcessCoord(pipe=4, data=5, model=0): 148, ProcessCoord(pipe=4, data=5, model=1): 149, ProcessCoord(pipe=4, data=5, model=2): 150, ProcessCoord(pipe=4, data=5, model=3): 151, ProcessCoord(pipe=4, data=6, model=0): 152, ProcessCoord(pipe=4, data=6, model=1): 153, ProcessCoord(pipe=4, data=6, model=2): 154, ProcessCoord(pipe=4, data=6, model=3): 155, ProcessCoord(pipe=4, data=7, model=0): 156, ProcessCoord(pipe=4, data=7, model=1): 157, ProcessCoord(pipe=4, data=7, model=2): 158, ProcessCoord(pipe=4, data=7, model=3): 159, ProcessCoord(pipe=5, data=0, model=0): 160, ProcessCoord(pipe=5, data=0, model=1): 161, ProcessCoord(pipe=5, data=0, model=2): 162, ProcessCoord(pipe=5, data=0, model=3): 163, ProcessCoord(pipe=5, data=1, model=0): 164, ProcessCoord(pipe=5, data=1, model=1): 165, ProcessCoord(pipe=5, data=1, model=2): 166, ProcessCoord(pipe=5, data=1, model=3): 167, ProcessCoord(pipe=5, data=2, model=0): 168, ProcessCoord(pipe=5, data=2, model=1): 169, ProcessCoord(pipe=5, data=2, model=2): 170, ProcessCoord(pipe=5, data=2, model=3): 171, ProcessCoord(pipe=5, data=3, model=0): 172, ProcessCoord(pipe=5, data=3, model=1): 173, ProcessCoord(pipe=5, data=3, model=2): 174, ProcessCoord(pipe=5, data=3, model=3): 175, ProcessCoord(pipe=5, data=4, model=0): 176, ProcessCoord(pipe=5, data=4, model=1): 177, ProcessCoord(pipe=5, data=4, model=2): 178, ProcessCoord(pipe=5, data=4, model=3): 179, ProcessCoord(pipe=5, data=5, model=0): 180, ProcessCoord(pipe=5, data=5, model=1): 181, ProcessCoord(pipe=5, data=5, model=2): 182, 
ProcessCoord(pipe=5, data=5, model=3): 183, ProcessCoord(pipe=5, data=6, model=0): 184, ProcessCoord(pipe=5, data=6, model=1): 185, ProcessCoord(pipe=5, data=6, model=2): 186, ProcessCoord(pipe=5, data=6, model=3): 187, ProcessCoord(pipe=5, data=7, model=0): 188, ProcessCoord(pipe=5, data=7, model=1): 189, ProcessCoord(pipe=5, data=7, model=2): 190, ProcessCoord(pipe=5, data=7, model=3): 191, ProcessCoord(pipe=6, data=0, model=0): 192, ProcessCoord(pipe=6, data=0, model=1): 193, ProcessCoord(pipe=6, data=0, model=2): 194, ProcessCoord(pipe=6, data=0, model=3): 195, ProcessCoord(pipe=6, data=1, model=0): 196, ProcessCoord(pipe=6, data=1, model=1): 197, ProcessCoord(pipe=6, data=1, model=2): 198, ProcessCoord(pipe=6, data=1, model=3): 199, ProcessCoord(pipe=6, data=2, model=0): 200, ProcessCoord(pipe=6, data=2, model=1): 201, ProcessCoord(pipe=6, data=2, model=2): 202, ProcessCoord(pipe=6, data=2, model=3): 203, ProcessCoord(pipe=6, data=3, model=0): 204, ProcessCoord(pipe=6, data=3, model=1): 205, ProcessCoord(pipe=6, data=3, model=2): 206, ProcessCoord(pipe=6, data=3, model=3): 207, ProcessCoord(pipe=6, data=4, model=0): 208, ProcessCoord(pipe=6, data=4, model=1): 209, ProcessCoord(pipe=6, data=4, model=2): 210, ProcessCoord(pipe=6, data=4, model=3): 211, ProcessCoord(pipe=6, data=5, model=0): 212, ProcessCoord(pipe=6, data=5, model=1): 213, ProcessCoord(pipe=6, data=5, model=2): 214, ProcessCoord(pipe=6, data=5, model=3): 215, ProcessCoord(pipe=6, data=6, model=0): 216, ProcessCoord(pipe=6, data=6, model=1): 217, ProcessCoord(pipe=6, data=6, model=2): 218, ProcessCoord(pipe=6, data=6, model=3): 219, ProcessCoord(pipe=6, data=7, model=0): 220, ProcessCoord(pipe=6, data=7, model=1): 221, ProcessCoord(pipe=6, data=7, model=2): 222, ProcessCoord(pipe=6, data=7, model=3): 223, ProcessCoord(pipe=7, data=0, model=0): 224, ProcessCoord(pipe=7, data=0, model=1): 225, ProcessCoord(pipe=7, data=0, model=2): 226, ProcessCoord(pipe=7, data=0, model=3): 227, 
ProcessCoord(pipe=7, data=1, model=0): 228, ProcessCoord(pipe=7, data=1, model=1): 229, ProcessCoord(pipe=7, data=1, model=2): 230, ProcessCoord(pipe=7, data=1, model=3): 231, ProcessCoord(pipe=7, data=2, model=0): 232, ProcessCoord(pipe=7, data=2, model=1): 233, ProcessCoord(pipe=7, data=2, model=2): 234, ProcessCoord(pipe=7, data=2, model=3): 235, ProcessCoord(pipe=7, data=3, model=0): 236, ProcessCoord(pipe=7, data=3, model=1): 237, ProcessCoord(pipe=7, data=3, model=2): 238, ProcessCoord(pipe=7, data=3, model=3): 239, ProcessCoord(pipe=7, data=4, model=0): 240, ProcessCoord(pipe=7, data=4, model=1): 241, ProcessCoord(pipe=7, data=4, model=2): 242, ProcessCoord(pipe=7, data=4, model=3): 243, ProcessCoord(pipe=7, data=5, model=0): 244, ProcessCoord(pipe=7, data=5, model=1): 245, ProcessCoord(pipe=7, data=5, model=2): 246, ProcessCoord(pipe=7, data=5, model=3): 247, ProcessCoord(pipe=7, data=6, model=0): 248, ProcessCoord(pipe=7, data=6, model=1): 249, ProcessCoord(pipe=7, data=6, model=2): 250, ProcessCoord(pipe=7, data=6, model=3): 251, ProcessCoord(pipe=7, data=7, model=0): 252, ProcessCoord(pipe=7, data=7, model=1): 253, ProcessCoord(pipe=7, data=7, model=2): 254, ProcessCoord(pipe=7, data=7, model=3): 255, ProcessCoord(pipe=8, data=0, model=0): 256, ProcessCoord(pipe=8, data=0, model=1): 257, ProcessCoord(pipe=8, data=0, model=2): 258, ProcessCoord(pipe=8, data=0, model=3): 259, ProcessCoord(pipe=8, data=1, model=0): 260, ProcessCoord(pipe=8, data=1, model=1): 261, ProcessCoord(pipe=8, data=1, model=2): 262, ProcessCoord(pipe=8, data=1, model=3): 263, ProcessCoord(pipe=8, data=2, model=0): 264, ProcessCoord(pipe=8, data=2, model=1): 265, ProcessCoord(pipe=8, data=2, model=2): 266, ProcessCoord(pipe=8, data=2, model=3): 267, ProcessCoord(pipe=8, data=3, model=0): 268, ProcessCoord(pipe=8, data=3, model=1): 269, ProcessCoord(pipe=8, data=3, model=2): 270, ProcessCoord(pipe=8, data=3, model=3): 271, ProcessCoord(pipe=8, data=4, model=0): 272, 
ProcessCoord(pipe=8, data=4, model=1): 273, ProcessCoord(pipe=8, data=4, model=2): 274, ProcessCoord(pipe=8, data=4, model=3): 275, ProcessCoord(pipe=8, data=5, model=0): 276, ProcessCoord(pipe=8, data=5, model=1): 277, ProcessCoord(pipe=8, data=5, model=2): 278, ProcessCoord(pipe=8, data=5, model=3): 279, ProcessCoord(pipe=8, data=6, model=0): 280, ProcessCoord(pipe=8, data=6, model=1): 281, ProcessCoord(pipe=8, data=6, model=2): 282, ProcessCoord(pipe=8, data=6, model=3): 283, ProcessCoord(pipe=8, data=7, model=0): 284, ProcessCoord(pipe=8, data=7, model=1): 285, ProcessCoord(pipe=8, data=7, model=2): 286, ProcessCoord(pipe=8, data=7, model=3): 287, ProcessCoord(pipe=9, data=0, model=0): 288, ProcessCoord(pipe=9, data=0, model=1): 289, ProcessCoord(pipe=9, data=0, model=2): 290, ProcessCoord(pipe=9, data=0, model=3): 291, ProcessCoord(pipe=9, data=1, model=0): 292, ProcessCoord(pipe=9, data=1, model=1): 293, ProcessCoord(pipe=9, data=1, model=2): 294, ProcessCoord(pipe=9, data=1, model=3): 295, ProcessCoord(pipe=9, data=2, model=0): 296, ProcessCoord(pipe=9, data=2, model=1): 297, ProcessCoord(pipe=9, data=2, model=2): 298, ProcessCoord(pipe=9, data=2, model=3): 299, ProcessCoord(pipe=9, data=3, model=0): 300, ProcessCoord(pipe=9, data=3, model=1): 301, ProcessCoord(pipe=9, data=3, model=2): 302, ProcessCoord(pipe=9, data=3, model=3): 303, ProcessCoord(pipe=9, data=4, model=0): 304, ProcessCoord(pipe=9, data=4, model=1): 305, ProcessCoord(pipe=9, data=4, model=2): 306, ProcessCoord(pipe=9, data=4, model=3): 307, ProcessCoord(pipe=9, data=5, model=0): 308, ProcessCoord(pipe=9, data=5, model=1): 309, ProcessCoord(pipe=9, data=5, model=2): 310, ProcessCoord(pipe=9, data=5, model=3): 311, ProcessCoord(pipe=9, data=6, model=0): 312, ProcessCoord(pipe=9, data=6, model=1): 313, ProcessCoord(pipe=9, data=6, model=2): 314, ProcessCoord(pipe=9, data=6, model=3): 315, ProcessCoord(pipe=9, data=7, model=0): 316, ProcessCoord(pipe=9, data=7, model=1): 317, 
ProcessCoord(pipe=9, data=7, model=2): 318, ProcessCoord(pipe=9, data=7, model=3): 319, ProcessCoord(pipe=10, data=0, model=0): 320, ProcessCoord(pipe=10, data=0, model=1): 321, ProcessCoord(pipe=10, data=0, model=2): 322, ProcessCoord(pipe=10, data=0, model=3): 323, ProcessCoord(pipe=10, data=1, model=0): 324, ProcessCoord(pipe=10, data=1, model=1): 325, ProcessCoord(pipe=10, data=1, model=2): 326, ProcessCoord(pipe=10, data=1, model=3): 327, ProcessCoord(pipe=10, data=2, model=0): 328, ProcessCoord(pipe=10, data=2, model=1): 329, ProcessCoord(pipe=10, data=2, model=2): 330, ProcessCoord(pipe=10, data=2, model=3): 331, ProcessCoord(pipe=10, data=3, model=0): 332, ProcessCoord(pipe=10, data=3, model=1): 333, ProcessCoord(pipe=10, data=3, model=2): 334, ProcessCoord(pipe=10, data=3, model=3): 335, ProcessCoord(pipe=10, data=4, model=0): 336, ProcessCoord(pipe=10, data=4, model=1): 337, ProcessCoord(pipe=10, data=4, model=2): 338, ProcessCoord(pipe=10, data=4, model=3): 339, ProcessCoord(pipe=10, data=5, model=0): 340, ProcessCoord(pipe=10, data=5, model=1): 341, ProcessCoord(pipe=10, data=5, model=2): 342, ProcessCoord(pipe=10, data=5, model=3): 343, ProcessCoord(pipe=10, data=6, model=0): 344, ProcessCoord(pipe=10, data=6, model=1): 345, ProcessCoord(pipe=10, data=6, model=2): 346, ProcessCoord(pipe=10, data=6, model=3): 347, ProcessCoord(pipe=10, data=7, model=0): 348, ProcessCoord(pipe=10, data=7, model=1): 349, ProcessCoord(pipe=10, data=7, model=2): 350, ProcessCoord(pipe=10, data=7, model=3): 351, ProcessCoord(pipe=11, data=0, model=0): 352, ProcessCoord(pipe=11, data=0, model=1): 353, ProcessCoord(pipe=11, data=0, model=2): 354, ProcessCoord(pipe=11, data=0, model=3): 355, ProcessCoord(pipe=11, data=1, model=0): 356, ProcessCoord(pipe=11, data=1, model=1): 357, ProcessCoord(pipe=11, data=1, model=2): 358, ProcessCoord(pipe=11, data=1, model=3): 359, ProcessCoord(pipe=11, data=2, model=0): 360, ProcessCoord(pipe=11, data=2, model=1): 361, 
ProcessCoord(pipe=11, data=2, model=2): 362, ProcessCoord(pipe=11, data=2, model=3): 363, ProcessCoord(pipe=11, data=3, model=0): 364, ProcessCoord(pipe=11, data=3, model=1): 365, ProcessCoord(pipe=11, data=3, model=2): 366, ProcessCoord(pipe=11, data=3, model=3): 367, ProcessCoord(pipe=11, data=4, model=0): 368, ProcessCoord(pipe=11, data=4, model=1): 369, ProcessCoord(pipe=11, data=4, model=2): 370, ProcessCoord(pipe=11, data=4, model=3): 371, ProcessCoord(pipe=11, data=5, model=0): 372, ProcessCoord(pipe=11, data=5, model=1): 373, ProcessCoord(pipe=11, data=5, model=2): 374, ProcessCoord(pipe=11, data=5, model=3): 375, ProcessCoord(pipe=11, data=6, model=0): 376, ProcessCoord(pipe=11, data=6, model=1): 377, ProcessCoord(pipe=11, data=6, model=2): 378, ProcessCoord(pipe=11, data=6, model=3): 379, ProcessCoord(pipe=11, data=7, model=0): 380, ProcessCoord(pipe=11, data=7, model=1): 381, ProcessCoord(pipe=11, data=7, model=2): 382, ProcessCoord(pipe=11, data=7, model=3): 383}
[default0]:[2022-03-03 06:05:18,115] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method type:transformer|embedding
[default0]:stage=0 layers=8
[default0]:     0: _to_float16
[default0]:     1: EmbeddingPipe
[default0]:     2: <lambda>
[default0]:     3: ParallelTransformerLayerPipe
[default0]:     4: ParallelTransformerLayerPipe
[default0]:     5: ParallelTransformerLayerPipe
[default0]:     6: ParallelTransformerLayerPipe
[default0]:     7: ParallelTransformerLayerPipe
[default0]:stage=1 layers=6
[default0]:     8: ParallelTransformerLayerPipe
[default0]:     9: ParallelTransformerLayerPipe
[default0]:    10: ParallelTransformerLayerPipe
[default0]:    11: ParallelTransformerLayerPipe
[default0]:    12: ParallelTransformerLayerPipe
[default0]:    13: ParallelTransformerLayerPipe
[default0]:stage=2 layers=6
[default0]:    14: ParallelTransformerLayerPipe
[default0]:    15: ParallelTransformerLayerPipe
[default0]:    16: ParallelTransformerLayerPipe
[default0]:    17: ParallelTransformerLayerPipe
[default0]:    18: ParallelTransformerLayerPipe
[default0]:    19: ParallelTransformerLayerPipe
[default0]:stage=3 layers=6
[default0]:    20: ParallelTransformerLayerPipe
[default0]:    21: ParallelTransformerLayerPipe
[default0]:    22: ParallelTransformerLayerPipe
[default0]:    23: ParallelTransformerLayerPipe
[default0]:    24: ParallelTransformerLayerPipe
[default0]:    25: ParallelTransformerLayerPipe
[default0]:stage=4 layers=6
[default0]:    26: ParallelTransformerLayerPipe
[default0]:    27: ParallelTransformerLayerPipe
[default0]:    28: ParallelTransformerLayerPipe
[default0]:    29: ParallelTransformerLayerPipe
[default0]:    30: ParallelTransformerLayerPipe
[default0]:    31: ParallelTransformerLayerPipe
[default0]:stage=5 layers=6
[default0]:    32: ParallelTransformerLayerPipe
[default0]:    33: ParallelTransformerLayerPipe
[default0]:    34: ParallelTransformerLayerPipe
[default0]:    35: ParallelTransformerLayerPipe
[default0]:    36: ParallelTransformerLayerPipe
[default0]:    37: ParallelTransformerLayerPipe
[default0]:stage=6 layers=6
[default0]:    38: ParallelTransformerLayerPipe
[default0]:    39: ParallelTransformerLayerPipe
[default0]:    40: ParallelTransformerLayerPipe
[default0]:    41: ParallelTransformerLayerPipe
[default0]:    42: ParallelTransformerLayerPipe
[default0]:    43: ParallelTransformerLayerPipe
[default0]:stage=7 layers=6
[default0]:    44: ParallelTransformerLayerPipe
[default0]:    45: ParallelTransformerLayerPipe
[default0]:    46: ParallelTransformerLayerPipe
[default0]:    47: ParallelTransformerLayerPipe
[default0]:    48: ParallelTransformerLayerPipe
[default0]:    49: ParallelTransformerLayerPipe
[default0]:stage=8 layers=6
[default0]:    50: ParallelTransformerLayerPipe
[default0]:    51: ParallelTransformerLayerPipe
[default0]:    52: ParallelTransformerLayerPipe
[default0]:    53: ParallelTransformerLayerPipe
[default0]:    54: ParallelTransformerLayerPipe
[default0]:    55: ParallelTransformerLayerPipe
[default0]:stage=9 layers=6
[default0]:    56: ParallelTransformerLayerPipe
[default0]:    57: ParallelTransformerLayerPipe
[default0]:    58: ParallelTransformerLayerPipe
[default0]:    59: ParallelTransformerLayerPipe
[default0]:    60: ParallelTransformerLayerPipe
[default0]:    61: ParallelTransformerLayerPipe
[default0]:stage=10 layers=6
[default0]:    62: ParallelTransformerLayerPipe
[default0]:    63: ParallelTransformerLayerPipe
[default0]:    64: ParallelTransformerLayerPipe
[default0]:    65: ParallelTransformerLayerPipe
[default0]:    66: ParallelTransformerLayerPipe
[default0]:    67: ParallelTransformerLayerPipe
[default0]:stage=11 layers=9
[default0]:    68: ParallelTransformerLayerPipe
[default0]:    69: ParallelTransformerLayerPipe
[default0]:    70: ParallelTransformerLayerPipe
[default0]:    71: ParallelTransformerLayerPipe
[default0]:    72: ParallelTransformerLayerPipe
[default0]:    73: <lambda>
[default0]:    74: MixedFusedLayerNorm
[default0]:    75: EmbeddingPipe
[default0]:    76: float16_to_fp32
[default0]:  loss: CrossEntropy
[default0]:[2022-03-03 06:05:19,292] [INFO] [utils.py:828:see_memory_usage] After Building Model
[default0]:[2022-03-03 06:05:19,293] [INFO] [utils.py:829:see_memory_usage] MA 7.43 GB         Max_MA 7.43 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-03 06:05:19,293] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.6 GB, percent = 8.7%
[default0]:setting training iterations to 128728
[default0]:> learning rate decay style: cosine
[default0]:DeepSpeed is enabled.
[default0]:[2022-03-03 06:05:19,315] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.0+ed26ef4, git-hash=ed26ef4, git-branch=olruwase/bf16-updates
[default0]:[2022-03-03 06:05:21,109] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[default0]:[2022-03-03 06:05:21,109] [INFO] [engine.py:1092:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[default0]:[2022-03-03 06:05:21,109] [INFO] [engine.py:1098:_configure_optimizer] Using client Optimizer as basic optimizer
[default0]:[2022-03-03 06:05:21,110] [INFO] [engine.py:1114:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
[default0]:[2022-03-03 06:05:21,110] [INFO] [engine.py:1328:_configure_bf16_optimizer] Creating unfused BF16 optimizer
[default0]:[2022-03-03 06:05:21,137] [INFO] [utils.py:828:see_memory_usage] begin bf16_optimizer
[default0]:[2022-03-03 06:05:21,138] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB         Max_MA 7.43 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-03 06:05:21,138] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 06:05:21,159] [INFO] [utils.py:828:see_memory_usage] before initializing group 0
[default0]:[2022-03-03 06:05:21,160] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB         Max_MA 7.42 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-03 06:05:21,160] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 06:05:21,227] [INFO] [utils.py:828:see_memory_usage] after initializing group 0
[default0]:[2022-03-03 06:05:21,228] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB         Max_MA 17.01 GB         CA 20.23 GB         Max_CA 20 GB 
[default0]:[2022-03-03 06:05:21,228] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 06:05:21,248] [INFO] [utils.py:828:see_memory_usage] before initializing group 1
[default0]:[2022-03-03 06:05:21,248] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB         Max_MA 17.01 GB         CA 20.23 GB         Max_CA 20 GB 
[default0]:[2022-03-03 06:05:21,248] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 06:05:21,290] [INFO] [utils.py:828:see_memory_usage] after initializing group 1
[default0]:[2022-03-03 06:05:21,291] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB         Max_MA 24.11 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-03 06:05:21,291] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 06:05:21,310] [INFO] [utils.py:828:see_memory_usage] before initializing group 2
[default0]:[2022-03-03 06:05:21,311] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB         Max_MA 24.11 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-03 06:05:21,311] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 06:05:21,331] [INFO] [utils.py:828:see_memory_usage] after initializing group 2
[default0]:[2022-03-03 06:05:21,332] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB         Max_MA 24.12 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-03 06:05:21,332] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 06:05:21,352] [INFO] [utils.py:828:see_memory_usage] before initialize_optimizer
[default0]:[2022-03-03 06:05:21,352] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB         Max_MA 24.12 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-03 06:05:21,352] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 06:05:21,398] [INFO] [utils.py:828:see_memory_usage] end initialize_optimizer
[default0]:[2022-03-03 06:05:21,398] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB         Max_MA 27.82 GB         CA 34.21 GB         Max_CA 34 GB 
[default0]:[2022-03-03 06:05:21,398] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 06:05:21,417] [INFO] [utils.py:828:see_memory_usage] end bf16_optimizer
[default0]:[2022-03-03 06:05:21,418] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB         Max_MA 27.82 GB         CA 34.21 GB         Max_CA 34 GB 
[default0]:[2022-03-03 06:05:21,418] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.95 GB, percent = 8.7%
[default0]:[2022-03-03 06:05:21,418] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[default0]:[2022-03-03 06:05:21,418] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[default0]:[2022-03-03 06:05:21,418] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x147227cac8b0>
[default0]:[2022-03-03 06:05:21,418] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
[default0]:[2022-03-03 06:05:21,418] [INFO] [config.py:1057:print] DeepSpeedEngine configuration:
[default0]:[2022-03-03 06:05:21,418] [INFO] [config.py:1061:print]   activation_checkpointing_config  {
[default0]:    "partition_activations": false, 
[default0]:    "contiguous_memory_optimization": false, 
[default0]:    "cpu_checkpointing": false, 
[default0]:    "number_checkpoints": null, 
[default0]:    "synchronize_checkpoint_boundary": false, 
[default0]:    "profile": false
[default0]:}
[default0]:[2022-03-03 06:05:21,418] [INFO] [config.py:1061:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[default0]:[2022-03-03 06:05:21,418] [INFO] [config.py:1061:print]   amp_enabled .................. False
[default0]:[2022-03-03 06:05:21,418] [INFO] [config.py:1061:print]   amp_params ................... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   autotuning_config ............ {
[default0]:    "enabled": false, 
[default0]:    "start_step": null, 
[default0]:    "end_step": null, 
[default0]:    "metric_path": null, 
[default0]:    "arg_mappings": null, 
[default0]:    "metric": "throughput", 
[default0]:    "model_info": null, 
[default0]:    "results_dir": null, 
[default0]:    "exps_dir": null, 
[default0]:    "overwrite": true, 
[default0]:    "fast": true, 
[default0]:    "start_profile_step": 3, 
[default0]:    "end_profile_step": 5, 
[default0]:    "tuner_type": "gridsearch", 
[default0]:    "tuner_early_stopping": 5, 
[default0]:    "tuner_num_trials": 50, 
[default0]:    "model_info_path": null, 
[default0]:    "mp_size": 1, 
[default0]:    "max_train_batch_size": null, 
[default0]:    "min_train_batch_size": 1, 
[default0]:    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
[default0]:    "min_train_micro_batch_size_per_gpu": 1, 
[default0]:    "num_tuning_micro_batch_sizes": 3
[default0]:}
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   bfloat16_enabled ............. True
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   checkpoint_tag_validation_enabled  True
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   checkpoint_tag_validation_fail  False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   communication_data_type ...... None
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   curriculum_enabled ........... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   curriculum_params ............ False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   dataloader_drop_last ......... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   disable_allgather ............ False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   dump_state ................... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   dynamic_loss_scale_args ...... None
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   eigenvalue_enabled ........... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   eigenvalue_gas_boundary_resolution  1
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   eigenvalue_layer_name ........ bert.encoder.layer
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   eigenvalue_layer_num ......... 0
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   eigenvalue_max_iter .......... 100
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   eigenvalue_stability ......... 1e-06
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   eigenvalue_tol ............... 0.01
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   eigenvalue_verbose ........... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   elasticity_enabled ........... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   flops_profiler_config ........ {
[default0]:    "enabled": false, 
[default0]:    "profile_step": 1, 
[default0]:    "module_depth": -1, 
[default0]:    "top_modules": 1, 
[default0]:    "detailed": true, 
[default0]:    "output_file": null
[default0]:}
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   fp16_enabled ................. False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   fp16_master_weights_and_gradients  False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   fp16_mixed_quantize .......... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   global_rank .................. 0
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   gradient_accumulation_steps .. 128
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   gradient_clipping ............ 1.0
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   gradient_predivide_factor .... 1.0
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   initial_dynamic_scale ........ 1
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   loss_scale ................... 1.0
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   memory_breakdown ............. False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   optimizer_legacy_fusion ...... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   optimizer_name ............... None
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   optimizer_params ............. None
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   pld_enabled .................. False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   pld_params ................... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   prescale_gradients ........... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   quantize_change_rate ......... 0.001
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   quantize_groups .............. 1
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   quantize_offset .............. 1000
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   quantize_period .............. 1000
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   quantize_rounding ............ 0
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   quantize_start_bits .......... 16
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   quantize_target_bits ......... 8
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   quantize_training_enabled .... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   quantize_type ................ 0
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   quantize_verbose ............. False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   scheduler_name ............... None
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   scheduler_params ............. None
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   sparse_attention ............. None
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   sparse_gradients_enabled ..... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   steps_per_print .............. 2000
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   tensorboard_enabled .......... False
[default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print]   tensorboard_job_name ......... DeepSpeedJobName
[default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print]   tensorboard_output_path ...... 
[default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print]   train_batch_size ............. 2048
[default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print]   train_micro_batch_size_per_gpu  2
[default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print]   use_quantizer_kernel ......... False
[default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print]   wall_clock_breakdown ......... False
[default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print]   world_size ................... 8
[default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print]   zero_allow_untested_optimizer  False
[default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print]   zero_config .................. {
[default0]:    "stage": 0, 
[default0]:    "contiguous_gradients": true, 
[default0]:    "reduce_scatter": true, 
[default0]:    "reduce_bucket_size": 5.000000e+08, 
[default0]:    "allgather_partitions": true, 
[default0]:    "allgather_bucket_size": 5.000000e+08, 
[default0]:    "overlap_comm": false, 
[default0]:    "load_from_fp32_weights": true, 
[default0]:    "elastic_checkpoint": false, 
[default0]:    "offload_param": null, 
[default0]:    "offload_optimizer": null, 
[default0]:    "sub_group_size": 1.000000e+09, 
[default0]:    "prefetch_bucket_size": 5.000000e+07, 
[default0]:    "param_persistence_threshold": 1.000000e+05, 
[default0]:    "max_live_parameters": 1.000000e+09, 
[default0]:    "max_reuse_distance": 1.000000e+09, 
[default0]:    "gather_16bit_weights_on_model_save": false, 
[default0]:    "ignore_unused_parameters": true, 
[default0]:    "round_robin_gradients": false, 
[default0]:    "legacy_stage1": false
[default0]:}
[default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print]   zero_enabled ................. False
[default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print]   zero_optimization_stage ...... 0
[default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1063:print]   json = {
[default0]:    "train_micro_batch_size_per_gpu": 2, 
[default0]:    "train_batch_size": 2.048000e+03, 
[default0]:    "gradient_clipping": 1.0, 
[default0]:    "zero_optimization": {
[default0]:        "stage": 0
[default0]:    }, 
[default0]:    "bf16": {
[default0]:        "enabled": true
[default0]:    }, 
[default0]:    "steps_per_print": 2.000000e+03, 
[default0]:    "wall_clock_breakdown": false
[default0]:}
[default0]:[2022-03-03 06:05:21,420] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=128 micro_batch_size=2
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=65 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=67 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=66 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=97 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=99 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=96 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=320 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=323 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=321 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=194 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=192 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=193 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=195 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=289 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=291 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=288 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=131 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=128 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=290 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=129 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=130 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=225 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=34 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=35 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=33 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=32 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=322 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=226 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=227 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=224 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=2 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=352 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=64 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=258 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=256 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=259 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=257 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=98 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=355 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=354 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=353 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=1 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=3 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=160 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=161 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=162 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=163 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]: > using checkpoint value 6e-05 for learning rate
[default0]: > using checkpoint value 6e-06 for minimum learning rate
[default0]: > using checkpoint value 183105 for warmup iterations
[default0]: > using checkpoint value 200000000 for total number of iterations
[default0]: > using checkpoint value cosine for decay style
[default4]:[2022-03-03 06:05:39,293] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 332
[default5]:[2022-03-03 06:05:39,708] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 181
[default0]:[2022-03-03 06:05:39,888] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 72
[default0]:[2022-03-03 06:05:39,945] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 176
[default1]:[2022-03-03 06:05:40,144] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 177
[default4]:[2022-03-03 06:05:40,155] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 332
[default7]:[2022-03-03 06:05:40,361] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 335
[default0]:[2022-03-03 06:05:40,467] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 328
[default1]:[2022-03-03 06:05:40,508] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 329
[default4]:[2022-03-03 06:05:40,621] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 180
[default0]:[2022-03-03 06:05:40,644] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 352
[default0]:[2022-03-03 06:05:40,665] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 184
[default0]:[2022-03-03 06:05:40,755] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 72
[default5]:[2022-03-03 06:05:40,689] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 181
[default2]:[2022-03-03 06:05:40,717] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 330
[default3]:[2022-03-03 06:05:40,800] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 331
[default0]:[2022-03-03 06:05:41,187] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 176
[default6]:[2022-03-03 06:05:41,255] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 78
[default1]:[2022-03-03 06:05:41,314] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 177
[default7]:[2022-03-03 06:05:41,314] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 335
[default7]:[2022-03-03 06:05:41,392] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 351
[default4]:[2022-03-03 06:05:41,446] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 284
[default0]:[2022-03-03 06:05:41,402] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 328
[default6]:[2022-03-03 06:05:41,553] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 182
[default0]:[2022-03-03 06:05:41,495] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 184
[default1]:[2022-03-03 06:05:41,501] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 329
[default0]:[2022-03-03 06:05:41,574] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 352
[default0]:[2022-03-03 06:05:41,623] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 344
[default0]:[2022-03-03 06:05:41,650] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 168
[default4]:[2022-03-03 06:05:41,697] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 180
[default4]:[2022-03-03 06:05:41,727] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 364
[default4]:[2022-03-03 06:05:41,745] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 356
[default5]:[2022-03-03 06:05:41,699] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 333
[default4]:[2022-03-03 06:05:41,794] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 36
[default6]:[2022-03-03 06:05:41,843] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 334
[default0]:[2022-03-03 06:05:41,896] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 336
[default3]:[2022-03-03 06:05:41,921] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 339
[default4]:[2022-03-03 06:05:41,891] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 188
[default2]:[2022-03-03 06:05:41,893] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 330
[default3]:[2022-03-03 06:05:41,950] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 331
[default2]:[2022-03-03 06:05:41,978] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 178
[default0]:[2022-03-03 06:05:42,142] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 80
[default6]:[2022-03-03 06:05:42,157] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 78
[default5]:[2022-03-03 06:05:42,141] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 173
[default0]:[2022-03-03 06:05:42,215] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 288
[default5]:[2022-03-03 06:05:42,228] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 77
[default3]:[2022-03-03 06:05:42,327] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 179
[default7]:[2022-03-03 06:05:42,330] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 343
[default7]:[2022-03-03 06:05:42,319] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 351
[default4]:[2022-03-03 06:05:42,305] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 284
[default6]:[2022-03-03 06:05:42,408] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 182
[default4]:[2022-03-03 06:05:42,428] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 76
[default0]:[2022-03-03 06:05:42,408] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 248
[default0]:[2022-03-03 06:05:42,383] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 120
[default2]:[2022-03-03 06:05:42,495] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 74
[default2]:[2022-03-03 06:05:42,565] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 250
[default0]:[2022-03-03 06:05:42,559] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 344
[default1]:[2022-03-03 06:05:42,504] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 33
[default4]:[2022-03-03 06:05:42,630] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 252
[default1]:[2022-03-03 06:05:42,602] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 73
[default7]:[2022-03-03 06:05:42,600] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 191
[default5]:[2022-03-03 06:05:42,640] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 349
[default1]:[2022-03-03 06:05:42,661] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 305
[default4]:[2022-03-03 06:05:42,730] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 124
[default4]:[2022-03-03 06:05:42,716] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 340
[default3]:[2022-03-03 06:05:42,753] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 347
[default5]:[2022-03-03 06:05:42,747] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 189
[default4]:[2022-03-03 06:05:42,728] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 308
[default4]:[2022-03-03 06:05:42,689] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 356
[default4]:[2022-03-03 06:05:42,691] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 36
[default4]:[2022-03-03 06:05:42,768] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 188
[default0]:[2022-03-03 06:05:42,716] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 168
[default5]:[2022-03-03 06:05:42,698] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 333
[default7]:[2022-03-03 06:05:42,769] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 183
[default3]:[2022-03-03 06:05:42,832] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 339
[default4]:[2022-03-03 06:05:42,798] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 364
[default6]:[2022-03-03 06:05:42,865] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 334
[default0]:[2022-03-03 06:05:42,883] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 272
[default2]:[2022-03-03 06:05:42,933] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 178
[default3]:[2022-03-03 06:05:42,901] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 251
[default0]:[2022-03-03 06:05:42,980] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 336
[default0]:[2022-03-03 06:05:43,025] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 80
[default7]:[2022-03-03 06:05:42,988] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 39
[default3]:[2022-03-03 06:05:43,014] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 35
[default4]:[2022-03-03 06:05:43,057] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 300
[default4]:[2022-03-03 06:05:43,079] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 52
[default0]:[2022-03-03 06:05:43,117] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 288
[default0]:[2022-03-03 06:05:43,166] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 280
[default4]:[2022-03-03 06:05:43,118] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 348
[default0]:[2022-03-03 06:05:43,148] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 32
[default3]:[2022-03-03 06:05:43,205] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 179
[default1]:[2022-03-03 06:05:43,186] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 345
[default7]:[2022-03-03 06:05:43,248] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 311
[default6]:[2022-03-03 06:05:43,239] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 310
[default5]:[2022-03-03 06:05:43,256] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 173
[default6]:[2022-03-03 06:05:43,250] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 190
[default4]:[2022-03-03 06:05:43,346] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 84
[default7]:[2022-03-03 06:05:43,309] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 343
[default5]:[2022-03-03 06:05:43,297] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 77
[default7]:[2022-03-03 06:05:43,343] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 79
[default6]:[2022-03-03 06:05:43,328] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 254
[default0]:[2022-03-03 06:05:43,300] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 304
[default5]:[2022-03-03 06:05:43,351] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 37
[default4]:[2022-03-03 06:05:43,441] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 76
[default2]:[2022-03-03 06:05:43,423] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 74
[default6]:[2022-03-03 06:05:43,464] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 350
[default0]:[2022-03-03 06:05:43,416] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 120
[default2]:[2022-03-03 06:05:43,516] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 282
[default7]:[2022-03-03 06:05:43,512] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 255
[default4]:[2022-03-03 06:05:43,523] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 228
[default1]:[2022-03-03 06:05:43,529] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 33
[default1]:[2022-03-03 06:05:43,549] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 305
[default3]:[2022-03-03 06:05:43,537] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 283
[default4]:[2022-03-03 06:05:43,547] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 372
[default0]:[2022-03-03 06:05:43,522] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 256
[default4]:[2022-03-03 06:05:43,597] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 276
[default7]:[2022-03-03 06:05:43,648] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 183
[default4]:[2022-03-03 06:05:43,656] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 124
[default4]:[2022-03-03 06:05:43,647] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 292
[default2]:[2022-03-03 06:05:43,643] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 186
[default6]:[2022-03-03 06:05:43,611] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 342
[default1]:[2022-03-03 06:05:43,613] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 73
[default0]:[2022-03-03 06:05:43,634] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 192
[default7]:[2022-03-03 06:05:43,583] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 191
[default5]:[2022-03-03 06:05:43,588] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 349
[default4]:[2022-03-03 06:05:43,605] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 324
[default1]:[2022-03-03 06:05:43,639] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 169
[default2]:[2022-03-03 06:05:43,652] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 202
[default4]:[2022-03-03 06:05:43,646] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 204
[default5]:[2022-03-03 06:05:43,584] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 53
[default4]:[2022-03-03 06:05:43,680] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 340
[default3]:[2022-03-03 06:05:43,692] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 347
[default5]:[2022-03-03 06:05:43,677] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 189
[default7]:[2022-03-03 06:05:43,753] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 287
[default4]:[2022-03-03 06:05:43,687] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 308
[default0]:[2022-03-03 06:05:43,687] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 248
[default2]:[2022-03-03 06:05:43,755] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 346
[default0]:[2022-03-03 06:05:43,706] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 264
[default4]:[2022-03-03 06:05:43,721] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 268
[default4]:[2022-03-03 06:05:43,794] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 148
[default2]:[2022-03-03 06:05:43,831] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 338
[default2]:[2022-03-03 06:05:43,796] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 250
[default2]:[2022-03-03 06:05:43,787] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 306
[default0]:[2022-03-03 06:05:43,803] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 40
[default0]:[2022-03-03 06:05:43,921] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 272
[default3]:[2022-03-03 06:05:43,867] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 75
[default4]:[2022-03-03 06:05:43,868] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 252
[default7]:[2022-03-03 06:05:43,942] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 127
[default3]:[2022-03-03 06:05:43,930] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 307
[default4]:[2022-03-03 06:05:43,884] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 172
[default5]:[2022-03-03 06:05:43,935] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 309
[default4]:[2022-03-03 06:05:43,970] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 300
[default0]:[2022-03-03 06:05:43,914] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 368
[default6]:[2022-03-03 06:05:44,058] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 150
[default1]:[2022-03-03 06:05:44,046] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 249
[default1]:[2022-03-03 06:05:44,038] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 185
[default3]:[2022-03-03 06:05:43,971] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 251
[default0]:[2022-03-03 06:05:44,065] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 88
[default5]:[2022-03-03 06:05:44,157] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 277
[default1]:[2022-03-03 06:05:44,071] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 345
[default4]:[2022-03-03 06:05:44,126] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 348
[default4]:[2022-03-03 06:05:44,074] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 60
[default0]:[2022-03-03 06:05:44,081] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 224
[default7]:[2022-03-03 06:05:44,120] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 39
[default4]:[2022-03-03 06:05:44,118] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 44
[default0]:[2022-03-03 06:05:44,171] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 48
[default0]:[2022-03-03 06:05:44,186] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 360
[default5]:[2022-03-03 06:05:44,184] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 253
[default7]:[2022-03-03 06:05:44,247] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 79
[default7]:[2022-03-03 06:05:44,262] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 87
[default4]:[2022-03-03 06:05:44,180] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 196
[default0]:[2022-03-03 06:05:44,179] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 280
[default6]:[2022-03-03 06:05:44,199] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 38
[default2]:[2022-03-03 06:05:44,219] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 34
[default5]:[2022-03-03 06:05:44,242] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 205
[default5]:[2022-03-03 06:05:44,260] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 45
[default7]:[2022-03-03 06:05:44,267] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 175
[default6]:[2022-03-03 06:05:44,194] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 190
[default4]:[2022-03-03 06:05:44,185] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 52
[default6]:[2022-03-03 06:05:44,369] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 254
[default4]:[2022-03-03 06:05:44,362] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 228
[default6]:[2022-03-03 06:05:44,289] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 174
[default0]:[2022-03-03 06:05:44,365] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 200
[default0]:[2022-03-03 06:05:44,372] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 232
[default5]:[2022-03-03 06:05:44,377] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 341
[default1]:[2022-03-03 06:05:44,444] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 121
[default0]:[2022-03-03 06:05:44,409] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 304
[default0]:[2022-03-03 06:05:44,455] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 256
[default5]:[2022-03-03 06:05:44,473] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 53
[default4]:[2022-03-03 06:05:44,517] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 276
[default2]:[2022-03-03 06:05:44,526] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 122
[default7]:[2022-03-03 06:05:44,523] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 279
[default3]:[2022-03-03 06:05:44,554] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 187
[default2]:[2022-03-03 06:05:44,547] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 186
[default4]:[2022-03-03 06:05:44,548] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 84
[default6]:[2022-03-03 06:05:44,473] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 350
[default2]:[2022-03-03 06:05:44,570] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 282
[default5]:[2022-03-03 06:05:44,484] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 285
[default4]:[2022-03-03 06:05:44,508] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 324
[default0]:[2022-03-03 06:05:44,497] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 32
[default7]:[2022-03-03 06:05:44,544] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 255
[default4]:[2022-03-03 06:05:44,530] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 372
[default3]:[2022-03-03 06:05:44,625] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 275
[default4]:[2022-03-03 06:05:44,578] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 292
[default3]:[2022-03-03 06:05:44,627] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 123
[default6]:[2022-03-03 06:05:44,608] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 342
[default2]:[2022-03-03 06:05:44,651] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 338
[default0]:[2022-03-03 06:05:44,652] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 192
[default6]:[2022-03-03 06:05:44,573] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 286
[default1]:[2022-03-03 06:05:44,606] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 209
[default3]:[2022-03-03 06:05:44,664] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 291
[default2]:[2022-03-03 06:05:44,667] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 346
[default0]:[2022-03-03 06:05:44,647] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 264
[default6]:[2022-03-03 06:05:44,595] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 310
[default3]:[2022-03-03 06:05:44,583] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 283
[default4]:[2022-03-03 06:05:44,624] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 268
[default0]:[2022-03-03 06:05:44,665] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 216
[default2]:[2022-03-03 06:05:44,582] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 202
[default4]:[2022-03-03 06:05:44,620] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 156
[default0]:[2022-03-03 06:05:44,623] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 40
[default6]:[2022-03-03 06:05:44,627] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 206
[default5]:[2022-03-03 06:05:44,700] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 149
[default1]:[2022-03-03 06:05:44,761] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 337
[default4]:[2022-03-03 06:05:44,721] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 236
[default4]:[2022-03-03 06:05:44,728] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 148
[default5]:[2022-03-03 06:05:44,690] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 125
[default1]:[2022-03-03 06:05:44,683] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 281
[default7]:[2022-03-03 06:05:44,678] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 311
[default3]:[2022-03-03 06:05:44,686] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 35
[default4]:[2022-03-03 06:05:44,755] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 380
[default1]:[2022-03-03 06:05:44,717] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 169
[default1]:[2022-03-03 06:05:44,690] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 201
[default4]:[2022-03-03 06:05:44,711] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 204
[default0]:[2022-03-03 06:05:44,748] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 296
[default3]:[2022-03-03 06:05:44,761] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 211
[default3]:[2022-03-03 06:05:44,742] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 171
[default0]:[2022-03-03 06:05:44,830] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 144
[default3]:[2022-03-03 06:05:44,799] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 75
[default6]:[2022-03-03 06:05:44,820] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 86
[default3]:[2022-03-03 06:05:44,811] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 235
[default0]:[2022-03-03 06:05:44,797] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 320
[default3]:[2022-03-03 06:05:44,791] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 83
[default7]:[2022-03-03 06:05:44,776] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 287
[default0]:[2022-03-03 06:05:44,826] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 56
[default6]:[2022-03-03 06:05:44,776] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 62
[default3]:[2022-03-03 06:05:44,826] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 203
[default2]:[2022-03-03 06:05:44,808] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 170
[default2]:[2022-03-03 06:05:44,783] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 298
[default7]:[2022-03-03 06:05:44,815] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 207
[default6]:[2022-03-03 06:05:44,791] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 54
[default2]:[2022-03-03 06:05:44,904] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 274
[default2]:[2022-03-03 06:05:44,914] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 146
[default1]:[2022-03-03 06:05:44,959] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 81
[default6]:[2022-03-03 06:05:44,871] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 126
[default5]:[2022-03-03 06:05:44,907] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 85
[default6]:[2022-03-03 06:05:44,887] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 198
[default6]:[2022-03-03 06:05:44,970] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 30
[default0]:[2022-03-03 06:05:44,963] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 224
[default3]:[2022-03-03 06:05:44,893] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 227
[default1]:[2022-03-03 06:05:44,972] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 353
[default2]:[2022-03-03 06:05:44,941] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 306
[default5]:[2022-03-03 06:05:44,936] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 381
[default5]:[2022-03-03 06:05:44,956] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 309
[default7]:[2022-03-03 06:05:44,948] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 239
[default1]:[2022-03-03 06:05:44,890] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 161
[default0]:[2022-03-03 06:05:44,984] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 8
[default1]:[2022-03-03 06:05:44,989] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 273
[default2]:[2022-03-03 06:05:45,023] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 82
[default2]:[2022-03-03 06:05:44,998] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 234
[default1]:[2022-03-03 06:05:44,979] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 185
[default4]:[2022-03-03 06:05:45,032] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 196
[default6]:[2022-03-03 06:05:44,992] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 214
[default4]:[2022-03-03 06:05:44,980] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 92
[default0]:[2022-03-03 06:05:45,058] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 88
[default7]:[2022-03-03 06:05:45,006] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 295
[default2]:[2022-03-03 06:05:45,034] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 58
[default4]:[2022-03-03 06:05:45,051] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 60
[default2]:[2022-03-03 06:05:44,986] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 322
[default4]:[2022-03-03 06:05:45,029] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 172
[default3]:[2022-03-03 06:05:45,067] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 259
[default3]:[2022-03-03 06:05:45,046] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 195
[default7]:[2022-03-03 06:05:45,068] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 215
[default0]:[2022-03-03 06:05:45,005] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 368
[default0]:[2022-03-03 06:05:45,052] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 160
[default6]:[2022-03-03 06:05:45,072] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 150
[default6]:[2022-03-03 06:05:45,108] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 278
[default7]:[2022-03-03 06:05:45,074] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 87
[default5]:[2022-03-03 06:05:45,137] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 61
[default6]:[2022-03-03 06:05:45,135] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 94
[default5]:[2022-03-03 06:05:45,094] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 269
[default5]:[2022-03-03 06:05:45,077] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 37
[default5]:[2022-03-03 06:05:45,118] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 261
[default4]:[2022-03-03 06:05:45,108] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 260
[default4]:[2022-03-03 06:05:45,108] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 212
[default1]:[2022-03-03 06:05:45,107] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 297
[default6]:[2022-03-03 06:05:45,108] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 302
[default1]:[2022-03-03 06:05:45,094] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 49
[default0]:[2022-03-03 06:05:45,183] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 48
[default4]:[2022-03-03 06:05:45,235] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 68
[default5]:[2022-03-03 06:05:45,245] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 277
[default6]:[2022-03-03 06:05:45,169] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 238
[default7]:[2022-03-03 06:05:45,175] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 127
[default0]:[2022-03-03 06:05:45,257] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 360
[default5]:[2022-03-03 06:05:45,211] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 253
[default3]:[2022-03-03 06:05:45,178] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 307
[default7]:[2022-03-03 06:05:45,219] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 271
[default3]:[2022-03-03 06:05:45,178] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 267
[default4]:[2022-03-03 06:05:45,191] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 220
[default4]:[2022-03-03 06:05:45,196] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 44
[default1]:[2022-03-03 06:05:45,329] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 249
[default5]:[2022-03-03 06:05:45,276] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 341
[default0]:[2022-03-03 06:05:45,350] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 24
[default7]:[2022-03-03 06:05:45,336] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 383
[default1]:[2022-03-03 06:05:45,340] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 257
[default0]:[2022-03-03 06:05:45,308] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 200
[default5]:[2022-03-03 06:05:45,343] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 45
[default7]:[2022-03-03 06:05:45,376] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 167
[default4]:[2022-03-03 06:05:45,371] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 164
[default4]:[2022-03-03 06:05:45,388] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 12
[default3]:[2022-03-03 06:05:45,388] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 147
[default0]:[2022-03-03 06:05:45,383] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 232
[default1]:[2022-03-03 06:05:45,425] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 233
[default7]:[2022-03-03 06:05:45,450] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 279
[default3]:[2022-03-03 06:05:45,385] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 187
[default7]:[2022-03-03 06:05:45,397] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 95
[default3]:[2022-03-03 06:05:45,437] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 59
[default1]:[2022-03-03 06:05:45,416] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 57
[default5]:[2022-03-03 06:05:45,402] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 205
[default6]:[2022-03-03 06:05:45,398] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 230
[default4]:[2022-03-03 06:05:45,470] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 156
[default7]:[2022-03-03 06:05:45,451] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 175
[default7]:[2022-03-03 06:05:45,468] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 151
[default4]:[2022-03-03 06:05:45,467] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 100
[default1]:[2022-03-03 06:05:45,568] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 289
[default5]:[2022-03-03 06:05:45,568] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 285
[default1]:[2022-03-03 06:05:45,531] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 209
[default5]:[2022-03-03 06:05:45,509] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 221
[default3]:[2022-03-03 06:05:45,573] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 91
[default7]:[2022-03-03 06:05:45,480] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 63
[default1]:[2022-03-03 06:05:45,498] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 225
[default6]:[2022-03-03 06:05:45,494] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 326
[default6]:[2022-03-03 06:05:45,492] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 174
[default2]:[2022-03-03 06:05:45,512] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 106
[default3]:[2022-03-03 06:05:45,493] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 107
[default6]:[2022-03-03 06:05:45,518] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 270
[default7]:[2022-03-03 06:05:45,535] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 263
[default5]:[2022-03-03 06:05:45,487] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 213
[default2]:[2022-03-03 06:05:45,516] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 50
[default3]:[2022-03-03 06:05:45,651] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 275
[default1]:[2022-03-03 06:05:45,614] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 145
[default1]:[2022-03-03 06:05:45,601] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 337
[default2]:[2022-03-03 06:05:45,591] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 194
[default3]:[2022-03-03 06:05:45,598] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 291
[default5]:[2022-03-03 06:05:45,616] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 293
[default4]:[2022-03-03 06:05:45,610] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 116
[default6]:[2022-03-03 06:05:45,672] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 358
[default1]:[2022-03-03 06:05:45,646] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 265
[default2]:[2022-03-03 06:05:45,617] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 378
[default0]:[2022-03-03 06:05:45,673] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 216
[default1]:[2022-03-03 06:05:45,643] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 105
[default6]:[2022-03-03 06:05:45,766] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 86
[default5]:[2022-03-03 06:05:45,696] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 237
[default5]:[2022-03-03 06:05:45,745] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 197
[default1]:[2022-03-03 06:05:45,682] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 281
[default6]:[2022-03-03 06:05:45,730] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 286
[default0]:[2022-03-03 06:05:45,744] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 64
[default5]:[2022-03-03 06:05:45,747] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 93
[default5]:[2022-03-03 06:05:45,760] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 365
[default2]:[2022-03-03 06:05:45,756] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 266
[default1]:[2022-03-03 06:05:45,741] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 377
[default6]:[2022-03-03 06:05:45,730] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 110
[default2]:[2022-03-03 06:05:45,733] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 258
[default6]:[2022-03-03 06:05:45,717] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 46
[default1]:[2022-03-03 06:05:45,728] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 41
[default7]:[2022-03-03 06:05:45,704] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 303
[default4]:[2022-03-03 06:05:45,756] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 20
[default6]:[2022-03-03 06:05:45,697] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 206
[default3]:[2022-03-03 06:05:45,770] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 171
[default6]:[2022-03-03 06:05:45,731] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 54
[default1]:[2022-03-03 06:05:45,836] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 81
[default1]:[2022-03-03 06:05:45,833] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 361
[default1]:[2022-03-03 06:05:45,849] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 121
[default0]:[2022-03-03 06:05:45,789] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 208
[default7]:[2022-03-03 06:05:45,803] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 199
[default3]:[2022-03-03 06:05:45,786] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 227
[default6]:[2022-03-03 06:05:45,865] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 382
[default1]:[2022-03-03 06:05:45,813] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 193
[default2]:[2022-03-03 06:05:45,824] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 170
[default6]:[2022-03-03 06:05:45,795] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 262
[default2]:[2022-03-03 06:05:45,813] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 226
[default7]:[2022-03-03 06:05:45,858] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 47
[default3]:[2022-03-03 06:05:45,792] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 299
[default2]:[2022-03-03 06:05:45,824] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 298
[default5]:[2022-03-03 06:05:45,866] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 229
[default0]:[2022-03-03 06:05:45,831] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 296
[default3]:[2022-03-03 06:05:45,849] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 211
[default3]:[2022-03-03 06:05:45,862] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 51
[default1]:[2022-03-03 06:05:45,868] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 161
[default1]:[2022-03-03 06:05:45,889] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 9
[default1]:[2022-03-03 06:05:45,954] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 321
[default6]:[2022-03-03 06:05:45,965] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 294
[default0]:[2022-03-03 06:05:45,909] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 320
[default2]:[2022-03-03 06:05:45,947] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 290
[default5]:[2022-03-03 06:05:45,943] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 85
[default3]:[2022-03-03 06:05:45,952] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 83
[default6]:[2022-03-03 06:05:45,961] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 198
[default6]:[2022-03-03 06:05:45,923] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 30
[default2]:[2022-03-03 06:05:45,909] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 210
[default2]:[2022-03-03 06:05:45,887] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 90
[default1]:[2022-03-03 06:05:45,901] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 113
[default7]:[2022-03-03 06:05:45,940] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 231
[default5]:[2022-03-03 06:05:45,917] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 325
[default2]:[2022-03-03 06:05:45,929] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 322
[default6]:[2022-03-03 06:05:45,956] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 38
[default2]:[2022-03-03 06:05:45,920] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 34
[default1]:[2022-03-03 06:05:45,928] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 1
[default3]:[2022-03-03 06:05:45,908] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 203
[default5]:[2022-03-03 06:05:45,941] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 109
[default1]:[2022-03-03 06:05:45,948] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 201
[default3]:[2022-03-03 06:05:45,956] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 43
[default2]:[2022-03-03 06:05:45,947] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 42
[default2]:[2022-03-03 06:05:45,938] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 162
[default6]:[2022-03-03 06:05:45,984] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 166
[default3]:[2022-03-03 06:05:45,969] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 163
[default6]:[2022-03-03 06:05:46,004] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 318
[default2]:[2022-03-03 06:05:45,985] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 122
[default3]:[2022-03-03 06:05:46,041] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 323
[default3]:[2022-03-03 06:05:46,003] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 123
[default7]:[2022-03-03 06:05:46,015] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 295
[default4]:[2022-03-03 06:05:46,036] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 92
[default0]:[2022-03-03 06:05:46,032] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 104
[default4]:[2022-03-03 06:05:45,996] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 108
[default7]:[2022-03-03 06:05:46,004] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 207
[default1]:[2022-03-03 06:05:46,082] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 49
[default0]:[2022-03-03 06:05:46,063] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 8
[default4]:[2022-03-03 06:05:46,115] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 316
[default3]:[2022-03-03 06:05:46,117] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 67
[default2]:[2022-03-03 06:05:46,149] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 146
[default2]:[2022-03-03 06:05:46,098] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 82
[default6]:[2022-03-03 06:05:46,071] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 126
[default0]:[2022-03-03 06:05:46,145] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 312
[default7]:[2022-03-03 06:05:46,081] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 31
[default1]:[2022-03-03 06:05:46,090] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 353
[default3]:[2022-03-03 06:05:46,091] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 379
[default3]:[2022-03-03 06:05:46,157] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 259
[default3]:[2022-03-03 06:05:46,124] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 195
[default0]:[2022-03-03 06:05:46,106] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 152
[default5]:[2022-03-03 06:05:46,105] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 301
[default3]:[2022-03-03 06:05:46,117] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 363
[default6]:[2022-03-03 06:05:46,097] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 302
[default5]:[2022-03-03 06:05:46,157] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 165
[default1]:[2022-03-03 06:05:46,166] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 273
[default2]:[2022-03-03 06:05:46,180] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 274
[default5]:[2022-03-03 06:05:46,181] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 149
[default2]:[2022-03-03 06:05:46,265] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 314
[default6]:[2022-03-03 06:05:46,231] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 214
[default6]:[2022-03-03 06:05:46,210] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 94
[default0]:[2022-03-03 06:05:46,235] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 56
[default7]:[2022-03-03 06:05:46,205] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 327
[default3]:[2022-03-03 06:05:46,257] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 267
[default2]:[2022-03-03 06:05:46,179] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 218
[default4]:[2022-03-03 06:05:46,231] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 260
[default5]:[2022-03-03 06:05:46,236] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 261
[default4]:[2022-03-03 06:05:46,232] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 132
[default1]:[2022-03-03 06:05:46,204] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 297
[default6]:[2022-03-03 06:05:46,265] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 374
[default0]:[2022-03-03 06:05:46,256] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 160
[default4]:[2022-03-03 06:05:46,314] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 68
[default6]:[2022-03-03 06:05:46,280] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 278
[default5]:[2022-03-03 06:05:46,300] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 125
[default0]:[2022-03-03 06:05:46,345] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 24
[default2]:[2022-03-03 06:05:46,349] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 114
[default5]:[2022-03-03 06:05:46,314] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 61
[default2]:[2022-03-03 06:05:46,301] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 58
[default5]:[2022-03-03 06:05:46,276] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 269
[default4]:[2022-03-03 06:05:46,319] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 380
[default6]:[2022-03-03 06:05:46,351] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 366
[default7]:[2022-03-03 06:05:46,350] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 167
[default2]:[2022-03-03 06:05:46,360] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 50
[default0]:[2022-03-03 06:05:46,446] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 144
[default3]:[2022-03-03 06:05:46,390] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 147
[default4]:[2022-03-03 06:05:46,407] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 236
[default4]:[2022-03-03 06:05:46,389] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 100
[default1]:[2022-03-03 06:05:46,372] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 25
[default7]:[2022-03-03 06:05:46,448] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 119
[default6]:[2022-03-03 06:05:46,385] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 118
[default1]:[2022-03-03 06:05:46,381] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 89
[default2]:[2022-03-03 06:05:46,444] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 362
[default2]:[2022-03-03 06:05:46,423] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 354
[default4]:[2022-03-03 06:05:46,409] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 220
[default1]:[2022-03-03 06:05:46,450] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 257
[default7]:[2022-03-03 06:05:46,413] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 111
[default7]:[2022-03-03 06:05:46,446] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 367
[default4]:[2022-03-03 06:05:46,403] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 212
[default7]:[2022-03-03 06:05:46,391] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 215
[default7]:[2022-03-03 06:05:46,429] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 55
[default7]:[2022-03-03 06:05:46,482] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 151
[default1]:[2022-03-03 06:05:46,530] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 289
[default0]:[2022-03-03 06:05:46,495] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 0
[default7]:[2022-03-03 06:05:46,549] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 103
[default1]:[2022-03-03 06:05:46,531] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 217
[default5]:[2022-03-03 06:05:46,558] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 381
[default4]:[2022-03-03 06:05:46,550] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 4
[default0]:[2022-03-03 06:05:46,573] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 376
[default6]:[2022-03-03 06:05:46,533] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 222
[default4]:[2022-03-03 06:05:46,527] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 164
[default4]:[2022-03-03 06:05:46,527] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 12
[default2]:[2022-03-03 06:05:46,640] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 66
[default1]:[2022-03-03 06:05:46,597] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 145
[default0]:[2022-03-03 06:05:46,626] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 96
[default6]:[2022-03-03 06:05:46,636] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 102
[default4]:[2022-03-03 06:05:46,582] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 116
[default3]:[2022-03-03 06:05:46,672] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 91
[default6]:[2022-03-03 06:05:46,670] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 62
[default6]:[2022-03-03 06:05:46,619] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 326
[default5]:[2022-03-03 06:05:46,599] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 357
[default7]:[2022-03-03 06:05:46,618] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 359
[default0]:[2022-03-03 06:05:46,673] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 136
[default3]:[2022-03-03 06:05:46,591] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 355
[default3]:[2022-03-03 06:05:46,602] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 107
[default6]:[2022-03-03 06:05:46,677] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 46
[default0]:[2022-03-03 06:05:46,663] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 16
[default5]:[2022-03-03 06:05:46,623] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 213
[default1]:[2022-03-03 06:05:46,685] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 65
[default6]:[2022-03-03 06:05:46,747] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 70
[default3]:[2022-03-03 06:05:46,714] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 315
[default2]:[2022-03-03 06:05:46,764] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 290
[default3]:[2022-03-03 06:05:46,765] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 27
[default0]:[2022-03-03 06:05:46,673] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 112
[default5]:[2022-03-03 06:05:46,703] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 221
[default7]:[2022-03-03 06:05:46,713] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 223
[default1]:[2022-03-03 06:05:46,755] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 57
[default5]:[2022-03-03 06:05:46,686] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 365
[default7]:[2022-03-03 06:05:46,683] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 271
[default5]:[2022-03-03 06:05:46,725] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 117
[default5]:[2022-03-03 06:05:46,761] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 29
[default3]:[2022-03-03 06:05:46,715] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 299
[default7]:[2022-03-03 06:05:46,768] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 303
[default4]:[2022-03-03 06:05:46,689] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 20
[default3]:[2022-03-03 06:05:46,736] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 51
[default6]:[2022-03-03 06:05:46,702] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 6
[default7]:[2022-03-03 06:05:46,863] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 319
[default2]:[2022-03-03 06:05:46,786] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 26
[default5]:[2022-03-03 06:05:46,805] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 69
[default2]:[2022-03-03 06:05:46,862] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 210
[default7]:[2022-03-03 06:05:46,846] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 199
[default7]:[2022-03-03 06:05:46,815] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 95
[default1]:[2022-03-03 06:05:46,857] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 225
[default2]:[2022-03-03 06:05:46,806] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 258
[default1]:[2022-03-03 06:05:46,780] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 41
[default1]:[2022-03-03 06:05:46,796] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 153
[default3]:[2022-03-03 06:05:46,806] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 219
[default1]:[2022-03-03 06:05:46,845] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 9
[default2]:[2022-03-03 06:05:46,873] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 234
[default1]:[2022-03-03 06:05:46,896] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 361
[default0]:[2022-03-03 06:05:46,940] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 240
[default4]:[2022-03-03 06:05:46,909] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 244
[default5]:[2022-03-03 06:05:46,969] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 293
[default7]:[2022-03-03 06:05:46,963] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 63
[default3]:[2022-03-03 06:05:46,948] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 115
[default6]:[2022-03-03 06:05:46,880] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 358
[default1]:[2022-03-03 06:05:46,959] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 1
[default2]:[2022-03-03 06:05:46,897] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 106
[default3]:[2022-03-03 06:05:46,960] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 43
[default7]:[2022-03-03 06:05:46,921] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 47
[default5]:[2022-03-03 06:05:46,885] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 133
[default5]:[2022-03-03 06:05:46,979] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 301
[default3]:[2022-03-03 06:05:46,975] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 3
[default7]:[2022-03-03 06:05:46,910] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 263
[default3]:[2022-03-03 06:05:46,990] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 163
[default7]:[2022-03-03 06:05:46,968] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 71
[default1]:[2022-03-03 06:05:47,044] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 313
[default5]:[2022-03-03 06:05:46,983] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 101
[default1]:[2022-03-03 06:05:47,057] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 233
[default6]:[2022-03-03 06:05:47,010] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 238
[default6]:[2022-03-03 06:05:46,999] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 294
[default4]:[2022-03-03 06:05:46,993] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 28
[default1]:[2022-03-03 06:05:46,985] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 241
[default3]:[2022-03-03 06:05:47,049] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 59
[default2]:[2022-03-03 06:05:46,980] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 90
[default5]:[2022-03-03 06:05:46,983] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 5
[default1]:[2022-03-03 06:05:47,062] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 265
[default0]:[2022-03-03 06:05:47,032] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 104
[default2]:[2022-03-03 06:05:47,057] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 226
[default6]:[2022-03-03 06:05:46,990] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 262
[default6]:[2022-03-03 06:05:47,053] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 110
[default1]:[2022-03-03 06:05:46,983] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 105
[default2]:[2022-03-03 06:05:47,055] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 42
[default2]:[2022-03-03 06:05:47,049] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 370
[default5]:[2022-03-03 06:05:47,058] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 13
[default2]:[2022-03-03 06:05:47,065] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 10
[default0]:[2022-03-03 06:05:47,166] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 312
[default1]:[2022-03-03 06:05:47,120] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 321
[default2]:[2022-03-03 06:05:47,117] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 194
[default5]:[2022-03-03 06:05:47,170] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 93
[default1]:[2022-03-03 06:05:47,124] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 113
[default5]:[2022-03-03 06:05:47,089] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 325
[default2]:[2022-03-03 06:05:47,162] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 266
[default2]:[2022-03-03 06:05:47,096] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 218
[default6]:[2022-03-03 06:05:47,145] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 230
[default6]:[2022-03-03 06:05:47,139] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 270
[default0]:[2022-03-03 06:05:47,111] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 152
[default4]:[2022-03-03 06:05:47,126] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 132
[default3]:[2022-03-03 06:05:47,122] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 363
[default7]:[2022-03-03 06:05:47,129] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 375
[default7]:[2022-03-03 06:05:47,170] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 239
[default6]:[2022-03-03 06:05:47,178] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 318
[default3]:[2022-03-03 06:05:47,226] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 235
[default5]:[2022-03-03 06:05:47,213] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 197
[default5]:[2022-03-03 06:05:47,263] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 245
[default6]:[2022-03-03 06:05:47,268] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 246
[default7]:[2022-03-03 06:05:47,175] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 31
[default0]:[2022-03-03 06:05:47,191] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 208
[default2]:[2022-03-03 06:05:47,195] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 2
[default0]:[2022-03-03 06:05:47,179] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 64
[default7]:[2022-03-03 06:05:47,268] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 327
[default7]:[2022-03-03 06:05:47,196] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 383
[default1]:[2022-03-03 06:05:47,216] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 193
[default5]:[2022-03-03 06:05:47,210] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 157
[default5]:[2022-03-03 06:05:47,208] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 229
[default3]:[2022-03-03 06:05:47,270] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 371
[default1]:[2022-03-03 06:05:47,201] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 369
[default5]:[2022-03-03 06:05:47,233] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 165
[default2]:[2022-03-03 06:05:47,195] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 162
[default6]:[2022-03-03 06:05:47,242] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 166
[default7]:[2022-03-03 06:05:47,275] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 55
[default7]:[2022-03-03 06:05:47,277] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 15
[default3]:[2022-03-03 06:05:47,323] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 67
[default5]:[2022-03-03 06:05:47,358] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 317
[default3]:[2022-03-03 06:05:47,277] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 323
[default2]:[2022-03-03 06:05:47,355] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 242
[default2]:[2022-03-03 06:05:47,282] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 114
[default2]:[2022-03-03 06:05:47,288] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 378
[default1]:[2022-03-03 06:05:47,362] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 377
[default5]:[2022-03-03 06:05:47,376] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 373
[default7]:[2022-03-03 06:05:47,345] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 7
[default3]:[2022-03-03 06:05:47,302] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 11
[default6]:[2022-03-03 06:05:47,331] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 14
[default5]:[2022-03-03 06:05:47,382] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 237
[default2]:[2022-03-03 06:05:47,382] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 314
[default3]:[2022-03-03 06:05:47,441] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 99
[default1]:[2022-03-03 06:05:47,455] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 25
[default2]:[2022-03-03 06:05:47,440] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 98
[default7]:[2022-03-03 06:05:47,468] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 231
[default3]:[2022-03-03 06:05:47,442] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 139
[default0]:[2022-03-03 06:05:47,464] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 128
[default1]:[2022-03-03 06:05:47,489] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 89
[default4]:[2022-03-03 06:05:47,542] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 140
[default5]:[2022-03-03 06:05:47,481] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 109
[default3]:[2022-03-03 06:05:47,495] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 131
[default6]:[2022-03-03 06:05:47,486] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 374
[default1]:[2022-03-03 06:05:47,652] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 97
[default6]:[2022-03-03 06:05:47,663] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 118
[default2]:[2022-03-03 06:05:47,622] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 362
[default3]:[2022-03-03 06:05:47,642] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 379
[default2]:[2022-03-03 06:05:47,630] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 354
[default6]:[2022-03-03 06:05:47,617] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 382
[default5]:[2022-03-03 06:05:47,632] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 117
[default4]:[2022-03-03 06:05:47,594] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 108
[default7]:[2022-03-03 06:05:47,645] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 111
[default1]:[2022-03-03 06:05:47,596] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 129
[default0]:[2022-03-03 06:05:47,624] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 16
[default3]:[2022-03-03 06:05:47,670] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 243
[default7]:[2022-03-03 06:05:47,672] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 135
[default4]:[2022-03-03 06:05:47,755] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 316
[default7]:[2022-03-03 06:05:47,769] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 119
[default7]:[2022-03-03 06:05:47,706] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 367
[default6]:[2022-03-03 06:05:47,743] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 22
[default2]:[2022-03-03 06:05:47,683] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 130
[default7]:[2022-03-03 06:05:47,691] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 143
[default6]:[2022-03-03 06:05:47,686] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 366
[default6]:[2022-03-03 06:05:47,692] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 142
[default0]:[2022-03-03 06:05:47,801] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 112
[default5]:[2022-03-03 06:05:47,842] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 141
[default3]:[2022-03-03 06:05:47,856] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 115
[default3]:[2022-03-03 06:05:47,824] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 355
[default5]:[2022-03-03 06:05:47,876] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 21
[default2]:[2022-03-03 06:05:47,868] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 154
[default3]:[2022-03-03 06:05:47,860] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 155
[default6]:[2022-03-03 06:05:47,830] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 222
[default6]:[2022-03-03 06:05:47,850] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 134
[default1]:[2022-03-03 06:05:47,925] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 65
[default3]:[2022-03-03 06:05:47,965] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 315
[default1]:[2022-03-03 06:05:47,943] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 313
[default5]:[2022-03-03 06:05:47,951] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 101
[default3]:[2022-03-03 06:05:47,913] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 27
[default2]:[2022-03-03 06:05:47,912] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 138
[default0]:[2022-03-03 06:05:47,929] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 0
[default0]: checkpoint version 3.0
[default0]:[2022-03-03 06:05:47,945] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 376
[default2]:[2022-03-03 06:05:47,892] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 18
[default3]:[2022-03-03 06:05:47,970] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 219
[default1]:[2022-03-03 06:05:47,916] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 153
[default6]:[2022-03-03 06:05:47,958] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 6
[default2]:[2022-03-03 06:05:47,971] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 66
[default7]:[2022-03-03 06:05:48,009] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 247
[default7]:[2022-03-03 06:05:48,051] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 223
[default7]:[2022-03-03 06:05:48,055] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 359
[default5]:[2022-03-03 06:05:48,030] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 357
[default0]:[2022-03-03 06:05:48,003] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 136
[default5]:[2022-03-03 06:05:48,015] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 133
[default3]:[2022-03-03 06:05:48,078] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 3
[default2]:[2022-03-03 06:05:48,143] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 26
[default7]:[2022-03-03 06:05:48,111] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 103
[default6]:[2022-03-03 06:05:48,147] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 102
[default1]:[2022-03-03 06:05:48,130] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 217
[default1]:[2022-03-03 06:05:48,094] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 137
[default5]:[2022-03-03 06:05:48,180] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 29
[default7]:[2022-03-03 06:05:48,168] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 159
[default7]:[2022-03-03 06:05:48,138] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 23
[default6]:[2022-03-03 06:05:48,131] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 158
[default0]:[2022-03-03 06:05:48,237] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 96
[default7]:[2022-03-03 06:05:48,183] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 319
[default5]:[2022-03-03 06:05:48,272] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 69
[default4]:[2022-03-03 06:05:48,238] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 244
[default3]:[2022-03-03 06:05:48,226] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 19
[default1]:[2022-03-03 06:05:48,206] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 369
[default2]:[2022-03-03 06:05:48,243] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 10
[default5]:[2022-03-03 06:05:48,318] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 317
[default0]:[2022-03-03 06:05:48,367] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 240
[default1]:[2022-03-03 06:05:48,311] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 17
[default7]:[2022-03-03 06:05:48,447] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 71
[default6]:[2022-03-03 06:05:48,413] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 70
[default4]:[2022-03-03 06:05:48,452] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 28
[default6]:[2022-03-03 06:05:48,470] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 246
[default2]:[2022-03-03 06:05:48,410] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 98
[default1]:[2022-03-03 06:05:48,429] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 241
[default5]:[2022-03-03 06:05:48,414] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 157
[default4]:[2022-03-03 06:05:48,553] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 4
[default3]:[2022-03-03 06:05:48,522] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 139
[default0]:[2022-03-03 06:05:48,510] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 128
[default2]:[2022-03-03 06:05:48,536] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 370
[default3]:[2022-03-03 06:05:48,560] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 371
[default7]:[2022-03-03 06:05:48,510] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 375
[default3]:[2022-03-03 06:05:48,527] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 11
[default5]:[2022-03-03 06:05:48,529] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 13
[default3]:[2022-03-03 06:05:48,656] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 99
[default2]:[2022-03-03 06:05:48,611] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 242
[default5]:[2022-03-03 06:05:48,597] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 245
[default7]:[2022-03-03 06:05:48,673] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 135
[default7]:[2022-03-03 06:05:48,599] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 7
[default5]:[2022-03-03 06:05:48,712] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 5
[default3]:[2022-03-03 06:05:48,733] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 131
[default5]:[2022-03-03 06:05:48,739] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 373
[default7]:[2022-03-03 06:05:48,764] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 15
[default2]:[2022-03-03 06:05:48,788] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 2
[default1]:[2022-03-03 06:05:48,849] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 129
[default7]:[2022-03-03 06:05:48,816] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 143
[default3]:[2022-03-03 06:05:48,829] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 243
[default6]:[2022-03-03 06:05:48,827] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 142
[default6]:[2022-03-03 06:05:48,813] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 14
[default1]:[2022-03-03 06:05:48,909] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 97
[default2]:[2022-03-03 06:05:48,919] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 138
[default5]:[2022-03-03 06:05:49,038] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 141
[default7]:[2022-03-03 06:05:49,015] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 247
[default1]:[2022-03-03 06:05:49,054] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 137
[default4]:[2022-03-03 06:05:48,997] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 140
[default3]:[2022-03-03 06:05:49,069] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 155
[default6]:[2022-03-03 06:05:49,079] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 134
[default2]:[2022-03-03 06:05:49,145] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 18
[default2]:[2022-03-03 06:05:49,088] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 154
[default2]:[2022-03-03 06:05:49,128] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 130
[default3]:[2022-03-03 06:05:49,450] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 19
[default5]:[2022-03-03 06:05:49,463] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 21
[default7]:[2022-03-03 06:05:49,399] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 159
[default6]:[2022-03-03 06:05:49,433] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 158
[default6]:[2022-03-03 06:05:49,541] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 22
[default0]:  successfully loaded checkpoint from /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints at iteration 50
[default0]:estimated model parameters: 191.162474496
[default0]:estimated model parameters without embeddings: 148.003086336
[default0]:[after model, optimizer, and learning rate scheduler are built] datetime: 2022-03-03 06:05:49 
[default0]:> building train, validation, and test datasets ...
[default0]: > datasets target sizes (minimum size):
[default0]:    train:      220000000
[default0]:    validation: 2641920
[default0]:    test:       20480
[default0]:> building train, validation, and test datasets for GPT ...
[default0]: > building dataset index ...
[default7]:time (ms) | load-checkpoint: 25471.88
[default1]:[2022-03-03 06:05:49,617] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 17
[default7]:[2022-03-03 06:05:49,628] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 23
[default0]:/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/utils.py:280: UserWarning: Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings
[default0]:  warnings.warn("Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings")
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.066723 seconds
[default0]:    number of documents: 1276214
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 1211127) total of 1211127 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.052 seconds
[default0]:    total number of samples: 19333818
[default0]:    total number of epochs: 41
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.013180 seconds
[default0]:    number of documents: 2218089
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 2104966) total of 2104966 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.082 seconds
[default0]:    total number of samples: 4602461
[default0]:    total number of epochs: 22
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.015850 seconds
[default0]:    number of documents: 14716427
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 13965889) total of 13965889 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.179 seconds
[default0]:    total number of samples: 35728792
[default0]:    total number of epochs: 4
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002722 seconds
[default0]:    number of documents: 2767535
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 2626391) total of 2626391 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.084 seconds
[default0]:    total number of samples: 28139393
[default0]:    total number of epochs: 28
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.008013 seconds
[default0]:    number of documents: 786245
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 746147) total of 746147 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.124 seconds
[default0]:    total number of samples: 670404
[default0]:    total number of epochs: 22
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.023520 seconds
[default0]:    number of documents: 1748556
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 1659380) total of 1659380 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.098 seconds
[default0]:    total number of samples: 27952020
[default0]:    total number of epochs: 56
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002128 seconds
[default0]:    number of documents: 29464287
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 27961608) total of 27961608 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.159 seconds
[default0]:    total number of samples: 14638800
[default0]:    total number of epochs: 42
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.019843 seconds
[default0]:    number of documents: 38304059
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 36350552) total of 36350552 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.183 seconds
[default0]:    total number of samples: 27308815
[default0]:    total number of epochs: 46
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.013062 seconds
[default0]:    number of documents: 729667
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 692454) total of 692454 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.161 seconds
[default0]:    total number of samples: 6887421
[default0]:    total number of epochs: 22
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.028485 seconds
[default0]:    number of documents: 24265522
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 23027980) total of 23027980 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.135 seconds
[default0]:    total number of samples: 10304343
[default0]:    total number of epochs: 25
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.022085 seconds
[default0]:    number of documents: 9587455
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 9098495) total of 9098495 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.231 seconds
[default0]:    total number of samples: 28924755
[default0]:    total number of epochs: 10
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.011283 seconds
[default0]:    number of documents: 4335929
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 4114797) total of 4114797 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.084 seconds
[default0]:    total number of samples: 29929866
[default0]:    total number of epochs: 11
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002166 seconds
[default0]:    number of documents: 149731
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 142095) total of 142095 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.024 seconds
[default0]:    total number of samples: 127855
[default0]:    total number of epochs: 18
[default0]:> building indices for blendable datasets ...
[default0]: > sample ratios:
[default0]:   dataset 0, input: 0.0870676, achieved: 0.0870676
[default0]:   dataset 1, input: 0.0207314, achieved: 0.0207314
[default0]:   dataset 2, input: 0.1247, achieved: 0.1247
[default0]:   dataset 3, input: 0.124182, achieved: 0.124182
[default0]:   dataset 4, input: 0.0029046, achieved: 0.0029046
[default0]:   dataset 5, input: 0.1247, achieved: 0.1247
[default0]:   dataset 6, input: 0.0659275, achieved: 0.0659275
[default0]:   dataset 7, input: 0.120941, achieved: 0.120941
[default0]:   dataset 8, input: 0.0310665, achieved: 0.0310665
[default0]:   dataset 9, input: 0.0454631, achieved: 0.0454631
[default0]:   dataset 10, input: 0.127064, achieved: 0.127064
[default0]:   dataset 11, input: 0.1247, achieved: 0.1247
[default0]:   dataset 12, input: 0.000554406, achieved: 0.000554405
[default0]:> elapsed time for building blendable dataset indices: 4.04 (sec)
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002967 seconds
[default0]:    number of documents: 1276214
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [1211127, 1274938) total of 63811 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.018 seconds
[default0]:    total number of samples: 241146
[default0]:    total number of epochs: 18
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.076098 seconds
[default0]:    number of documents: 2218089
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [2104966, 2215871) total of 110905 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.010 seconds
[default0]:    total number of samples: 55872
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002478 seconds
[default0]:    number of documents: 14716427
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [13965889, 14701711) total of 735822 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.052 seconds
[default0]:    total number of samples: 1880535
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.007496 seconds
[default0]:    number of documents: 2767535
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [2626391, 2764767) total of 138376 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.030 seconds
[default0]:    total number of samples: 480297
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002187 seconds
[default0]:    number of documents: 786245
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [746147, 785459) total of 39312 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 8487
[default0]:    total number of epochs: 8
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002456 seconds
[default0]:    number of documents: 1748556
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [1659380, 1746807) total of 87427 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.031 seconds
[default0]:    total number of samples: 907157
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.032000 seconds
[default0]:    number of documents: 29464287
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [27961608, 29434823) total of 1473215 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.102 seconds
[default0]:    total number of samples: 186675
[default0]:    total number of epochs: 12
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.007831 seconds
[default0]:    number of documents: 38304059
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [36350552, 38265755) total of 1915203 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.120 seconds
[default0]:    total number of samples: 333733
[default0]:    total number of epochs: 13
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001766 seconds
[default0]:    number of documents: 729667
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [692454, 728937) total of 36483 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.010 seconds
[default0]:    total number of samples: 98264
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.038396 seconds
[default0]:    number of documents: 24265522
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [23027980, 24241256) total of 1213276 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.071 seconds
[default0]:    total number of samples: 129080
[default0]:    total number of epochs: 6
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.007623 seconds
[default0]:    number of documents: 9587455
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [9098495, 9577868) total of 479373 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.027 seconds
[default0]:    total number of samples: 469042
[default0]:    total number of epochs: 3
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.008754 seconds
[default0]:    number of documents: 4335929
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [4114797, 4331593) total of 216796 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.029 seconds
[default0]:    total number of samples: 398209
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.003000 seconds
[default0]:    number of documents: 149731
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [142095, 149581) total of 7486 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 1544
[default0]:    total number of epochs: 6
[default0]:> building indices for blendable datasets ...
[default0]: > sample ratios:
[default0]:   dataset 0, input: 0.0870676, achieved: 0.0870675
[default0]:   dataset 1, input: 0.0207314, achieved: 0.0207315
[default0]:   dataset 2, input: 0.1247, achieved: 0.1247
[default0]:   dataset 3, input: 0.124182, achieved: 0.124182
[default0]:   dataset 4, input: 0.0029046, achieved: 0.00290461
[default0]:   dataset 5, input: 0.1247, achieved: 0.1247
[default0]:   dataset 6, input: 0.0659275, achieved: 0.0659274
[default0]:   dataset 7, input: 0.120941, achieved: 0.120941
[default0]:   dataset 8, input: 0.0310665, achieved: 0.0310665
[default0]:   dataset 9, input: 0.0454631, achieved: 0.0454631
[default0]:   dataset 10, input: 0.127064, achieved: 0.127064
[default0]:   dataset 11, input: 0.1247, achieved: 0.1247
[default0]:   dataset 12, input: 0.000554406, achieved: 0.000554525
[default0]:> elapsed time for building blendable dataset indices: 0.09 (sec)
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002837 seconds
[default0]:    number of documents: 1276214
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [1274938, 1276214) total of 1276 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.038 seconds
[default0]:    total number of samples: 202915
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001934 seconds
[default0]:    number of documents: 2218089
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [2215871, 2218089) total of 2218 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 459
[default0]:    total number of epochs: 13
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001928 seconds
[default0]:    number of documents: 14716427
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [14701711, 14716427) total of 14716 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 37487
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.021217 seconds
[default0]:    number of documents: 2767535
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [2764767, 2767535) total of 2768 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.003 seconds
[default0]:    total number of samples: 9926
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002356 seconds
[default0]:    number of documents: 786245
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [785459, 786245) total of 786 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 79
[default0]:    total number of epochs: 4
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002487 seconds
[default0]:    number of documents: 1748556
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [1746807, 1748556) total of 1749 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.003 seconds
[default0]:    total number of samples: 34096
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002759 seconds
[default0]:    number of documents: 29464287
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [29434823, 29464287) total of 29464 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.003 seconds
[default0]:    total number of samples: 1645
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.010488 seconds
[default0]:    number of documents: 38304059
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [38265755, 38304059) total of 38304 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.006 seconds
[default0]:    total number of samples: 2778
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002211 seconds
[default0]:    number of documents: 729667
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [728937, 729667) total of 730 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 716
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001865 seconds
[default0]:    number of documents: 24265522
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [24241256, 24265522) total of 24266 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.003 seconds
[default0]:    total number of samples: 1312
[default0]:    total number of epochs: 3
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.008090 seconds
[default0]:    number of documents: 9587455
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [9577868, 9587455) total of 9587 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.003 seconds
[default0]:    total number of samples: 3324
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002499 seconds
[default0]:    number of documents: 4335929
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [4331593, 4335929) total of 4336 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.005 seconds
[default0]:    total number of samples: 3964
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.004362 seconds
[default0]:    number of documents: 149731
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [149581, 149731) total of 150 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.001 seconds
[default0]:    total number of samples: 15
[default0]:    total number of epochs: 2
[default0]:> building indices for blendable datasets ...
[default0]: > sample ratios:
[default0]:   dataset 0, input: 0.0870676, achieved: 0.0870664
[default0]:   dataset 1, input: 0.0207314, achieved: 0.020733
[default0]:   dataset 2, input: 0.1247, achieved: 0.124699
[default0]:   dataset 3, input: 0.124182, achieved: 0.12418
[default0]:   dataset 4, input: 0.0029046, achieved: 0.0029059
[default0]:   dataset 5, input: 0.1247, achieved: 0.124699
[default0]:   dataset 6, input: 0.0659275, achieved: 0.0659284
[default0]:   dataset 7, input: 0.120941, achieved: 0.12094
[default0]:   dataset 8, input: 0.0310665, achieved: 0.0310676
[default0]:   dataset 9, input: 0.0454631, achieved: 0.0454632
[default0]:   dataset 10, input: 0.127064, achieved: 0.127063
[default0]:   dataset 11, input: 0.1247, achieved: 0.124699
[default0]:   dataset 12, input: 0.000554406, achieved: 0.000555736
[default0]:> elapsed time for building blendable dataset indices: 0.01 (sec)
[default0]:> finished creating GPT datasets ...
[default1]:[001-002] 177.6021B / 177.6021B
[default2]:[002-002] 177.6021B / 177.6021B
[default3]:[003-002] 177.6021B / 177.6021B
[default0]:[000-009] 177.6021B / 177.6021B
[default0]:[000-003] 177.6021B / 177.6021B
[default0]:[000-011] 191.1639B / 148.0045B
[default0]:[000-010] 177.6021B / 177.6021B
[default3]:[003-009] 177.6021B / 177.6021B
[default2]:[002-000] 191.1625B / 148.0031B
[default0]:[000-002] 177.6021B / 177.6021B
[default3]:[003-003] 177.6021B / 177.6021B
[default1]:[001-003] 177.6021B / 177.6021B
[default2]:[002-009] 177.6021B / 177.6021B
[default1]:[001-009] 177.6021B / 177.6021B
[default1]:[001-011] 191.1639B / 148.0045B
[default0]:[000-001] 177.6021B / 177.6021B
[default3]:[003-001] 177.6021B / 177.6021B
[default2]:[002-010] 177.6021B / 177.6021B
[default1]:[001-000] 191.1625B / 148.0031B
[default3]:[003-010] 177.6021B / 177.6021B
[default1]:[001-010] 177.6021B / 177.6021B
[default0]:[000-007] 177.6021B / 177.6021B
[default1]:[001-007] 177.6021B / 177.6021B
[default3]:[003-007] 177.6021B / 177.6021B
[default0]:[after dataloaders are built] datetime: 2022-03-03 06:06:03 
[default0]:done with setup ...
[default0]:training ...
[default0]:Number of parameters: [tensor rank - pipeline rank] w/ and w/o embeddings:
[default0]:[000-000] 191.1625B / 148.0031B
[default0]:[before the start of training step] datetime: 2022-03-03 06:06:03 
[default3]:[003-008] 177.6021B / 177.6021B
[default0]:[000-006] 177.6021B / 177.6021B
[default2]:[002-004] 177.6021B / 177.6021B
[default2]:[002-007] 177.6021B / 177.6021B
[default2]:[002-003] 177.6021B / 177.6021B
[default2]:[002-001] 177.6021B / 177.6021B
[default1]:[001-001] 177.6021B / 177.6021B
[default3]:[003-004] 177.6021B / 177.6021B
[default2]:[002-011] 191.1639B / 148.0045B
[default1]:[001-008] 177.6021B / 177.6021B
[default3]:[003-011] 191.1639B / 148.0045B
[default0]:[000-008] 177.6021B / 177.6021B
[default2]:[002-008] 177.6021B / 177.6021B
[default0]:[000-004] 177.6021B / 177.6021B
[default1]:[001-006] 177.6021B / 177.6021B
[default2]:[002-006] 177.6021B / 177.6021B
[default2]:[002-005] 177.6021B / 177.6021B
[default1]:[001-005] 177.6021B / 177.6021B
[default3]:[003-005] 177.6021B / 177.6021B
[default1]:[001-004] 177.6021B / 177.6021B
[default3]:[003-006] 177.6021B / 177.6021B
[default7]:time (ms) | model-and-optimizer-setup: 33538.23 | train/valid/test-data-iterators-setup: 13018.50
[default3]:[003-000] 191.1625B / 148.0031B
[default0]:[000-005] 177.6021B / 177.6021B
[default0]:[2022-03-03 06:06:03,406] [INFO] [checkpointing.py:547:forward] Activation Checkpointing Information
[default0]:[2022-03-03 06:06:03,406] [INFO] [checkpointing.py:548:forward] ----Partition Activations False, CPU CHECKPOINTING False
[default0]:[2022-03-03 06:06:03,407] [INFO] [checkpointing.py:551:forward] ----contiguous Memory Checkpointing False with 70 total layers
[default0]:[2022-03-03 06:06:03,407] [INFO] [checkpointing.py:554:forward] ----Synchronization False
[default0]:[2022-03-03 06:06:03,407] [INFO] [checkpointing.py:555:forward] ----Profiling time in checkpointing False
[default3]:[Rank 67] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 99] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 35] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 291] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 323] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 227] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 131] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 355] (after 51 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0
[default3]:[Rank 259] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 195] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 3] (after 51 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0
[default3]:[Rank 163] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default7]: iteration       51/  128728 | consumed samples:          816 | consumed tokens:      1671168 | elapsed time per iteration (s): 39.93 | learning rate: 2.674E-07 | global batch size:    16 | lm loss: 1.196962E+01 | grad norm: 2.520 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.401 | TFLOPs: 3.07 |
[default1]:[Rank 65] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 288] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 96] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 97] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 320] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 353] (after 51 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0
[default2]:[Rank 2] (after 51 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0
[default0]:[Rank 64] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 290] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 289] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 322] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 32] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 224] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 352] (after 51 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0
[default0]:[Rank 0] (after 51 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0
[default1]:[Rank 225] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 98] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 192] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 321] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 226] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 1] (after 51 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0
[default2]:[Rank 130] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 354] (after 51 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0
[default1]:[Rank 33] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 128] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 193] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 194] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 257] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 161] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 162] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 129] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 34] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 256] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 258] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 160] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 66] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default7]: iteration       52/  128728 | consumed samples:          832 | consumed tokens:      1703936 | elapsed time per iteration (s): 14.92 | learning rate: 2.726E-07 | global batch size:    16 | lm loss: 1.192162E+01 | grad norm: 2.002 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.072 | TFLOPs: 8.21 |
[default7]: iteration       53/  128728 | consumed samples:          848 | consumed tokens:      1736704 | elapsed time per iteration (s): 15.24 | learning rate: 2.779E-07 | global batch size:    16 | lm loss: 1.202632E+01 | grad norm: 1.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration       54/  128728 | consumed samples:          864 | consumed tokens:      1769472 | elapsed time per iteration (s): 15.19 | learning rate: 2.831E-07 | global batch size:    16 | lm loss: 1.187102E+01 | grad norm: 2.143 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       55/  128728 | consumed samples:          880 | consumed tokens:      1802240 | elapsed time per iteration (s): 15.20 | learning rate: 2.884E-07 | global batch size:    16 | lm loss: 1.191143E+01 | grad norm: 2.075 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration       56/  128728 | consumed samples:          896 | consumed tokens:      1835008 | elapsed time per iteration (s): 15.21 | learning rate: 2.936E-07 | global batch size:    16 | lm loss: 1.189511E+01 | grad norm: 1.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration       57/  128728 | consumed samples:          912 | consumed tokens:      1867776 | elapsed time per iteration (s): 15.18 | learning rate: 2.988E-07 | global batch size:    16 | lm loss: 1.175074E+01 | grad norm: 2.178 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       58/  128728 | consumed samples:          928 | consumed tokens:      1900544 | elapsed time per iteration (s): 15.18 | learning rate: 3.041E-07 | global batch size:    16 | lm loss: 1.181468E+01 | grad norm: 1.838 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       59/  128728 | consumed samples:          944 | consumed tokens:      1933312 | elapsed time per iteration (s): 15.21 | learning rate: 3.093E-07 | global batch size:    16 | lm loss: 1.167815E+01 | grad norm: 1.872 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration       60/  128728 | consumed samples:          960 | consumed tokens:      1966080 | elapsed time per iteration (s): 15.20 | learning rate: 3.146E-07 | global batch size:    16 | lm loss: 1.176816E+01 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration       61/  128728 | consumed samples:          976 | consumed tokens:      1998848 | elapsed time per iteration (s): 15.20 | learning rate: 3.198E-07 | global batch size:    16 | lm loss: 1.160849E+01 | grad norm: 1.616 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration       62/  128728 | consumed samples:          992 | consumed tokens:      2031616 | elapsed time per iteration (s): 15.19 | learning rate: 3.251E-07 | global batch size:    16 | lm loss: 1.165278E+01 | grad norm: 1.415 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration       63/  128728 | consumed samples:         1008 | consumed tokens:      2064384 | elapsed time per iteration (s): 15.23 | learning rate: 3.303E-07 | global batch size:    16 | lm loss: 1.162152E+01 | grad norm: 1.387 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration       64/  128728 | consumed samples:         1024 | consumed tokens:      2097152 | elapsed time per iteration (s): 15.20 | learning rate: 3.355E-07 | global batch size:    16 | lm loss: 1.163912E+01 | grad norm: 1.323 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration       65/  128728 | consumed samples:         1040 | consumed tokens:      2129920 | elapsed time per iteration (s): 15.19 | learning rate: 3.408E-07 | global batch size:    16 | lm loss: 1.152941E+01 | grad norm: 1.425 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration       66/  128728 | consumed samples:         1056 | consumed tokens:      2162688 | elapsed time per iteration (s): 15.22 | learning rate: 3.460E-07 | global batch size:    16 | lm loss: 1.144800E+01 | grad norm: 1.255 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration       67/  128728 | consumed samples:         1072 | consumed tokens:      2195456 | elapsed time per iteration (s): 15.18 | learning rate: 3.513E-07 | global batch size:    16 | lm loss: 1.142246E+01 | grad norm: 1.453 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       68/  128728 | consumed samples:         1088 | consumed tokens:      2228224 | elapsed time per iteration (s): 15.22 | learning rate: 3.565E-07 | global batch size:    16 | lm loss: 1.147447E+01 | grad norm: 1.255 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration       69/  128728 | consumed samples:         1104 | consumed tokens:      2260992 | elapsed time per iteration (s): 15.23 | learning rate: 3.618E-07 | global batch size:    16 | lm loss: 1.132389E+01 | grad norm: 1.165 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration       70/  128728 | consumed samples:         1120 | consumed tokens:      2293760 | elapsed time per iteration (s): 15.19 | learning rate: 3.670E-07 | global batch size:    16 | lm loss: 1.135389E+01 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration       71/  128728 | consumed samples:         1136 | consumed tokens:      2326528 | elapsed time per iteration (s): 15.28 | learning rate: 3.722E-07 | global batch size:    16 | lm loss: 1.143639E+01 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration       72/  128728 | consumed samples:         1152 | consumed tokens:      2359296 | elapsed time per iteration (s): 15.19 | learning rate: 3.775E-07 | global batch size:    16 | lm loss: 1.144752E+01 | grad norm: 1.220 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration       73/  128728 | consumed samples:         1168 | consumed tokens:      2392064 | elapsed time per iteration (s): 15.17 | learning rate: 3.827E-07 | global batch size:    16 | lm loss: 1.136817E+01 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration       74/  128728 | consumed samples:         1184 | consumed tokens:      2424832 | elapsed time per iteration (s): 15.22 | learning rate: 3.880E-07 | global batch size:    16 | lm loss: 1.132335E+01 | grad norm: 1.098 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration       75/  128728 | consumed samples:         1200 | consumed tokens:      2457600 | elapsed time per iteration (s): 15.16 | learning rate: 3.932E-07 | global batch size:    16 | lm loss: 1.124673E+01 | grad norm: 1.096 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration       76/  128728 | consumed samples:         1216 | consumed tokens:      2490368 | elapsed time per iteration (s): 15.18 | learning rate: 3.985E-07 | global batch size:    16 | lm loss: 1.127481E+01 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       77/  128728 | consumed samples:         1232 | consumed tokens:      2523136 | elapsed time per iteration (s): 15.21 | learning rate: 4.037E-07 | global batch size:    16 | lm loss: 1.117865E+01 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration       78/  128728 | consumed samples:         1248 | consumed tokens:      2555904 | elapsed time per iteration (s): 15.18 | learning rate: 4.089E-07 | global batch size:    16 | lm loss: 1.130504E+01 | grad norm: 1.082 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       79/  128728 | consumed samples:         1264 | consumed tokens:      2588672 | elapsed time per iteration (s): 15.17 | learning rate: 4.142E-07 | global batch size:    16 | lm loss: 1.125540E+01 | grad norm: 1.030 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration       80/  128728 | consumed samples:         1280 | consumed tokens:      2621440 | elapsed time per iteration (s): 15.19 | learning rate: 4.194E-07 | global batch size:    16 | lm loss: 1.120402E+01 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       81/  128728 | consumed samples:         1296 | consumed tokens:      2654208 | elapsed time per iteration (s): 15.18 | learning rate: 4.247E-07 | global batch size:    16 | lm loss: 1.119429E+01 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       82/  128728 | consumed samples:         1312 | consumed tokens:      2686976 | elapsed time per iteration (s): 15.17 | learning rate: 4.299E-07 | global batch size:    16 | lm loss: 1.111624E+01 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration       83/  128728 | consumed samples:         1328 | consumed tokens:      2719744 | elapsed time per iteration (s): 15.16 | learning rate: 4.352E-07 | global batch size:    16 | lm loss: 1.117877E+01 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration       84/  128728 | consumed samples:         1344 | consumed tokens:      2752512 | elapsed time per iteration (s): 15.20 | learning rate: 4.404E-07 | global batch size:    16 | lm loss: 1.093013E+01 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration       85/  128728 | consumed samples:         1360 | consumed tokens:      2785280 | elapsed time per iteration (s): 15.16 | learning rate: 4.456E-07 | global batch size:    16 | lm loss: 1.098155E+01 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration       86/  128728 | consumed samples:         1376 | consumed tokens:      2818048 | elapsed time per iteration (s): 15.18 | learning rate: 4.509E-07 | global batch size:    16 | lm loss: 1.111607E+01 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       87/  128728 | consumed samples:         1392 | consumed tokens:      2850816 | elapsed time per iteration (s): 15.15 | learning rate: 4.561E-07 | global batch size:    16 | lm loss: 1.092821E+01 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration       88/  128728 | consumed samples:         1408 | consumed tokens:      2883584 | elapsed time per iteration (s): 15.17 | learning rate: 4.614E-07 | global batch size:    16 | lm loss: 1.108350E+01 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration       89/  128728 | consumed samples:         1424 | consumed tokens:      2916352 | elapsed time per iteration (s): 15.18 | learning rate: 4.666E-07 | global batch size:    16 | lm loss: 1.089155E+01 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       90/  128728 | consumed samples:         1440 | consumed tokens:      2949120 | elapsed time per iteration (s): 15.19 | learning rate: 4.719E-07 | global batch size:    16 | lm loss: 1.096077E+01 | grad norm: 0.628 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration       91/  128728 | consumed samples:         1456 | consumed tokens:      2981888 | elapsed time per iteration (s): 15.15 | learning rate: 4.771E-07 | global batch size:    16 | lm loss: 1.101388E+01 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration       92/  128728 | consumed samples:         1472 | consumed tokens:      3014656 | elapsed time per iteration (s): 15.12 | learning rate: 4.823E-07 | global batch size:    16 | lm loss: 1.093092E+01 | grad norm: 0.966 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.059 | TFLOPs: 8.10 |
[default7]: iteration       93/  128728 | consumed samples:         1488 | consumed tokens:      3047424 | elapsed time per iteration (s): 15.17 | learning rate: 4.876E-07 | global batch size:    16 | lm loss: 1.113160E+01 | grad norm: 1.102 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       94/  128728 | consumed samples:         1504 | consumed tokens:      3080192 | elapsed time per iteration (s): 15.18 | learning rate: 4.928E-07 | global batch size:    16 | lm loss: 1.098779E+01 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       95/  128728 | consumed samples:         1520 | consumed tokens:      3112960 | elapsed time per iteration (s): 15.19 | learning rate: 4.981E-07 | global batch size:    16 | lm loss: 1.095967E+01 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       96/  128728 | consumed samples:         1536 | consumed tokens:      3145728 | elapsed time per iteration (s): 15.21 | learning rate: 5.033E-07 | global batch size:    16 | lm loss: 1.094612E+01 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration       97/  128728 | consumed samples:         1552 | consumed tokens:      3178496 | elapsed time per iteration (s): 15.18 | learning rate: 5.086E-07 | global batch size:    16 | lm loss: 1.087047E+01 | grad norm: 0.860 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       98/  128728 | consumed samples:         1568 | consumed tokens:      3211264 | elapsed time per iteration (s): 15.19 | learning rate: 5.138E-07 | global batch size:    16 | lm loss: 1.096809E+01 | grad norm: 0.918 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration       99/  128728 | consumed samples:         1584 | consumed tokens:      3244032 | elapsed time per iteration (s): 15.19 | learning rate: 5.190E-07 | global batch size:    16 | lm loss: 1.106409E+01 | grad norm: 0.905 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      100/  128728 | consumed samples:         1600 | consumed tokens:      3276800 | elapsed time per iteration (s): 15.18 | learning rate: 5.243E-07 | global batch size:    16 | lm loss: 1.086620E+01 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      101/  128728 | consumed samples:         1616 | consumed tokens:      3309568 | elapsed time per iteration (s): 15.20 | learning rate: 5.295E-07 | global batch size:    16 | lm loss: 1.089338E+01 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      102/  128728 | consumed samples:         1632 | consumed tokens:      3342336 | elapsed time per iteration (s): 15.17 | learning rate: 5.348E-07 | global batch size:    16 | lm loss: 1.075126E+01 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      103/  128728 | consumed samples:         1648 | consumed tokens:      3375104 | elapsed time per iteration (s): 15.17 | learning rate: 5.400E-07 | global batch size:    16 | lm loss: 1.086857E+01 | grad norm: 0.994 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      104/  128728 | consumed samples:         1664 | consumed tokens:      3407872 | elapsed time per iteration (s): 15.19 | learning rate: 5.453E-07 | global batch size:    16 | lm loss: 1.076913E+01 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      105/  128728 | consumed samples:         1680 | consumed tokens:      3440640 | elapsed time per iteration (s): 15.18 | learning rate: 5.505E-07 | global batch size:    16 | lm loss: 1.085386E+01 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      106/  128728 | consumed samples:         1696 | consumed tokens:      3473408 | elapsed time per iteration (s): 15.17 | learning rate: 5.557E-07 | global batch size:    16 | lm loss: 1.088430E+01 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      107/  128728 | consumed samples:         1712 | consumed tokens:      3506176 | elapsed time per iteration (s): 15.21 | learning rate: 5.610E-07 | global batch size:    16 | lm loss: 1.077884E+01 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      108/  128728 | consumed samples:         1728 | consumed tokens:      3538944 | elapsed time per iteration (s): 15.20 | learning rate: 5.662E-07 | global batch size:    16 | lm loss: 1.084765E+01 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      109/  128728 | consumed samples:         1744 | consumed tokens:      3571712 | elapsed time per iteration (s): 15.18 | learning rate: 5.715E-07 | global batch size:    16 | lm loss: 1.084685E+01 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      110/  128728 | consumed samples:         1760 | consumed tokens:      3604480 | elapsed time per iteration (s): 15.19 | learning rate: 5.767E-07 | global batch size:    16 | lm loss: 1.077808E+01 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration      111/  128728 | consumed samples:         1776 | consumed tokens:      3637248 | elapsed time per iteration (s): 15.21 | learning rate: 5.820E-07 | global batch size:    16 | lm loss: 1.084661E+01 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      112/  128728 | consumed samples:         1792 | consumed tokens:      3670016 | elapsed time per iteration (s): 15.17 | learning rate: 5.872E-07 | global batch size:    16 | lm loss: 1.073598E+01 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration      113/  128728 | consumed samples:         1808 | consumed tokens:      3702784 | elapsed time per iteration (s): 15.19 | learning rate: 5.924E-07 | global batch size:    16 | lm loss: 1.073445E+01 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      114/  128728 | consumed samples:         1824 | consumed tokens:      3735552 | elapsed time per iteration (s): 15.20 | learning rate: 5.977E-07 | global batch size:    16 | lm loss: 1.084661E+01 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      115/  128728 | consumed samples:         1840 | consumed tokens:      3768320 | elapsed time per iteration (s): 15.22 | learning rate: 6.029E-07 | global batch size:    16 | lm loss: 1.072918E+01 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      116/  128728 | consumed samples:         1856 | consumed tokens:      3801088 | elapsed time per iteration (s): 15.18 | learning rate: 6.082E-07 | global batch size:    16 | lm loss: 1.071857E+01 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      117/  128728 | consumed samples:         1872 | consumed tokens:      3833856 | elapsed time per iteration (s): 15.20 | learning rate: 6.134E-07 | global batch size:    16 | lm loss: 1.081528E+01 | grad norm: 0.648 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      118/  128728 | consumed samples:         1888 | consumed tokens:      3866624 | elapsed time per iteration (s): 15.21 | learning rate: 6.187E-07 | global batch size:    16 | lm loss: 1.083505E+01 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      119/  128728 | consumed samples:         1904 | consumed tokens:      3899392 | elapsed time per iteration (s): 15.12 | learning rate: 6.239E-07 | global batch size:    16 | lm loss: 1.081624E+01 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.058 | TFLOPs: 8.10 |
[default7]: iteration      120/  128728 | consumed samples:         1920 | consumed tokens:      3932160 | elapsed time per iteration (s): 15.15 | learning rate: 6.291E-07 | global batch size:    16 | lm loss: 1.068017E+01 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration      121/  128728 | consumed samples:         1936 | consumed tokens:      3964928 | elapsed time per iteration (s): 15.19 | learning rate: 6.344E-07 | global batch size:    16 | lm loss: 1.087509E+01 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration      122/  128728 | consumed samples:         1952 | consumed tokens:      3997696 | elapsed time per iteration (s): 15.20 | learning rate: 6.396E-07 | global batch size:    16 | lm loss: 1.068378E+01 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      123/  128728 | consumed samples:         1968 | consumed tokens:      4030464 | elapsed time per iteration (s): 15.20 | learning rate: 6.449E-07 | global batch size:    16 | lm loss: 1.059418E+01 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      124/  128728 | consumed samples:         1984 | consumed tokens:      4063232 | elapsed time per iteration (s): 15.20 | learning rate: 6.501E-07 | global batch size:    16 | lm loss: 1.072522E+01 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      125/  128728 | consumed samples:         2000 | consumed tokens:      4096000 | elapsed time per iteration (s): 15.16 | learning rate: 6.554E-07 | global batch size:    16 | lm loss: 1.064985E+01 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      126/  128728 | consumed samples:         2016 | consumed tokens:      4128768 | elapsed time per iteration (s): 15.21 | learning rate: 6.606E-07 | global batch size:    16 | lm loss: 1.092184E+01 | grad norm: 1.960 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      127/  128728 | consumed samples:         2032 | consumed tokens:      4161536 | elapsed time per iteration (s): 15.18 | learning rate: 6.658E-07 | global batch size:    16 | lm loss: 1.067953E+01 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      128/  128728 | consumed samples:         2048 | consumed tokens:      4194304 | elapsed time per iteration (s): 15.19 | learning rate: 6.711E-07 | global batch size:    16 | lm loss: 1.074638E+01 | grad norm: 1.256 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration      129/  128728 | consumed samples:         2064 | consumed tokens:      4227072 | elapsed time per iteration (s): 15.18 | learning rate: 6.763E-07 | global batch size:    16 | lm loss: 1.075598E+01 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      130/  128728 | consumed samples:         2080 | consumed tokens:      4259840 | elapsed time per iteration (s): 15.17 | learning rate: 6.816E-07 | global batch size:    16 | lm loss: 1.073375E+01 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      131/  128728 | consumed samples:         2096 | consumed tokens:      4292608 | elapsed time per iteration (s): 15.21 | learning rate: 6.868E-07 | global batch size:    16 | lm loss: 1.056206E+01 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      132/  128728 | consumed samples:         2112 | consumed tokens:      4325376 | elapsed time per iteration (s): 15.17 | learning rate: 6.921E-07 | global batch size:    16 | lm loss: 1.071002E+01 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      133/  128728 | consumed samples:         2128 | consumed tokens:      4358144 | elapsed time per iteration (s): 15.22 | learning rate: 6.973E-07 | global batch size:    16 | lm loss: 1.081503E+01 | grad norm: 0.884 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      134/  128728 | consumed samples:         2144 | consumed tokens:      4390912 | elapsed time per iteration (s): 15.16 | learning rate: 7.025E-07 | global batch size:    16 | lm loss: 1.042019E+01 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration      135/  128728 | consumed samples:         2160 | consumed tokens:      4423680 | elapsed time per iteration (s): 15.20 | learning rate: 7.078E-07 | global batch size:    16 | lm loss: 1.065207E+01 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      136/  128728 | consumed samples:         2176 | consumed tokens:      4456448 | elapsed time per iteration (s): 15.17 | learning rate: 7.130E-07 | global batch size:    16 | lm loss: 1.066140E+01 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      137/  128728 | consumed samples:         2192 | consumed tokens:      4489216 | elapsed time per iteration (s): 15.21 | learning rate: 7.183E-07 | global batch size:    16 | lm loss: 1.060394E+01 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      138/  128728 | consumed samples:         2208 | consumed tokens:      4521984 | elapsed time per iteration (s): 15.14 | learning rate: 7.235E-07 | global batch size:    16 | lm loss: 1.051196E+01 | grad norm: 1.103 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration      139/  128728 | consumed samples:         2224 | consumed tokens:      4554752 | elapsed time per iteration (s): 15.15 | learning rate: 7.288E-07 | global batch size:    16 | lm loss: 1.058902E+01 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration      140/  128728 | consumed samples:         2240 | consumed tokens:      4587520 | elapsed time per iteration (s): 15.19 | learning rate: 7.340E-07 | global batch size:    16 | lm loss: 1.060271E+01 | grad norm: 1.084 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      141/  128728 | consumed samples:         2256 | consumed tokens:      4620288 | elapsed time per iteration (s): 15.20 | learning rate: 7.392E-07 | global batch size:    16 | lm loss: 1.046633E+01 | grad norm: 0.648 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      142/  128728 | consumed samples:         2272 | consumed tokens:      4653056 | elapsed time per iteration (s): 15.20 | learning rate: 7.445E-07 | global batch size:    16 | lm loss: 1.055144E+01 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      143/  128728 | consumed samples:         2288 | consumed tokens:      4685824 | elapsed time per iteration (s): 15.16 | learning rate: 7.497E-07 | global batch size:    16 | lm loss: 1.071862E+01 | grad norm: 0.967 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      144/  128728 | consumed samples:         2304 | consumed tokens:      4718592 | elapsed time per iteration (s): 15.21 | learning rate: 7.550E-07 | global batch size:    16 | lm loss: 1.053111E+01 | grad norm: 0.905 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      145/  128728 | consumed samples:         2320 | consumed tokens:      4751360 | elapsed time per iteration (s): 15.21 | learning rate: 7.602E-07 | global batch size:    16 | lm loss: 1.067661E+01 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      146/  128728 | consumed samples:         2336 | consumed tokens:      4784128 | elapsed time per iteration (s): 15.21 | learning rate: 7.655E-07 | global batch size:    16 | lm loss: 1.046909E+01 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      147/  128728 | consumed samples:         2352 | consumed tokens:      4816896 | elapsed time per iteration (s): 15.20 | learning rate: 7.707E-07 | global batch size:    16 | lm loss: 1.068971E+01 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      148/  128728 | consumed samples:         2368 | consumed tokens:      4849664 | elapsed time per iteration (s): 15.17 | learning rate: 7.759E-07 | global batch size:    16 | lm loss: 1.061832E+01 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      149/  128728 | consumed samples:         2384 | consumed tokens:      4882432 | elapsed time per iteration (s): 15.17 | learning rate: 7.812E-07 | global batch size:    16 | lm loss: 1.059798E+01 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      150/  128728 | consumed samples:         2400 | consumed tokens:      4915200 | elapsed time per iteration (s): 15.17 | learning rate: 7.864E-07 | global batch size:    16 | lm loss: 1.044703E+01 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      151/  128728 | consumed samples:         2416 | consumed tokens:      4947968 | elapsed time per iteration (s): 15.17 | learning rate: 7.917E-07 | global batch size:    16 | lm loss: 1.052176E+01 | grad norm: 0.879 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration      152/  128728 | consumed samples:         2432 | consumed tokens:      4980736 | elapsed time per iteration (s): 15.17 | learning rate: 7.969E-07 | global batch size:    16 | lm loss: 1.056132E+01 | grad norm: 0.632 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      153/  128728 | consumed samples:         2448 | consumed tokens:      5013504 | elapsed time per iteration (s): 15.20 | learning rate: 8.022E-07 | global batch size:    16 | lm loss: 1.038216E+01 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      154/  128728 | consumed samples:         2464 | consumed tokens:      5046272 | elapsed time per iteration (s): 15.16 | learning rate: 8.074E-07 | global batch size:    16 | lm loss: 1.059594E+01 | grad norm: 1.109 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration      155/  128728 | consumed samples:         2480 | consumed tokens:      5079040 | elapsed time per iteration (s): 15.20 | learning rate: 8.126E-07 | global batch size:    16 | lm loss: 1.039668E+01 | grad norm: 0.626 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      156/  128728 | consumed samples:         2496 | consumed tokens:      5111808 | elapsed time per iteration (s): 15.19 | learning rate: 8.179E-07 | global batch size:    16 | lm loss: 1.041435E+01 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      157/  128728 | consumed samples:         2512 | consumed tokens:      5144576 | elapsed time per iteration (s): 15.22 | learning rate: 8.231E-07 | global batch size:    16 | lm loss: 1.060662E+01 | grad norm: 1.020 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      158/  128728 | consumed samples:         2528 | consumed tokens:      5177344 | elapsed time per iteration (s): 15.21 | learning rate: 8.284E-07 | global batch size:    16 | lm loss: 1.036032E+01 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      159/  128728 | consumed samples:         2544 | consumed tokens:      5210112 | elapsed time per iteration (s): 15.20 | learning rate: 8.336E-07 | global batch size:    16 | lm loss: 1.040484E+01 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      160/  128728 | consumed samples:         2560 | consumed tokens:      5242880 | elapsed time per iteration (s): 15.21 | learning rate: 8.389E-07 | global batch size:    16 | lm loss: 1.053427E+01 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      161/  128728 | consumed samples:         2576 | consumed tokens:      5275648 | elapsed time per iteration (s): 15.21 | learning rate: 8.441E-07 | global batch size:    16 | lm loss: 1.045372E+01 | grad norm: 1.023 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      162/  128728 | consumed samples:         2592 | consumed tokens:      5308416 | elapsed time per iteration (s): 15.24 | learning rate: 8.493E-07 | global batch size:    16 | lm loss: 1.044134E+01 | grad norm: 1.012 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      163/  128728 | consumed samples:         2608 | consumed tokens:      5341184 | elapsed time per iteration (s): 15.21 | learning rate: 8.546E-07 | global batch size:    16 | lm loss: 1.037730E+01 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      164/  128728 | consumed samples:         2624 | consumed tokens:      5373952 | elapsed time per iteration (s): 15.19 | learning rate: 8.598E-07 | global batch size:    16 | lm loss: 1.037023E+01 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      165/  128728 | consumed samples:         2640 | consumed tokens:      5406720 | elapsed time per iteration (s): 15.22 | learning rate: 8.651E-07 | global batch size:    16 | lm loss: 1.033101E+01 | grad norm: 1.026 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      166/  128728 | consumed samples:         2656 | consumed tokens:      5439488 | elapsed time per iteration (s): 15.21 | learning rate: 8.703E-07 | global batch size:    16 | lm loss: 1.036347E+01 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      167/  128728 | consumed samples:         2672 | consumed tokens:      5472256 | elapsed time per iteration (s): 15.21 | learning rate: 8.756E-07 | global batch size:    16 | lm loss: 1.042902E+01 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      168/  128728 | consumed samples:         2688 | consumed tokens:      5505024 | elapsed time per iteration (s): 15.23 | learning rate: 8.808E-07 | global batch size:    16 | lm loss: 1.027396E+01 | grad norm: 1.191 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      169/  128728 | consumed samples:         2704 | consumed tokens:      5537792 | elapsed time per iteration (s): 15.14 | learning rate: 8.860E-07 | global batch size:    16 | lm loss: 1.033432E+01 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration      170/  128728 | consumed samples:         2720 | consumed tokens:      5570560 | elapsed time per iteration (s): 15.22 | learning rate: 8.913E-07 | global batch size:    16 | lm loss: 1.026951E+01 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      171/  128728 | consumed samples:         2736 | consumed tokens:      5603328 | elapsed time per iteration (s): 15.18 | learning rate: 8.965E-07 | global batch size:    16 | lm loss: 1.017828E+01 | grad norm: 2.324 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      172/  128728 | consumed samples:         2752 | consumed tokens:      5636096 | elapsed time per iteration (s): 15.18 | learning rate: 9.018E-07 | global batch size:    16 | lm loss: 1.032809E+01 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      173/  128728 | consumed samples:         2768 | consumed tokens:      5668864 | elapsed time per iteration (s): 15.18 | learning rate: 9.070E-07 | global batch size:    16 | lm loss: 1.033579E+01 | grad norm: 0.922 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      174/  128728 | consumed samples:         2784 | consumed tokens:      5701632 | elapsed time per iteration (s): 15.20 | learning rate: 9.123E-07 | global batch size:    16 | lm loss: 1.006872E+01 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      175/  128728 | consumed samples:         2800 | consumed tokens:      5734400 | elapsed time per iteration (s): 15.22 | learning rate: 9.175E-07 | global batch size:    16 | lm loss: 1.022235E+01 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      176/  128728 | consumed samples:         2816 | consumed tokens:      5767168 | elapsed time per iteration (s): 15.21 | learning rate: 9.227E-07 | global batch size:    16 | lm loss: 1.020374E+01 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      177/  128728 | consumed samples:         2832 | consumed tokens:      5799936 | elapsed time per iteration (s): 15.17 | learning rate: 9.280E-07 | global batch size:    16 | lm loss: 1.014564E+01 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      178/  128728 | consumed samples:         2848 | consumed tokens:      5832704 | elapsed time per iteration (s): 15.21 | learning rate: 9.332E-07 | global batch size:    16 | lm loss: 1.032068E+01 | grad norm: 1.243 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      179/  128728 | consumed samples:         2864 | consumed tokens:      5865472 | elapsed time per iteration (s): 15.19 | learning rate: 9.385E-07 | global batch size:    16 | lm loss: 1.024278E+01 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      180/  128728 | consumed samples:         2880 | consumed tokens:      5898240 | elapsed time per iteration (s): 15.19 | learning rate: 9.437E-07 | global batch size:    16 | lm loss: 1.029474E+01 | grad norm: 0.598 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      181/  128728 | consumed samples:         2896 | consumed tokens:      5931008 | elapsed time per iteration (s): 15.19 | learning rate: 9.490E-07 | global batch size:    16 | lm loss: 1.046901E+01 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      182/  128728 | consumed samples:         2912 | consumed tokens:      5963776 | elapsed time per iteration (s): 15.22 | learning rate: 9.542E-07 | global batch size:    16 | lm loss: 1.012921E+01 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      183/  128728 | consumed samples:         2928 | consumed tokens:      5996544 | elapsed time per iteration (s): 15.20 | learning rate: 9.594E-07 | global batch size:    16 | lm loss: 1.034022E+01 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      184/  128728 | consumed samples:         2944 | consumed tokens:      6029312 | elapsed time per iteration (s): 15.18 | learning rate: 9.647E-07 | global batch size:    16 | lm loss: 1.003381E+01 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      185/  128728 | consumed samples:         2960 | consumed tokens:      6062080 | elapsed time per iteration (s): 15.16 | learning rate: 9.699E-07 | global batch size:    16 | lm loss: 1.021115E+01 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      186/  128728 | consumed samples:         2976 | consumed tokens:      6094848 | elapsed time per iteration (s): 15.19 | learning rate: 9.752E-07 | global batch size:    16 | lm loss: 1.006208E+01 | grad norm: 1.198 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      187/  128728 | consumed samples:         2992 | consumed tokens:      6127616 | elapsed time per iteration (s): 15.22 | learning rate: 9.804E-07 | global batch size:    16 | lm loss: 1.016780E+01 | grad norm: 1.172 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      188/  128728 | consumed samples:         3008 | consumed tokens:      6160384 | elapsed time per iteration (s): 15.17 | learning rate: 9.857E-07 | global batch size:    16 | lm loss: 1.032679E+01 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      189/  128728 | consumed samples:         3024 | consumed tokens:      6193152 | elapsed time per iteration (s): 15.20 | learning rate: 9.909E-07 | global batch size:    16 | lm loss: 1.000206E+01 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      190/  128728 | consumed samples:         3040 | consumed tokens:      6225920 | elapsed time per iteration (s): 15.21 | learning rate: 9.961E-07 | global batch size:    16 | lm loss: 1.015638E+01 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      191/  128728 | consumed samples:         3056 | consumed tokens:      6258688 | elapsed time per iteration (s): 15.21 | learning rate: 1.001E-06 | global batch size:    16 | lm loss: 9.991480E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      192/  128728 | consumed samples:         3072 | consumed tokens:      6291456 | elapsed time per iteration (s): 15.20 | learning rate: 1.007E-06 | global batch size:    16 | lm loss: 1.009315E+01 | grad norm: 0.881 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      193/  128728 | consumed samples:         3088 | consumed tokens:      6324224 | elapsed time per iteration (s): 15.26 | learning rate: 1.012E-06 | global batch size:    16 | lm loss: 9.941729E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      194/  128728 | consumed samples:         3104 | consumed tokens:      6356992 | elapsed time per iteration (s): 15.23 | learning rate: 1.017E-06 | global batch size:    16 | lm loss: 1.005856E+01 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      195/  128728 | consumed samples:         3120 | consumed tokens:      6389760 | elapsed time per iteration (s): 15.20 | learning rate: 1.022E-06 | global batch size:    16 | lm loss: 1.016409E+01 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      196/  128728 | consumed samples:         3136 | consumed tokens:      6422528 | elapsed time per iteration (s): 15.23 | learning rate: 1.028E-06 | global batch size:    16 | lm loss: 9.989647E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      197/  128728 | consumed samples:         3152 | consumed tokens:      6455296 | elapsed time per iteration (s): 15.23 | learning rate: 1.033E-06 | global batch size:    16 | lm loss: 9.962715E+00 | grad norm: 0.655 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      198/  128728 | consumed samples:         3168 | consumed tokens:      6488064 | elapsed time per iteration (s): 15.24 | learning rate: 1.038E-06 | global batch size:    16 | lm loss: 1.009250E+01 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      199/  128728 | consumed samples:         3184 | consumed tokens:      6520832 | elapsed time per iteration (s): 15.22 | learning rate: 1.043E-06 | global batch size:    16 | lm loss: 9.905367E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      200/  128728 | consumed samples:         3200 | consumed tokens:      6553600 | elapsed time per iteration (s): 15.20 | learning rate: 1.049E-06 | global batch size:    16 | lm loss: 1.007274E+01 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      201/  128728 | consumed samples:         3216 | consumed tokens:      6586368 | elapsed time per iteration (s): 15.22 | learning rate: 1.054E-06 | global batch size:    16 | lm loss: 9.892535E+00 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      202/  128728 | consumed samples:         3232 | consumed tokens:      6619136 | elapsed time per iteration (s): 15.22 | learning rate: 1.059E-06 | global batch size:    16 | lm loss: 9.908247E+00 | grad norm: 1.690 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      203/  128728 | consumed samples:         3248 | consumed tokens:      6651904 | elapsed time per iteration (s): 15.20 | learning rate: 1.064E-06 | global batch size:    16 | lm loss: 9.759439E+00 | grad norm: 0.893 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      204/  128728 | consumed samples:         3264 | consumed tokens:      6684672 | elapsed time per iteration (s): 15.23 | learning rate: 1.070E-06 | global batch size:    16 | lm loss: 9.843822E+00 | grad norm: 1.350 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      205/  128728 | consumed samples:         3280 | consumed tokens:      6717440 | elapsed time per iteration (s): 15.20 | learning rate: 1.075E-06 | global batch size:    16 | lm loss: 9.970119E+00 | grad norm: 1.374 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      206/  128728 | consumed samples:         3296 | consumed tokens:      6750208 | elapsed time per iteration (s): 15.23 | learning rate: 1.080E-06 | global batch size:    16 | lm loss: 1.008592E+01 | grad norm: 1.284 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      207/  128728 | consumed samples:         3312 | consumed tokens:      6782976 | elapsed time per iteration (s): 15.21 | learning rate: 1.085E-06 | global batch size:    16 | lm loss: 9.928805E+00 | grad norm: 1.043 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      208/  128728 | consumed samples:         3328 | consumed tokens:      6815744 | elapsed time per iteration (s): 15.20 | learning rate: 1.091E-06 | global batch size:    16 | lm loss: 9.940935E+00 | grad norm: 1.178 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      209/  128728 | consumed samples:         3344 | consumed tokens:      6848512 | elapsed time per iteration (s): 15.23 | learning rate: 1.096E-06 | global batch size:    16 | lm loss: 9.809174E+00 | grad norm: 1.453 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      210/  128728 | consumed samples:         3360 | consumed tokens:      6881280 | elapsed time per iteration (s): 15.24 | learning rate: 1.101E-06 | global batch size:    16 | lm loss: 9.955800E+00 | grad norm: 1.625 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      211/  128728 | consumed samples:         3376 | consumed tokens:      6914048 | elapsed time per iteration (s): 15.22 | learning rate: 1.106E-06 | global batch size:    16 | lm loss: 9.909077E+00 | grad norm: 1.584 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      212/  128728 | consumed samples:         3392 | consumed tokens:      6946816 | elapsed time per iteration (s): 15.23 | learning rate: 1.111E-06 | global batch size:    16 | lm loss: 9.912115E+00 | grad norm: 0.959 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      213/  128728 | consumed samples:         3408 | consumed tokens:      6979584 | elapsed time per iteration (s): 15.21 | learning rate: 1.117E-06 | global batch size:    16 | lm loss: 9.802191E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      214/  128728 | consumed samples:         3424 | consumed tokens:      7012352 | elapsed time per iteration (s): 15.22 | learning rate: 1.122E-06 | global batch size:    16 | lm loss: 9.900744E+00 | grad norm: 1.028 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      215/  128728 | consumed samples:         3440 | consumed tokens:      7045120 | elapsed time per iteration (s): 15.21 | learning rate: 1.127E-06 | global batch size:    16 | lm loss: 9.727583E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      216/  128728 | consumed samples:         3456 | consumed tokens:      7077888 | elapsed time per iteration (s): 15.22 | learning rate: 1.132E-06 | global batch size:    16 | lm loss: 9.846464E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      217/  128728 | consumed samples:         3472 | consumed tokens:      7110656 | elapsed time per iteration (s): 15.19 | learning rate: 1.138E-06 | global batch size:    16 | lm loss: 1.001000E+01 | grad norm: 0.998 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      218/  128728 | consumed samples:         3488 | consumed tokens:      7143424 | elapsed time per iteration (s): 15.23 | learning rate: 1.143E-06 | global batch size:    16 | lm loss: 9.839026E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      219/  128728 | consumed samples:         3504 | consumed tokens:      7176192 | elapsed time per iteration (s): 15.22 | learning rate: 1.148E-06 | global batch size:    16 | lm loss: 9.730466E+00 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      220/  128728 | consumed samples:         3520 | consumed tokens:      7208960 | elapsed time per iteration (s): 15.17 | learning rate: 1.153E-06 | global batch size:    16 | lm loss: 9.767716E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      221/  128728 | consumed samples:         3536 | consumed tokens:      7241728 | elapsed time per iteration (s): 15.18 | learning rate: 1.159E-06 | global batch size:    16 | lm loss: 9.788709E+00 | grad norm: 1.378 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      222/  128728 | consumed samples:         3552 | consumed tokens:      7274496 | elapsed time per iteration (s): 15.19 | learning rate: 1.164E-06 | global batch size:    16 | lm loss: 9.750614E+00 | grad norm: 1.179 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      223/  128728 | consumed samples:         3568 | consumed tokens:      7307264 | elapsed time per iteration (s): 15.20 | learning rate: 1.169E-06 | global batch size:    16 | lm loss: 9.629946E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      224/  128728 | consumed samples:         3584 | consumed tokens:      7340032 | elapsed time per iteration (s): 15.20 | learning rate: 1.174E-06 | global batch size:    16 | lm loss: 1.004527E+01 | grad norm: 1.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      225/  128728 | consumed samples:         3600 | consumed tokens:      7372800 | elapsed time per iteration (s): 15.24 | learning rate: 1.180E-06 | global batch size:    16 | lm loss: 9.818333E+00 | grad norm: 1.734 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      226/  128728 | consumed samples:         3616 | consumed tokens:      7405568 | elapsed time per iteration (s): 15.18 | learning rate: 1.185E-06 | global batch size:    16 | lm loss: 9.859022E+00 | grad norm: 1.213 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      227/  128728 | consumed samples:         3632 | consumed tokens:      7438336 | elapsed time per iteration (s): 15.17 | learning rate: 1.190E-06 | global batch size:    16 | lm loss: 9.774880E+00 | grad norm: 1.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      228/  128728 | consumed samples:         3648 | consumed tokens:      7471104 | elapsed time per iteration (s): 15.22 | learning rate: 1.195E-06 | global batch size:    16 | lm loss: 9.777248E+00 | grad norm: 1.207 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      229/  128728 | consumed samples:         3664 | consumed tokens:      7503872 | elapsed time per iteration (s): 15.23 | learning rate: 1.201E-06 | global batch size:    16 | lm loss: 9.784309E+00 | grad norm: 1.161 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      230/  128728 | consumed samples:         3680 | consumed tokens:      7536640 | elapsed time per iteration (s): 15.21 | learning rate: 1.206E-06 | global batch size:    16 | lm loss: 9.753279E+00 | grad norm: 1.041 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      231/  128728 | consumed samples:         3696 | consumed tokens:      7569408 | elapsed time per iteration (s): 15.22 | learning rate: 1.211E-06 | global batch size:    16 | lm loss: 9.784714E+00 | grad norm: 1.064 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      232/  128728 | consumed samples:         3712 | consumed tokens:      7602176 | elapsed time per iteration (s): 15.22 | learning rate: 1.216E-06 | global batch size:    16 | lm loss: 9.695133E+00 | grad norm: 1.334 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      233/  128728 | consumed samples:         3728 | consumed tokens:      7634944 | elapsed time per iteration (s): 15.22 | learning rate: 1.222E-06 | global batch size:    16 | lm loss: 9.556194E+00 | grad norm: 1.305 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      234/  128728 | consumed samples:         3744 | consumed tokens:      7667712 | elapsed time per iteration (s): 15.19 | learning rate: 1.227E-06 | global batch size:    16 | lm loss: 9.775770E+00 | grad norm: 1.239 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      235/  128728 | consumed samples:         3760 | consumed tokens:      7700480 | elapsed time per iteration (s): 15.22 | learning rate: 1.232E-06 | global batch size:    16 | lm loss: 9.595947E+00 | grad norm: 1.122 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      236/  128728 | consumed samples:         3776 | consumed tokens:      7733248 | elapsed time per iteration (s): 15.21 | learning rate: 1.237E-06 | global batch size:    16 | lm loss: 9.768786E+00 | grad norm: 1.232 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      237/  128728 | consumed samples:         3792 | consumed tokens:      7766016 | elapsed time per iteration (s): 15.24 | learning rate: 1.243E-06 | global batch size:    16 | lm loss: 9.527258E+00 | grad norm: 1.115 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      238/  128728 | consumed samples:         3808 | consumed tokens:      7798784 | elapsed time per iteration (s): 15.22 | learning rate: 1.248E-06 | global batch size:    16 | lm loss: 9.808368E+00 | grad norm: 1.180 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      239/  128728 | consumed samples:         3824 | consumed tokens:      7831552 | elapsed time per iteration (s): 15.20 | learning rate: 1.253E-06 | global batch size:    16 | lm loss: 9.664412E+00 | grad norm: 1.186 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      240/  128728 | consumed samples:         3840 | consumed tokens:      7864320 | elapsed time per iteration (s): 15.21 | learning rate: 1.258E-06 | global batch size:    16 | lm loss: 9.680309E+00 | grad norm: 2.453 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      241/  128728 | consumed samples:         3856 | consumed tokens:      7897088 | elapsed time per iteration (s): 15.22 | learning rate: 1.264E-06 | global batch size:    16 | lm loss: 9.523140E+00 | grad norm: 0.932 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      242/  128728 | consumed samples:         3872 | consumed tokens:      7929856 | elapsed time per iteration (s): 15.18 | learning rate: 1.269E-06 | global batch size:    16 | lm loss: 9.746195E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      243/  128728 | consumed samples:         3888 | consumed tokens:      7962624 | elapsed time per iteration (s): 15.20 | learning rate: 1.274E-06 | global batch size:    16 | lm loss: 9.654213E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      244/  128728 | consumed samples:         3904 | consumed tokens:      7995392 | elapsed time per iteration (s): 15.19 | learning rate: 1.279E-06 | global batch size:    16 | lm loss: 9.681046E+00 | grad norm: 1.205 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration      245/  128728 | consumed samples:         3920 | consumed tokens:      8028160 | elapsed time per iteration (s): 15.22 | learning rate: 1.285E-06 | global batch size:    16 | lm loss: 9.748778E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      246/  128728 | consumed samples:         3936 | consumed tokens:      8060928 | elapsed time per iteration (s): 15.21 | learning rate: 1.290E-06 | global batch size:    16 | lm loss: 9.600563E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      247/  128728 | consumed samples:         3952 | consumed tokens:      8093696 | elapsed time per iteration (s): 15.18 | learning rate: 1.295E-06 | global batch size:    16 | lm loss: 9.489889E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      248/  128728 | consumed samples:         3968 | consumed tokens:      8126464 | elapsed time per iteration (s): 15.22 | learning rate: 1.300E-06 | global batch size:    16 | lm loss: 9.397079E+00 | grad norm: 1.251 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      249/  128728 | consumed samples:         3984 | consumed tokens:      8159232 | elapsed time per iteration (s): 15.19 | learning rate: 1.305E-06 | global batch size:    16 | lm loss: 9.610090E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      250/  128728 | consumed samples:         4000 | consumed tokens:      8192000 | elapsed time per iteration (s): 15.22 | learning rate: 1.311E-06 | global batch size:    16 | lm loss: 9.520005E+00 | grad norm: 1.252 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      251/  128728 | consumed samples:         4016 | consumed tokens:      8224768 | elapsed time per iteration (s): 15.23 | learning rate: 1.316E-06 | global batch size:    16 | lm loss: 9.354611E+00 | grad norm: 1.216 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      252/  128728 | consumed samples:         4032 | consumed tokens:      8257536 | elapsed time per iteration (s): 15.26 | learning rate: 1.321E-06 | global batch size:    16 | lm loss: 9.402354E+00 | grad norm: 1.381 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      253/  128728 | consumed samples:         4048 | consumed tokens:      8290304 | elapsed time per iteration (s): 15.25 | learning rate: 1.326E-06 | global batch size:    16 | lm loss: 9.472418E+00 | grad norm: 1.558 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      254/  128728 | consumed samples:         4064 | consumed tokens:      8323072 | elapsed time per iteration (s): 15.24 | learning rate: 1.332E-06 | global batch size:    16 | lm loss: 9.596987E+00 | grad norm: 1.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      255/  128728 | consumed samples:         4080 | consumed tokens:      8355840 | elapsed time per iteration (s): 15.25 | learning rate: 1.337E-06 | global batch size:    16 | lm loss: 9.526454E+00 | grad norm: 1.328 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      256/  128728 | consumed samples:         4096 | consumed tokens:      8388608 | elapsed time per iteration (s): 15.29 | learning rate: 1.342E-06 | global batch size:    16 | lm loss: 9.607473E+00 | grad norm: 1.050 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration      257/  128728 | consumed samples:         4112 | consumed tokens:      8421376 | elapsed time per iteration (s): 15.25 | learning rate: 1.347E-06 | global batch size:    16 | lm loss: 9.439919E+00 | grad norm: 1.168 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      258/  128728 | consumed samples:         4128 | consumed tokens:      8454144 | elapsed time per iteration (s): 15.23 | learning rate: 1.353E-06 | global batch size:    16 | lm loss: 9.616064E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      259/  128728 | consumed samples:         4144 | consumed tokens:      8486912 | elapsed time per iteration (s): 15.27 | learning rate: 1.358E-06 | global batch size:    16 | lm loss: 9.386358E+00 | grad norm: 1.068 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      260/  128728 | consumed samples:         4160 | consumed tokens:      8519680 | elapsed time per iteration (s): 15.23 | learning rate: 1.363E-06 | global batch size:    16 | lm loss: 9.311523E+00 | grad norm: 0.944 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      261/  128728 | consumed samples:         4176 | consumed tokens:      8552448 | elapsed time per iteration (s): 15.17 | learning rate: 1.368E-06 | global batch size:    16 | lm loss: 9.406882E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      262/  128728 | consumed samples:         4192 | consumed tokens:      8585216 | elapsed time per iteration (s): 15.20 | learning rate: 1.374E-06 | global batch size:    16 | lm loss: 9.483783E+00 | grad norm: 1.232 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      263/  128728 | consumed samples:         4208 | consumed tokens:      8617984 | elapsed time per iteration (s): 15.21 | learning rate: 1.379E-06 | global batch size:    16 | lm loss: 9.435326E+00 | grad norm: 1.179 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      264/  128728 | consumed samples:         4224 | consumed tokens:      8650752 | elapsed time per iteration (s): 15.23 | learning rate: 1.384E-06 | global batch size:    16 | lm loss: 9.483128E+00 | grad norm: 1.459 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      265/  128728 | consumed samples:         4240 | consumed tokens:      8683520 | elapsed time per iteration (s): 15.24 | learning rate: 1.389E-06 | global batch size:    16 | lm loss: 9.487989E+00 | grad norm: 1.064 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      266/  128728 | consumed samples:         4256 | consumed tokens:      8716288 | elapsed time per iteration (s): 15.24 | learning rate: 1.395E-06 | global batch size:    16 | lm loss: 9.551134E+00 | grad norm: 1.466 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      267/  128728 | consumed samples:         4272 | consumed tokens:      8749056 | elapsed time per iteration (s): 15.24 | learning rate: 1.400E-06 | global batch size:    16 | lm loss: 9.242275E+00 | grad norm: 1.120 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      268/  128728 | consumed samples:         4288 | consumed tokens:      8781824 | elapsed time per iteration (s): 15.26 | learning rate: 1.405E-06 | global batch size:    16 | lm loss: 9.469782E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      269/  128728 | consumed samples:         4304 | consumed tokens:      8814592 | elapsed time per iteration (s): 15.24 | learning rate: 1.410E-06 | global batch size:    16 | lm loss: 9.499035E+00 | grad norm: 1.487 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      270/  128728 | consumed samples:         4320 | consumed tokens:      8847360 | elapsed time per iteration (s): 15.22 | learning rate: 1.416E-06 | global batch size:    16 | lm loss: 9.467442E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      271/  128728 | consumed samples:         4336 | consumed tokens:      8880128 | elapsed time per iteration (s): 15.24 | learning rate: 1.421E-06 | global batch size:    16 | lm loss: 9.442656E+00 | grad norm: 1.546 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      272/  128728 | consumed samples:         4352 | consumed tokens:      8912896 | elapsed time per iteration (s): 15.21 | learning rate: 1.426E-06 | global batch size:    16 | lm loss: 9.322374E+00 | grad norm: 0.947 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      273/  128728 | consumed samples:         4368 | consumed tokens:      8945664 | elapsed time per iteration (s): 15.21 | learning rate: 1.431E-06 | global batch size:    16 | lm loss: 9.270580E+00 | grad norm: 1.042 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      274/  128728 | consumed samples:         4384 | consumed tokens:      8978432 | elapsed time per iteration (s): 15.22 | learning rate: 1.437E-06 | global batch size:    16 | lm loss: 9.374606E+00 | grad norm: 1.063 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      275/  128728 | consumed samples:         4400 | consumed tokens:      9011200 | elapsed time per iteration (s): 15.24 | learning rate: 1.442E-06 | global batch size:    16 | lm loss: 9.264148E+00 | grad norm: 1.074 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      276/  128728 | consumed samples:         4416 | consumed tokens:      9043968 | elapsed time per iteration (s): 15.26 | learning rate: 1.447E-06 | global batch size:    16 | lm loss: 9.256626E+00 | grad norm: 1.241 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      277/  128728 | consumed samples:         4432 | consumed tokens:      9076736 | elapsed time per iteration (s): 15.26 | learning rate: 1.452E-06 | global batch size:    16 | lm loss: 9.479916E+00 | grad norm: 2.775 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      278/  128728 | consumed samples:         4448 | consumed tokens:      9109504 | elapsed time per iteration (s): 15.25 | learning rate: 1.458E-06 | global batch size:    16 | lm loss: 9.171821E+00 | grad norm: 1.493 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration      279/  128728 | consumed samples:         4464 | consumed tokens:      9142272 | elapsed time per iteration (s): 15.25 | learning rate: 1.463E-06 | global batch size:    16 | lm loss: 9.419685E+00 | grad norm: 1.119 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      280/  128728 | consumed samples:         4480 | consumed tokens:      9175040 | elapsed time per iteration (s): 15.21 | learning rate: 1.468E-06 | global batch size:    16 | lm loss: 9.336754E+00 | grad norm: 1.083 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      281/  128728 | consumed samples:         4496 | consumed tokens:      9207808 | elapsed time per iteration (s): 15.24 | learning rate: 1.473E-06 | global batch size:    16 | lm loss: 9.144946E+00 | grad norm: 1.990 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      282/  128728 | consumed samples:         4512 | consumed tokens:      9240576 | elapsed time per iteration (s): 15.24 | learning rate: 1.478E-06 | global batch size:    16 | lm loss: 9.401902E+00 | grad norm: 1.032 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      283/  128728 | consumed samples:         4528 | consumed tokens:      9273344 | elapsed time per iteration (s): 15.21 | learning rate: 1.484E-06 | global batch size:    16 | lm loss: 9.207463E+00 | grad norm: 1.175 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      284/  128728 | consumed samples:         4544 | consumed tokens:      9306112 | elapsed time per iteration (s): 15.21 | learning rate: 1.489E-06 | global batch size:    16 | lm loss: 9.289896E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      285/  128728 | consumed samples:         4560 | consumed tokens:      9338880 | elapsed time per iteration (s): 15.21 | learning rate: 1.494E-06 | global batch size:    16 | lm loss: 9.052877E+00 | grad norm: 1.013 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      286/  128728 | consumed samples:         4576 | consumed tokens:      9371648 | elapsed time per iteration (s): 15.19 | learning rate: 1.499E-06 | global batch size:    16 | lm loss: 9.375488E+00 | grad norm: 1.203 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      287/  128728 | consumed samples:         4592 | consumed tokens:      9404416 | elapsed time per iteration (s): 15.21 | learning rate: 1.505E-06 | global batch size:    16 | lm loss: 9.356708E+00 | grad norm: 1.152 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      288/  128728 | consumed samples:         4608 | consumed tokens:      9437184 | elapsed time per iteration (s): 15.22 | learning rate: 1.510E-06 | global batch size:    16 | lm loss: 9.133143E+00 | grad norm: 1.189 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      289/  128728 | consumed samples:         4624 | consumed tokens:      9469952 | elapsed time per iteration (s): 15.26 | learning rate: 1.515E-06 | global batch size:    16 | lm loss: 9.436096E+00 | grad norm: 1.579 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      290/  128728 | consumed samples:         4640 | consumed tokens:      9502720 | elapsed time per iteration (s): 15.21 | learning rate: 1.520E-06 | global batch size:    16 | lm loss: 9.226528E+00 | grad norm: 1.218 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      291/  128728 | consumed samples:         4656 | consumed tokens:      9535488 | elapsed time per iteration (s): 15.19 | learning rate: 1.526E-06 | global batch size:    16 | lm loss: 9.340797E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      292/  128728 | consumed samples:         4672 | consumed tokens:      9568256 | elapsed time per iteration (s): 15.22 | learning rate: 1.531E-06 | global batch size:    16 | lm loss: 9.186805E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      293/  128728 | consumed samples:         4688 | consumed tokens:      9601024 | elapsed time per iteration (s): 15.21 | learning rate: 1.536E-06 | global batch size:    16 | lm loss: 9.120500E+00 | grad norm: 1.466 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      294/  128728 | consumed samples:         4704 | consumed tokens:      9633792 | elapsed time per iteration (s): 15.19 | learning rate: 1.541E-06 | global batch size:    16 | lm loss: 9.039913E+00 | grad norm: 1.139 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      295/  128728 | consumed samples:         4720 | consumed tokens:      9666560 | elapsed time per iteration (s): 15.26 | learning rate: 1.547E-06 | global batch size:    16 | lm loss: 9.181991E+00 | grad norm: 1.117 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      296/  128728 | consumed samples:         4736 | consumed tokens:      9699328 | elapsed time per iteration (s): 15.23 | learning rate: 1.552E-06 | global batch size:    16 | lm loss: 9.090605E+00 | grad norm: 1.307 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      297/  128728 | consumed samples:         4752 | consumed tokens:      9732096 | elapsed time per iteration (s): 15.25 | learning rate: 1.557E-06 | global batch size:    16 | lm loss: 9.270121E+00 | grad norm: 1.002 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      298/  128728 | consumed samples:         4768 | consumed tokens:      9764864 | elapsed time per iteration (s): 15.21 | learning rate: 1.562E-06 | global batch size:    16 | lm loss: 9.101935E+00 | grad norm: 1.120 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      299/  128728 | consumed samples:         4784 | consumed tokens:      9797632 | elapsed time per iteration (s): 15.21 | learning rate: 1.568E-06 | global batch size:    16 | lm loss: 9.210810E+00 | grad norm: 1.630 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      300/  128728 | consumed samples:         4800 | consumed tokens:      9830400 | elapsed time per iteration (s): 15.21 | learning rate: 1.573E-06 | global batch size:    16 | lm loss: 9.187110E+00 | grad norm: 1.465 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      301/  128728 | consumed samples:         4816 | consumed tokens:      9863168 | elapsed time per iteration (s): 15.22 | learning rate: 1.578E-06 | global batch size:    16 | lm loss: 9.143536E+00 | grad norm: 1.351 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      302/  128728 | consumed samples:         4832 | consumed tokens:      9895936 | elapsed time per iteration (s): 15.22 | learning rate: 1.583E-06 | global batch size:    16 | lm loss: 9.160694E+00 | grad norm: 1.096 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      303/  128728 | consumed samples:         4848 | consumed tokens:      9928704 | elapsed time per iteration (s): 15.24 | learning rate: 1.589E-06 | global batch size:    16 | lm loss: 9.221185E+00 | grad norm: 1.317 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      304/  128728 | consumed samples:         4864 | consumed tokens:      9961472 | elapsed time per iteration (s): 15.21 | learning rate: 1.594E-06 | global batch size:    16 | lm loss: 9.189565E+00 | grad norm: 1.200 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      305/  128728 | consumed samples:         4880 | consumed tokens:      9994240 | elapsed time per iteration (s): 15.21 | learning rate: 1.599E-06 | global batch size:    16 | lm loss: 9.239432E+00 | grad norm: 1.386 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      306/  128728 | consumed samples:         4896 | consumed tokens:     10027008 | elapsed time per iteration (s): 15.23 | learning rate: 1.604E-06 | global batch size:    16 | lm loss: 9.193028E+00 | grad norm: 1.024 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      307/  128728 | consumed samples:         4912 | consumed tokens:     10059776 | elapsed time per iteration (s): 15.22 | learning rate: 1.610E-06 | global batch size:    16 | lm loss: 9.158922E+00 | grad norm: 1.299 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      308/  128728 | consumed samples:         4928 | consumed tokens:     10092544 | elapsed time per iteration (s): 15.21 | learning rate: 1.615E-06 | global batch size:    16 | lm loss: 9.136261E+00 | grad norm: 1.022 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      309/  128728 | consumed samples:         4944 | consumed tokens:     10125312 | elapsed time per iteration (s): 15.23 | learning rate: 1.620E-06 | global batch size:    16 | lm loss: 9.243754E+00 | grad norm: 1.196 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      310/  128728 | consumed samples:         4960 | consumed tokens:     10158080 | elapsed time per iteration (s): 15.23 | learning rate: 1.625E-06 | global batch size:    16 | lm loss: 9.191011E+00 | grad norm: 1.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      311/  128728 | consumed samples:         4976 | consumed tokens:     10190848 | elapsed time per iteration (s): 15.21 | learning rate: 1.631E-06 | global batch size:    16 | lm loss: 9.023661E+00 | grad norm: 1.505 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      312/  128728 | consumed samples:         4992 | consumed tokens:     10223616 | elapsed time per iteration (s): 15.26 | learning rate: 1.636E-06 | global batch size:    16 | lm loss: 9.186005E+00 | grad norm: 1.280 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      313/  128728 | consumed samples:         5008 | consumed tokens:     10256384 | elapsed time per iteration (s): 15.23 | learning rate: 1.641E-06 | global batch size:    16 | lm loss: 9.018515E+00 | grad norm: 1.208 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      314/  128728 | consumed samples:         5024 | consumed tokens:     10289152 | elapsed time per iteration (s): 15.19 | learning rate: 1.646E-06 | global batch size:    16 | lm loss: 8.974466E+00 | grad norm: 1.562 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      315/  128728 | consumed samples:         5040 | consumed tokens:     10321920 | elapsed time per iteration (s): 15.17 | learning rate: 1.652E-06 | global batch size:    16 | lm loss: 9.060785E+00 | grad norm: 1.455 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration      316/  128728 | consumed samples:         5056 | consumed tokens:     10354688 | elapsed time per iteration (s): 15.25 | learning rate: 1.657E-06 | global batch size:    16 | lm loss: 9.159584E+00 | grad norm: 1.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      317/  128728 | consumed samples:         5072 | consumed tokens:     10387456 | elapsed time per iteration (s): 15.19 | learning rate: 1.662E-06 | global batch size:    16 | lm loss: 9.113900E+00 | grad norm: 1.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      318/  128728 | consumed samples:         5088 | consumed tokens:     10420224 | elapsed time per iteration (s): 15.23 | learning rate: 1.667E-06 | global batch size:    16 | lm loss: 9.078951E+00 | grad norm: 1.130 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      319/  128728 | consumed samples:         5104 | consumed tokens:     10452992 | elapsed time per iteration (s): 15.21 | learning rate: 1.672E-06 | global batch size:    16 | lm loss: 9.058454E+00 | grad norm: 1.397 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      320/  128728 | consumed samples:         5120 | consumed tokens:     10485760 | elapsed time per iteration (s): 15.22 | learning rate: 1.678E-06 | global batch size:    16 | lm loss: 9.104960E+00 | grad norm: 1.391 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      321/  128728 | consumed samples:         5136 | consumed tokens:     10518528 | elapsed time per iteration (s): 15.19 | learning rate: 1.683E-06 | global batch size:    16 | lm loss: 8.983455E+00 | grad norm: 1.727 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      322/  128728 | consumed samples:         5152 | consumed tokens:     10551296 | elapsed time per iteration (s): 15.22 | learning rate: 1.688E-06 | global batch size:    16 | lm loss: 8.882467E+00 | grad norm: 1.355 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      323/  128728 | consumed samples:         5168 | consumed tokens:     10584064 | elapsed time per iteration (s): 15.21 | learning rate: 1.693E-06 | global batch size:    16 | lm loss: 8.978757E+00 | grad norm: 2.112 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      324/  128728 | consumed samples:         5184 | consumed tokens:     10616832 | elapsed time per iteration (s): 15.15 | learning rate: 1.699E-06 | global batch size:    16 | lm loss: 9.070255E+00 | grad norm: 1.126 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration      325/  128728 | consumed samples:         5200 | consumed tokens:     10649600 | elapsed time per iteration (s): 15.20 | learning rate: 1.704E-06 | global batch size:    16 | lm loss: 9.185911E+00 | grad norm: 1.380 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      326/  128728 | consumed samples:         5216 | consumed tokens:     10682368 | elapsed time per iteration (s): 15.23 | learning rate: 1.709E-06 | global batch size:    16 | lm loss: 8.935247E+00 | grad norm: 1.418 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      327/  128728 | consumed samples:         5232 | consumed tokens:     10715136 | elapsed time per iteration (s): 15.24 | learning rate: 1.714E-06 | global batch size:    16 | lm loss: 8.980277E+00 | grad norm: 1.305 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      328/  128728 | consumed samples:         5248 | consumed tokens:     10747904 | elapsed time per iteration (s): 15.23 | learning rate: 1.720E-06 | global batch size:    16 | lm loss: 9.004158E+00 | grad norm: 1.655 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      329/  128728 | consumed samples:         5264 | consumed tokens:     10780672 | elapsed time per iteration (s): 15.17 | learning rate: 1.725E-06 | global batch size:    16 | lm loss: 9.141132E+00 | grad norm: 1.525 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      330/  128728 | consumed samples:         5280 | consumed tokens:     10813440 | elapsed time per iteration (s): 15.20 | learning rate: 1.730E-06 | global batch size:    16 | lm loss: 9.098420E+00 | grad norm: 1.606 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      331/  128728 | consumed samples:         5296 | consumed tokens:     10846208 | elapsed time per iteration (s): 15.21 | learning rate: 1.735E-06 | global batch size:    16 | lm loss: 9.103991E+00 | grad norm: 1.140 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      332/  128728 | consumed samples:         5312 | consumed tokens:     10878976 | elapsed time per iteration (s): 15.18 | learning rate: 1.741E-06 | global batch size:    16 | lm loss: 9.196499E+00 | grad norm: 2.187 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      333/  128728 | consumed samples:         5328 | consumed tokens:     10911744 | elapsed time per iteration (s): 15.24 | learning rate: 1.746E-06 | global batch size:    16 | lm loss: 8.898166E+00 | grad norm: 1.308 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      334/  128728 | consumed samples:         5344 | consumed tokens:     10944512 | elapsed time per iteration (s): 15.23 | learning rate: 1.751E-06 | global batch size:    16 | lm loss: 8.988365E+00 | grad norm: 1.492 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      335/  128728 | consumed samples:         5360 | consumed tokens:     10977280 | elapsed time per iteration (s): 15.25 | learning rate: 1.756E-06 | global batch size:    16 | lm loss: 8.947336E+00 | grad norm: 1.491 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      336/  128728 | consumed samples:         5376 | consumed tokens:     11010048 | elapsed time per iteration (s): 15.20 | learning rate: 1.762E-06 | global batch size:    16 | lm loss: 8.925644E+00 | grad norm: 2.491 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      337/  128728 | consumed samples:         5392 | consumed tokens:     11042816 | elapsed time per iteration (s): 15.26 | learning rate: 1.767E-06 | global batch size:    16 | lm loss: 8.995684E+00 | grad norm: 1.650 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      338/  128728 | consumed samples:         5408 | consumed tokens:     11075584 | elapsed time per iteration (s): 15.24 | learning rate: 1.772E-06 | global batch size:    16 | lm loss: 8.828646E+00 | grad norm: 1.639 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      339/  128728 | consumed samples:         5424 | consumed tokens:     11108352 | elapsed time per iteration (s): 15.23 | learning rate: 1.777E-06 | global batch size:    16 | lm loss: 8.849914E+00 | grad norm: 1.488 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      340/  128728 | consumed samples:         5440 | consumed tokens:     11141120 | elapsed time per iteration (s): 15.22 | learning rate: 1.783E-06 | global batch size:    16 | lm loss: 8.669468E+00 | grad norm: 1.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      341/  128728 | consumed samples:         5456 | consumed tokens:     11173888 | elapsed time per iteration (s): 15.22 | learning rate: 1.788E-06 | global batch size:    16 | lm loss: 8.875322E+00 | grad norm: 1.433 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      342/  128728 | consumed samples:         5472 | consumed tokens:     11206656 | elapsed time per iteration (s): 15.24 | learning rate: 1.793E-06 | global batch size:    16 | lm loss: 9.113847E+00 | grad norm: 1.601 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      343/  128728 | consumed samples:         5488 | consumed tokens:     11239424 | elapsed time per iteration (s): 15.25 | learning rate: 1.798E-06 | global batch size:    16 | lm loss: 9.039911E+00 | grad norm: 2.029 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      344/  128728 | consumed samples:         5504 | consumed tokens:     11272192 | elapsed time per iteration (s): 15.25 | learning rate: 1.804E-06 | global batch size:    16 | lm loss: 8.952249E+00 | grad norm: 1.320 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      345/  128728 | consumed samples:         5520 | consumed tokens:     11304960 | elapsed time per iteration (s): 15.19 | learning rate: 1.809E-06 | global batch size:    16 | lm loss: 9.029071E+00 | grad norm: 2.857 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      346/  128728 | consumed samples:         5536 | consumed tokens:     11337728 | elapsed time per iteration (s): 15.22 | learning rate: 1.814E-06 | global batch size:    16 | lm loss: 8.957701E+00 | grad norm: 2.596 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      347/  128728 | consumed samples:         5552 | consumed tokens:     11370496 | elapsed time per iteration (s): 15.19 | learning rate: 1.819E-06 | global batch size:    16 | lm loss: 9.178146E+00 | grad norm: 2.227 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      348/  128728 | consumed samples:         5568 | consumed tokens:     11403264 | elapsed time per iteration (s): 15.24 | learning rate: 1.825E-06 | global batch size:    16 | lm loss: 8.739803E+00 | grad norm: 2.014 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      349/  128728 | consumed samples:         5584 | consumed tokens:     11436032 | elapsed time per iteration (s): 15.21 | learning rate: 1.830E-06 | global batch size:    16 | lm loss: 9.074715E+00 | grad norm: 1.441 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      350/  128728 | consumed samples:         5600 | consumed tokens:     11468800 | elapsed time per iteration (s): 15.23 | learning rate: 1.835E-06 | global batch size:    16 | lm loss: 8.816961E+00 | grad norm: 2.010 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      351/  128728 | consumed samples:         5616 | consumed tokens:     11501568 | elapsed time per iteration (s): 15.21 | learning rate: 1.840E-06 | global batch size:    16 | lm loss: 9.123592E+00 | grad norm: 2.014 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      352/  128728 | consumed samples:         5632 | consumed tokens:     11534336 | elapsed time per iteration (s): 15.24 | learning rate: 1.845E-06 | global batch size:    16 | lm loss: 9.053972E+00 | grad norm: 1.400 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      353/  128728 | consumed samples:         5648 | consumed tokens:     11567104 | elapsed time per iteration (s): 15.23 | learning rate: 1.851E-06 | global batch size:    16 | lm loss: 8.837742E+00 | grad norm: 1.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      354/  128728 | consumed samples:         5664 | consumed tokens:     11599872 | elapsed time per iteration (s): 15.25 | learning rate: 1.856E-06 | global batch size:    16 | lm loss: 8.724428E+00 | grad norm: 1.992 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      355/  128728 | consumed samples:         5680 | consumed tokens:     11632640 | elapsed time per iteration (s): 15.26 | learning rate: 1.861E-06 | global batch size:    16 | lm loss: 8.793618E+00 | grad norm: 1.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      356/  128728 | consumed samples:         5696 | consumed tokens:     11665408 | elapsed time per iteration (s): 15.22 | learning rate: 1.866E-06 | global batch size:    16 | lm loss: 8.806067E+00 | grad norm: 1.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      357/  128728 | consumed samples:         5712 | consumed tokens:     11698176 | elapsed time per iteration (s): 15.20 | learning rate: 1.872E-06 | global batch size:    16 | lm loss: 8.795446E+00 | grad norm: 1.873 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      358/  128728 | consumed samples:         5728 | consumed tokens:     11730944 | elapsed time per iteration (s): 15.24 | learning rate: 1.877E-06 | global batch size:    16 | lm loss: 8.763588E+00 | grad norm: 1.667 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      359/  128728 | consumed samples:         5744 | consumed tokens:     11763712 | elapsed time per iteration (s): 15.26 | learning rate: 1.882E-06 | global batch size:    16 | lm loss: 8.908950E+00 | grad norm: 1.528 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      360/  128728 | consumed samples:         5760 | consumed tokens:     11796480 | elapsed time per iteration (s): 15.19 | learning rate: 1.887E-06 | global batch size:    16 | lm loss: 8.781729E+00 | grad norm: 1.996 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      361/  128728 | consumed samples:         5776 | consumed tokens:     11829248 | elapsed time per iteration (s): 15.19 | learning rate: 1.893E-06 | global batch size:    16 | lm loss: 8.808187E+00 | grad norm: 1.622 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      362/  128728 | consumed samples:         5792 | consumed tokens:     11862016 | elapsed time per iteration (s): 15.25 | learning rate: 1.898E-06 | global batch size:    16 | lm loss: 8.742043E+00 | grad norm: 1.576 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      363/  128728 | consumed samples:         5808 | consumed tokens:     11894784 | elapsed time per iteration (s): 15.27 | learning rate: 1.903E-06 | global batch size:    16 | lm loss: 8.903679E+00 | grad norm: 1.292 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      364/  128728 | consumed samples:         5824 | consumed tokens:     11927552 | elapsed time per iteration (s): 15.25 | learning rate: 1.908E-06 | global batch size:    16 | lm loss: 8.821105E+00 | grad norm: 1.190 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      365/  128728 | consumed samples:         5840 | consumed tokens:     11960320 | elapsed time per iteration (s): 15.24 | learning rate: 1.914E-06 | global batch size:    16 | lm loss: 8.744251E+00 | grad norm: 2.032 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      366/  128728 | consumed samples:         5856 | consumed tokens:     11993088 | elapsed time per iteration (s): 15.24 | learning rate: 1.919E-06 | global batch size:    16 | lm loss: 8.918768E+00 | grad norm: 1.993 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      367/  128728 | consumed samples:         5872 | consumed tokens:     12025856 | elapsed time per iteration (s): 15.24 | learning rate: 1.924E-06 | global batch size:    16 | lm loss: 8.736933E+00 | grad norm: 3.109 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      368/  128728 | consumed samples:         5888 | consumed tokens:     12058624 | elapsed time per iteration (s): 15.24 | learning rate: 1.929E-06 | global batch size:    16 | lm loss: 8.928401E+00 | grad norm: 3.081 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      369/  128728 | consumed samples:         5904 | consumed tokens:     12091392 | elapsed time per iteration (s): 15.24 | learning rate: 1.935E-06 | global batch size:    16 | lm loss: 8.997413E+00 | grad norm: 2.562 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      370/  128728 | consumed samples:         5920 | consumed tokens:     12124160 | elapsed time per iteration (s): 15.18 | learning rate: 1.940E-06 | global batch size:    16 | lm loss: 8.656151E+00 | grad norm: 2.452 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      371/  128728 | consumed samples:         5936 | consumed tokens:     12156928 | elapsed time per iteration (s): 15.21 | learning rate: 1.945E-06 | global batch size:    16 | lm loss: 8.794637E+00 | grad norm: 2.540 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      372/  128728 | consumed samples:         5952 | consumed tokens:     12189696 | elapsed time per iteration (s): 15.24 | learning rate: 1.950E-06 | global batch size:    16 | lm loss: 8.713245E+00 | grad norm: 2.233 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      373/  128728 | consumed samples:         5968 | consumed tokens:     12222464 | elapsed time per iteration (s): 15.24 | learning rate: 1.956E-06 | global batch size:    16 | lm loss: 8.920404E+00 | grad norm: 1.386 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      374/  128728 | consumed samples:         5984 | consumed tokens:     12255232 | elapsed time per iteration (s): 15.24 | learning rate: 1.961E-06 | global batch size:    16 | lm loss: 8.724771E+00 | grad norm: 2.832 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      375/  128728 | consumed samples:         6000 | consumed tokens:     12288000 | elapsed time per iteration (s): 15.24 | learning rate: 1.966E-06 | global batch size:    16 | lm loss: 8.874722E+00 | grad norm: 2.098 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      376/  128728 | consumed samples:         6016 | consumed tokens:     12320768 | elapsed time per iteration (s): 15.24 | learning rate: 1.971E-06 | global batch size:    16 | lm loss: 8.530634E+00 | grad norm: 1.285 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      377/  128728 | consumed samples:         6032 | consumed tokens:     12353536 | elapsed time per iteration (s): 15.22 | learning rate: 1.977E-06 | global batch size:    16 | lm loss: 8.726177E+00 | grad norm: 2.358 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      378/  128728 | consumed samples:         6048 | consumed tokens:     12386304 | elapsed time per iteration (s): 15.24 | learning rate: 1.982E-06 | global batch size:    16 | lm loss: 8.662714E+00 | grad norm: 1.946 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      379/  128728 | consumed samples:         6064 | consumed tokens:     12419072 | elapsed time per iteration (s): 15.26 | learning rate: 1.987E-06 | global batch size:    16 | lm loss: 8.682480E+00 | grad norm: 1.877 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      380/  128728 | consumed samples:         6080 | consumed tokens:     12451840 | elapsed time per iteration (s): 15.24 | learning rate: 1.992E-06 | global batch size:    16 | lm loss: 8.867916E+00 | grad norm: 1.969 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      381/  128728 | consumed samples:         6096 | consumed tokens:     12484608 | elapsed time per iteration (s): 15.23 | learning rate: 1.998E-06 | global batch size:    16 | lm loss: 8.751642E+00 | grad norm: 2.013 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      382/  128728 | consumed samples:         6112 | consumed tokens:     12517376 | elapsed time per iteration (s): 15.27 | learning rate: 2.003E-06 | global batch size:    16 | lm loss: 8.746722E+00 | grad norm: 1.271 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      383/  128728 | consumed samples:         6128 | consumed tokens:     12550144 | elapsed time per iteration (s): 15.23 | learning rate: 2.008E-06 | global batch size:    16 | lm loss: 8.698657E+00 | grad norm: 2.237 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      384/  128728 | consumed samples:         6144 | consumed tokens:     12582912 | elapsed time per iteration (s): 15.27 | learning rate: 2.013E-06 | global batch size:    16 | lm loss: 8.771927E+00 | grad norm: 1.493 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      385/  128728 | consumed samples:         6160 | consumed tokens:     12615680 | elapsed time per iteration (s): 15.22 | learning rate: 2.019E-06 | global batch size:    16 | lm loss: 8.916728E+00 | grad norm: 2.030 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      386/  128728 | consumed samples:         6176 | consumed tokens:     12648448 | elapsed time per iteration (s): 15.25 | learning rate: 2.024E-06 | global batch size:    16 | lm loss: 8.761660E+00 | grad norm: 2.348 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration      387/  128728 | consumed samples:         6192 | consumed tokens:     12681216 | elapsed time per iteration (s): 15.26 | learning rate: 2.029E-06 | global batch size:    16 | lm loss: 8.588232E+00 | grad norm: 1.430 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      388/  128728 | consumed samples:         6208 | consumed tokens:     12713984 | elapsed time per iteration (s): 15.22 | learning rate: 2.034E-06 | global batch size:    16 | lm loss: 8.639584E+00 | grad norm: 2.459 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      389/  128728 | consumed samples:         6224 | consumed tokens:     12746752 | elapsed time per iteration (s): 15.25 | learning rate: 2.039E-06 | global batch size:    16 | lm loss: 8.722241E+00 | grad norm: 2.932 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      390/  128728 | consumed samples:         6240 | consumed tokens:     12779520 | elapsed time per iteration (s): 15.25 | learning rate: 2.045E-06 | global batch size:    16 | lm loss: 8.805967E+00 | grad norm: 2.410 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      391/  128728 | consumed samples:         6256 | consumed tokens:     12812288 | elapsed time per iteration (s): 15.25 | learning rate: 2.050E-06 | global batch size:    16 | lm loss: 8.767456E+00 | grad norm: 1.182 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      392/  128728 | consumed samples:         6272 | consumed tokens:     12845056 | elapsed time per iteration (s): 15.24 | learning rate: 2.055E-06 | global batch size:    16 | lm loss: 8.722268E+00 | grad norm: 2.268 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      393/  128728 | consumed samples:         6288 | consumed tokens:     12877824 | elapsed time per iteration (s): 15.22 | learning rate: 2.060E-06 | global batch size:    16 | lm loss: 8.755892E+00 | grad norm: 1.321 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      394/  128728 | consumed samples:         6304 | consumed tokens:     12910592 | elapsed time per iteration (s): 15.25 | learning rate: 2.066E-06 | global batch size:    16 | lm loss: 8.785294E+00 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration      395/  128728 | consumed samples:         6320 | consumed tokens:     12943360 | elapsed time per iteration (s): 15.23 | learning rate: 2.071E-06 | global batch size:    16 | lm loss: 8.598062E+00 | grad norm: 1.885 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      396/  128728 | consumed samples:         6336 | consumed tokens:     12976128 | elapsed time per iteration (s): 15.17 | learning rate: 2.076E-06 | global batch size:    16 | lm loss: 8.763098E+00 | grad norm: 1.409 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration      397/  128728 | consumed samples:         6352 | consumed tokens:     13008896 | elapsed time per iteration (s): 15.23 | learning rate: 2.081E-06 | global batch size:    16 | lm loss: 8.675168E+00 | grad norm: 1.460 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      398/  128728 | consumed samples:         6368 | consumed tokens:     13041664 | elapsed time per iteration (s): 15.23 | learning rate: 2.087E-06 | global batch size:    16 | lm loss: 8.532794E+00 | grad norm: 1.052 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      399/  128728 | consumed samples:         6384 | consumed tokens:     13074432 | elapsed time per iteration (s): 15.22 | learning rate: 2.092E-06 | global batch size:    16 | lm loss: 8.650246E+00 | grad norm: 1.473 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      400/  128728 | consumed samples:         6400 | consumed tokens:     13107200 | elapsed time per iteration (s): 15.22 | learning rate: 2.097E-06 | global batch size:    16 | lm loss: 8.503979E+00 | grad norm: 1.464 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      401/  128728 | consumed samples:         6416 | consumed tokens:     13139968 | elapsed time per iteration (s): 15.22 | learning rate: 2.102E-06 | global batch size:    16 | lm loss: 8.529534E+00 | grad norm: 1.292 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      402/  128728 | consumed samples:         6432 | consumed tokens:     13172736 | elapsed time per iteration (s): 15.24 | learning rate: 2.108E-06 | global batch size:    16 | lm loss: 8.620544E+00 | grad norm: 1.908 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      403/  128728 | consumed samples:         6448 | consumed tokens:     13205504 | elapsed time per iteration (s): 15.26 | learning rate: 2.113E-06 | global batch size:    16 | lm loss: 8.570610E+00 | grad norm: 1.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      404/  128728 | consumed samples:         6464 | consumed tokens:     13238272 | elapsed time per iteration (s): 15.25 | learning rate: 2.118E-06 | global batch size:    16 | lm loss: 8.559856E+00 | grad norm: 1.191 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      405/  128728 | consumed samples:         6480 | consumed tokens:     13271040 | elapsed time per iteration (s): 15.28 | learning rate: 2.123E-06 | global batch size:    16 | lm loss: 8.603176E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration      406/  128728 | consumed samples:         6496 | consumed tokens:     13303808 | elapsed time per iteration (s): 15.18 | learning rate: 2.129E-06 | global batch size:    16 | lm loss: 8.468877E+00 | grad norm: 1.470 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      407/  128728 | consumed samples:         6512 | consumed tokens:     13336576 | elapsed time per iteration (s): 15.25 | learning rate: 2.134E-06 | global batch size:    16 | lm loss: 8.496984E+00 | grad norm: 1.102 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      408/  128728 | consumed samples:         6528 | consumed tokens:     13369344 | elapsed time per iteration (s): 15.18 | learning rate: 2.139E-06 | global batch size:    16 | lm loss: 8.568752E+00 | grad norm: 1.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      409/  128728 | consumed samples:         6544 | consumed tokens:     13402112 | elapsed time per iteration (s): 15.17 | learning rate: 2.144E-06 | global batch size:    16 | lm loss: 8.504786E+00 | grad norm: 2.098 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration      410/  128728 | consumed samples:         6560 | consumed tokens:     13434880 | elapsed time per iteration (s): 15.23 | learning rate: 2.150E-06 | global batch size:    16 | lm loss: 8.729224E+00 | grad norm: 1.265 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      411/  128728 | consumed samples:         6576 | consumed tokens:     13467648 | elapsed time per iteration (s): 15.18 | learning rate: 2.155E-06 | global batch size:    16 | lm loss: 8.696260E+00 | grad norm: 3.105 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      412/  128728 | consumed samples:         6592 | consumed tokens:     13500416 | elapsed time per iteration (s): 15.24 | learning rate: 2.160E-06 | global batch size:    16 | lm loss: 8.525265E+00 | grad norm: 1.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      413/  128728 | consumed samples:         6608 | consumed tokens:     13533184 | elapsed time per iteration (s): 15.22 | learning rate: 2.165E-06 | global batch size:    16 | lm loss: 8.653839E+00 | grad norm: 3.276 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      414/  128728 | consumed samples:         6624 | consumed tokens:     13565952 | elapsed time per iteration (s): 15.24 | learning rate: 2.171E-06 | global batch size:    16 | lm loss: 8.959422E+00 | grad norm: 4.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      415/  128728 | consumed samples:         6640 | consumed tokens:     13598720 | elapsed time per iteration (s): 15.23 | learning rate: 2.176E-06 | global batch size:    16 | lm loss: 8.594271E+00 | grad norm: 1.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      416/  128728 | consumed samples:         6656 | consumed tokens:     13631488 | elapsed time per iteration (s): 15.21 | learning rate: 2.181E-06 | global batch size:    16 | lm loss: 8.770068E+00 | grad norm: 1.618 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      417/  128728 | consumed samples:         6672 | consumed tokens:     13664256 | elapsed time per iteration (s): 15.21 | learning rate: 2.186E-06 | global batch size:    16 | lm loss: 8.684436E+00 | grad norm: 2.121 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      418/  128728 | consumed samples:         6688 | consumed tokens:     13697024 | elapsed time per iteration (s): 15.23 | learning rate: 2.192E-06 | global batch size:    16 | lm loss: 8.469204E+00 | grad norm: 1.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      419/  128728 | consumed samples:         6704 | consumed tokens:     13729792 | elapsed time per iteration (s): 15.19 | learning rate: 2.197E-06 | global batch size:    16 | lm loss: 8.532163E+00 | grad norm: 2.872 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      420/  128728 | consumed samples:         6720 | consumed tokens:     13762560 | elapsed time per iteration (s): 15.25 | learning rate: 2.202E-06 | global batch size:    16 | lm loss: 8.762425E+00 | grad norm: 2.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      421/  128728 | consumed samples:         6736 | consumed tokens:     13795328 | elapsed time per iteration (s): 15.20 | learning rate: 2.207E-06 | global batch size:    16 | lm loss: 8.625541E+00 | grad norm: 1.791 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      422/  128728 | consumed samples:         6752 | consumed tokens:     13828096 | elapsed time per iteration (s): 15.20 | learning rate: 2.213E-06 | global batch size:    16 | lm loss: 8.546190E+00 | grad norm: 2.494 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      423/  128728 | consumed samples:         6768 | consumed tokens:     13860864 | elapsed time per iteration (s): 15.24 | learning rate: 2.218E-06 | global batch size:    16 | lm loss: 8.478785E+00 | grad norm: 1.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      424/  128728 | consumed samples:         6784 | consumed tokens:     13893632 | elapsed time per iteration (s): 15.23 | learning rate: 2.223E-06 | global batch size:    16 | lm loss: 8.501416E+00 | grad norm: 2.448 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      425/  128728 | consumed samples:         6800 | consumed tokens:     13926400 | elapsed time per iteration (s): 15.25 | learning rate: 2.228E-06 | global batch size:    16 | lm loss: 8.284233E+00 | grad norm: 2.423 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      426/  128728 | consumed samples:         6816 | consumed tokens:     13959168 | elapsed time per iteration (s): 15.20 | learning rate: 2.233E-06 | global batch size:    16 | lm loss: 8.605833E+00 | grad norm: 1.574 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      427/  128728 | consumed samples:         6832 | consumed tokens:     13991936 | elapsed time per iteration (s): 15.22 | learning rate: 2.239E-06 | global batch size:    16 | lm loss: 8.659263E+00 | grad norm: 2.362 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      428/  128728 | consumed samples:         6848 | consumed tokens:     14024704 | elapsed time per iteration (s): 15.26 | learning rate: 2.244E-06 | global batch size:    16 | lm loss: 8.621931E+00 | grad norm: 1.420 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      429/  128728 | consumed samples:         6864 | consumed tokens:     14057472 | elapsed time per iteration (s): 15.22 | learning rate: 2.249E-06 | global batch size:    16 | lm loss: 8.517220E+00 | grad norm: 2.533 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      430/  128728 | consumed samples:         6880 | consumed tokens:     14090240 | elapsed time per iteration (s): 15.17 | learning rate: 2.254E-06 | global batch size:    16 | lm loss: 8.515087E+00 | grad norm: 2.479 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      431/  128728 | consumed samples:         6896 | consumed tokens:     14123008 | elapsed time per iteration (s): 15.24 | learning rate: 2.260E-06 | global batch size:    16 | lm loss: 8.327739E+00 | grad norm: 1.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      432/  128728 | consumed samples:         6912 | consumed tokens:     14155776 | elapsed time per iteration (s): 15.20 | learning rate: 2.265E-06 | global batch size:    16 | lm loss: 8.415800E+00 | grad norm: 1.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      433/  128728 | consumed samples:         6928 | consumed tokens:     14188544 | elapsed time per iteration (s): 15.19 | learning rate: 2.270E-06 | global batch size:    16 | lm loss: 8.553007E+00 | grad norm: 2.123 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      434/  128728 | consumed samples:         6944 | consumed tokens:     14221312 | elapsed time per iteration (s): 15.22 | learning rate: 2.275E-06 | global batch size:    16 | lm loss: 8.405775E+00 | grad norm: 1.633 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      435/  128728 | consumed samples:         6960 | consumed tokens:     14254080 | elapsed time per iteration (s): 15.24 | learning rate: 2.281E-06 | global batch size:    16 | lm loss: 8.622299E+00 | grad norm: 2.424 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      436/  128728 | consumed samples:         6976 | consumed tokens:     14286848 | elapsed time per iteration (s): 15.24 | learning rate: 2.286E-06 | global batch size:    16 | lm loss: 8.557680E+00 | grad norm: 1.607 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      437/  128728 | consumed samples:         6992 | consumed tokens:     14319616 | elapsed time per iteration (s): 15.22 | learning rate: 2.291E-06 | global batch size:    16 | lm loss: 8.482496E+00 | grad norm: 1.482 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      438/  128728 | consumed samples:         7008 | consumed tokens:     14352384 | elapsed time per iteration (s): 15.18 | learning rate: 2.296E-06 | global batch size:    16 | lm loss: 8.464623E+00 | grad norm: 1.392 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      439/  128728 | consumed samples:         7024 | consumed tokens:     14385152 | elapsed time per iteration (s): 15.23 | learning rate: 2.302E-06 | global batch size:    16 | lm loss: 8.540413E+00 | grad norm: 1.409 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      440/  128728 | consumed samples:         7040 | consumed tokens:     14417920 | elapsed time per iteration (s): 15.21 | learning rate: 2.307E-06 | global batch size:    16 | lm loss: 8.238720E+00 | grad norm: 1.500 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      441/  128728 | consumed samples:         7056 | consumed tokens:     14450688 | elapsed time per iteration (s): 15.23 | learning rate: 2.312E-06 | global batch size:    16 | lm loss: 8.452703E+00 | grad norm: 1.373 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      442/  128728 | consumed samples:         7072 | consumed tokens:     14483456 | elapsed time per iteration (s): 15.21 | learning rate: 2.317E-06 | global batch size:    16 | lm loss: 8.485899E+00 | grad norm: 1.983 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      443/  128728 | consumed samples:         7088 | consumed tokens:     14516224 | elapsed time per iteration (s): 15.23 | learning rate: 2.323E-06 | global batch size:    16 | lm loss: 8.319616E+00 | grad norm: 1.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      444/  128728 | consumed samples:         7104 | consumed tokens:     14548992 | elapsed time per iteration (s): 15.19 | learning rate: 2.328E-06 | global batch size:    16 | lm loss: 8.515532E+00 | grad norm: 2.338 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      445/  128728 | consumed samples:         7120 | consumed tokens:     14581760 | elapsed time per iteration (s): 15.26 | learning rate: 2.333E-06 | global batch size:    16 | lm loss: 8.538868E+00 | grad norm: 1.501 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      446/  128728 | consumed samples:         7136 | consumed tokens:     14614528 | elapsed time per iteration (s): 15.24 | learning rate: 2.338E-06 | global batch size:    16 | lm loss: 8.450447E+00 | grad norm: 2.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      447/  128728 | consumed samples:         7152 | consumed tokens:     14647296 | elapsed time per iteration (s): 15.21 | learning rate: 2.344E-06 | global batch size:    16 | lm loss: 8.434704E+00 | grad norm: 1.665 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      448/  128728 | consumed samples:         7168 | consumed tokens:     14680064 | elapsed time per iteration (s): 15.23 | learning rate: 2.349E-06 | global batch size:    16 | lm loss: 8.462121E+00 | grad norm: 2.163 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      449/  128728 | consumed samples:         7184 | consumed tokens:     14712832 | elapsed time per iteration (s): 15.28 | learning rate: 2.354E-06 | global batch size:    16 | lm loss: 8.375209E+00 | grad norm: 1.431 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration      450/  128728 | consumed samples:         7200 | consumed tokens:     14745600 | elapsed time per iteration (s): 15.23 | learning rate: 2.359E-06 | global batch size:    16 | lm loss: 8.421515E+00 | grad norm: 3.121 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      451/  128728 | consumed samples:         7216 | consumed tokens:     14778368 | elapsed time per iteration (s): 15.25 | learning rate: 2.365E-06 | global batch size:    16 | lm loss: 8.501270E+00 | grad norm: 2.456 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      452/  128728 | consumed samples:         7232 | consumed tokens:     14811136 | elapsed time per iteration (s): 15.21 | learning rate: 2.370E-06 | global batch size:    16 | lm loss: 8.473967E+00 | grad norm: 2.406 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      453/  128728 | consumed samples:         7248 | consumed tokens:     14843904 | elapsed time per iteration (s): 15.22 | learning rate: 2.375E-06 | global batch size:    16 | lm loss: 8.457233E+00 | grad norm: 2.141 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      454/  128728 | consumed samples:         7264 | consumed tokens:     14876672 | elapsed time per iteration (s): 15.20 | learning rate: 2.380E-06 | global batch size:    16 | lm loss: 8.371508E+00 | grad norm: 2.295 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      455/  128728 | consumed samples:         7280 | consumed tokens:     14909440 | elapsed time per iteration (s): 15.23 | learning rate: 2.386E-06 | global batch size:    16 | lm loss: 8.326353E+00 | grad norm: 2.136 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      456/  128728 | consumed samples:         7296 | consumed tokens:     14942208 | elapsed time per iteration (s): 15.22 | learning rate: 2.391E-06 | global batch size:    16 | lm loss: 8.511120E+00 | grad norm: 1.501 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      457/  128728 | consumed samples:         7312 | consumed tokens:     14974976 | elapsed time per iteration (s): 15.15 | learning rate: 2.396E-06 | global batch size:    16 | lm loss: 8.472582E+00 | grad norm: 1.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration      458/  128728 | consumed samples:         7328 | consumed tokens:     15007744 | elapsed time per iteration (s): 15.22 | learning rate: 2.401E-06 | global batch size:    16 | lm loss: 8.273072E+00 | grad norm: 7.604 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      459/  128728 | consumed samples:         7344 | consumed tokens:     15040512 | elapsed time per iteration (s): 15.22 | learning rate: 2.406E-06 | global batch size:    16 | lm loss: 8.573572E+00 | grad norm: 3.482 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      460/  128728 | consumed samples:         7360 | consumed tokens:     15073280 | elapsed time per iteration (s): 15.22 | learning rate: 2.412E-06 | global batch size:    16 | lm loss: 8.714581E+00 | grad norm: 2.643 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      461/  128728 | consumed samples:         7376 | consumed tokens:     15106048 | elapsed time per iteration (s): 15.22 | learning rate: 2.417E-06 | global batch size:    16 | lm loss: 8.068087E+00 | grad norm: 2.726 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      462/  128728 | consumed samples:         7392 | consumed tokens:     15138816 | elapsed time per iteration (s): 15.24 | learning rate: 2.422E-06 | global batch size:    16 | lm loss: 8.380728E+00 | grad norm: 3.734 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      463/  128728 | consumed samples:         7408 | consumed tokens:     15171584 | elapsed time per iteration (s): 15.24 | learning rate: 2.427E-06 | global batch size:    16 | lm loss: 8.633892E+00 | grad norm: 1.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      464/  128728 | consumed samples:         7424 | consumed tokens:     15204352 | elapsed time per iteration (s): 15.24 | learning rate: 2.433E-06 | global batch size:    16 | lm loss: 8.328359E+00 | grad norm: 2.455 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      465/  128728 | consumed samples:         7440 | consumed tokens:     15237120 | elapsed time per iteration (s): 15.23 | learning rate: 2.438E-06 | global batch size:    16 | lm loss: 8.553513E+00 | grad norm: 2.396 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      466/  128728 | consumed samples:         7456 | consumed tokens:     15269888 | elapsed time per iteration (s): 15.24 | learning rate: 2.443E-06 | global batch size:    16 | lm loss: 8.325161E+00 | grad norm: 1.684 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      467/  128728 | consumed samples:         7472 | consumed tokens:     15302656 | elapsed time per iteration (s): 15.25 | learning rate: 2.448E-06 | global batch size:    16 | lm loss: 8.422958E+00 | grad norm: 2.124 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      468/  128728 | consumed samples:         7488 | consumed tokens:     15335424 | elapsed time per iteration (s): 15.24 | learning rate: 2.454E-06 | global batch size:    16 | lm loss: 8.435691E+00 | grad norm: 1.864 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      469/  128728 | consumed samples:         7504 | consumed tokens:     15368192 | elapsed time per iteration (s): 15.25 | learning rate: 2.459E-06 | global batch size:    16 | lm loss: 8.224545E+00 | grad norm: 1.955 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      470/  128728 | consumed samples:         7520 | consumed tokens:     15400960 | elapsed time per iteration (s): 15.23 | learning rate: 2.464E-06 | global batch size:    16 | lm loss: 8.631124E+00 | grad norm: 2.884 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      471/  128728 | consumed samples:         7536 | consumed tokens:     15433728 | elapsed time per iteration (s): 15.23 | learning rate: 2.469E-06 | global batch size:    16 | lm loss: 8.445702E+00 | grad norm: 2.323 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      472/  128728 | consumed samples:         7552 | consumed tokens:     15466496 | elapsed time per iteration (s): 15.24 | learning rate: 2.475E-06 | global batch size:    16 | lm loss: 8.381889E+00 | grad norm: 2.096 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      473/  128728 | consumed samples:         7568 | consumed tokens:     15499264 | elapsed time per iteration (s): 15.21 | learning rate: 2.480E-06 | global batch size:    16 | lm loss: 8.303854E+00 | grad norm: 1.661 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      474/  128728 | consumed samples:         7584 | consumed tokens:     15532032 | elapsed time per iteration (s): 15.24 | learning rate: 2.485E-06 | global batch size:    16 | lm loss: 8.326303E+00 | grad norm: 2.465 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      475/  128728 | consumed samples:         7600 | consumed tokens:     15564800 | elapsed time per iteration (s): 15.18 | learning rate: 2.490E-06 | global batch size:    16 | lm loss: 8.428562E+00 | grad norm: 2.307 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      476/  128728 | consumed samples:         7616 | consumed tokens:     15597568 | elapsed time per iteration (s): 15.19 | learning rate: 2.496E-06 | global batch size:    16 | lm loss: 8.343838E+00 | grad norm: 1.918 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      477/  128728 | consumed samples:         7632 | consumed tokens:     15630336 | elapsed time per iteration (s): 15.23 | learning rate: 2.501E-06 | global batch size:    16 | lm loss: 8.380249E+00 | grad norm: 1.896 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      478/  128728 | consumed samples:         7648 | consumed tokens:     15663104 | elapsed time per iteration (s): 15.25 | learning rate: 2.506E-06 | global batch size:    16 | lm loss: 8.442167E+00 | grad norm: 1.867 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      479/  128728 | consumed samples:         7664 | consumed tokens:     15695872 | elapsed time per iteration (s): 15.26 | learning rate: 2.511E-06 | global batch size:    16 | lm loss: 8.244312E+00 | grad norm: 2.274 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      480/  128728 | consumed samples:         7680 | consumed tokens:     15728640 | elapsed time per iteration (s): 15.22 | learning rate: 2.517E-06 | global batch size:    16 | lm loss: 8.509534E+00 | grad norm: 1.639 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      481/  128728 | consumed samples:         7696 | consumed tokens:     15761408 | elapsed time per iteration (s): 15.26 | learning rate: 2.522E-06 | global batch size:    16 | lm loss: 8.340829E+00 | grad norm: 2.525 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      482/  128728 | consumed samples:         7712 | consumed tokens:     15794176 | elapsed time per iteration (s): 15.23 | learning rate: 2.527E-06 | global batch size:    16 | lm loss: 8.174219E+00 | grad norm: 1.454 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      483/  128728 | consumed samples:         7728 | consumed tokens:     15826944 | elapsed time per iteration (s): 15.23 | learning rate: 2.532E-06 | global batch size:    16 | lm loss: 8.252996E+00 | grad norm: 2.464 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      484/  128728 | consumed samples:         7744 | consumed tokens:     15859712 | elapsed time per iteration (s): 15.25 | learning rate: 2.538E-06 | global batch size:    16 | lm loss: 8.682319E+00 | grad norm: 2.568 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      485/  128728 | consumed samples:         7760 | consumed tokens:     15892480 | elapsed time per iteration (s): 15.23 | learning rate: 2.543E-06 | global batch size:    16 | lm loss: 8.436552E+00 | grad norm: 1.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      486/  128728 | consumed samples:         7776 | consumed tokens:     15925248 | elapsed time per iteration (s): 15.22 | learning rate: 2.548E-06 | global batch size:    16 | lm loss: 8.348639E+00 | grad norm: 2.126 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      487/  128728 | consumed samples:         7792 | consumed tokens:     15958016 | elapsed time per iteration (s): 15.25 | learning rate: 2.553E-06 | global batch size:    16 | lm loss: 8.289967E+00 | grad norm: 1.481 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      488/  128728 | consumed samples:         7808 | consumed tokens:     15990784 | elapsed time per iteration (s): 15.24 | learning rate: 2.559E-06 | global batch size:    16 | lm loss: 8.350722E+00 | grad norm: 1.249 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      489/  128728 | consumed samples:         7824 | consumed tokens:     16023552 | elapsed time per iteration (s): 15.24 | learning rate: 2.564E-06 | global batch size:    16 | lm loss: 8.134272E+00 | grad norm: 1.163 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      490/  128728 | consumed samples:         7840 | consumed tokens:     16056320 | elapsed time per iteration (s): 15.21 | learning rate: 2.569E-06 | global batch size:    16 | lm loss: 8.318563E+00 | grad norm: 1.376 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      491/  128728 | consumed samples:         7856 | consumed tokens:     16089088 | elapsed time per iteration (s): 15.23 | learning rate: 2.574E-06 | global batch size:    16 | lm loss: 8.154824E+00 | grad norm: 1.389 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      492/  128728 | consumed samples:         7872 | consumed tokens:     16121856 | elapsed time per iteration (s): 15.19 | learning rate: 2.580E-06 | global batch size:    16 | lm loss: 8.418050E+00 | grad norm: 1.445 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      493/  128728 | consumed samples:         7888 | consumed tokens:     16154624 | elapsed time per iteration (s): 15.20 | learning rate: 2.585E-06 | global batch size:    16 | lm loss: 8.386696E+00 | grad norm: 2.251 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      494/  128728 | consumed samples:         7904 | consumed tokens:     16187392 | elapsed time per iteration (s): 15.21 | learning rate: 2.590E-06 | global batch size:    16 | lm loss: 8.342263E+00 | grad norm: 1.546 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      495/  128728 | consumed samples:         7920 | consumed tokens:     16220160 | elapsed time per iteration (s): 15.23 | learning rate: 2.595E-06 | global batch size:    16 | lm loss: 8.309517E+00 | grad norm: 1.872 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      496/  128728 | consumed samples:         7936 | consumed tokens:     16252928 | elapsed time per iteration (s): 15.26 | learning rate: 2.600E-06 | global batch size:    16 | lm loss: 8.248186E+00 | grad norm: 1.297 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      497/  128728 | consumed samples:         7952 | consumed tokens:     16285696 | elapsed time per iteration (s): 15.23 | learning rate: 2.606E-06 | global batch size:    16 | lm loss: 8.194453E+00 | grad norm: 1.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      498/  128728 | consumed samples:         7968 | consumed tokens:     16318464 | elapsed time per iteration (s): 15.25 | learning rate: 2.611E-06 | global batch size:    16 | lm loss: 8.389359E+00 | grad norm: 2.262 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      499/  128728 | consumed samples:         7984 | consumed tokens:     16351232 | elapsed time per iteration (s): 15.22 | learning rate: 2.616E-06 | global batch size:    16 | lm loss: 8.140213E+00 | grad norm: 1.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      500/  128728 | consumed samples:         8000 | consumed tokens:     16384000 | elapsed time per iteration (s): 15.26 | learning rate: 2.621E-06 | global batch size:    16 | lm loss: 8.574575E+00 | grad norm: 1.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default0]:saving checkpoint at iteration     500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:[2022-03-03 08:00:44,912] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/mp_rank_00_model_states.pt
[default1]:[2022-03-03 08:00:45,317] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/mp_rank_01_model_states.pt
[default5]:[2022-03-03 08:01:22,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt
[default1]:[2022-03-03 08:01:25,322] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt
[default1]:[2022-03-03 08:01:25,803] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default7]:[2022-03-03 08:01:25,871] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt
[default6]:[2022-03-03 08:01:27,645] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default0]:[2022-03-03 08:01:27,898] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default7]:[2022-03-03 08:01:28,198] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default0]:[2022-03-03 08:01:28,281] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default2]:[2022-03-03 08:01:28,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt
[default4]:[2022-03-03 08:01:28,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default2]:[2022-03-03 08:01:28,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 08:01:28,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default3]:[2022-03-03 08:01:28,483] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default1]:[2022-03-03 08:01:28,611] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt
[default1]:[2022-03-03 08:01:28,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default3]:[2022-03-03 08:01:28,866] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt
[default7]:[2022-03-03 08:01:29,048] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt
[default4]:[2022-03-03 08:01:29,140] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default2]:[2022-03-03 08:01:29,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default5]:[2022-03-03 08:01:29,220] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 08:01:29,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt
[default0]:[2022-03-03 08:01:29,436] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default5]:[2022-03-03 08:01:29,424] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default7]:[2022-03-03 08:01:29,515] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default1]:[2022-03-03 08:01:29,670] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 08:01:29,851] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default6]:[2022-03-03 08:01:29,915] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default4]:[2022-03-03 08:01:30,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt
[default6]:[2022-03-03 08:01:30,143] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt
[default3]:[2022-03-03 08:01:30,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt
[default4]:[2022-03-03 08:01:30,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 08:01:31,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default3]:[2022-03-03 08:01:31,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default6]:[2022-03-03 08:01:31,158] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default1]:[2022-03-03 08:01:31,296] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt
[default7]:[2022-03-03 08:01:31,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default0]:[2022-03-03 08:01:32,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt
[default3]:[2022-03-03 08:01:32,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt
[default6]:[2022-03-03 08:01:32,413] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt
[default4]:[2022-03-03 08:01:32,460] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt
[default1]:[2022-03-03 08:01:32,379] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default7]:[2022-03-03 08:01:32,453] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 08:01:32,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 08:01:32,451] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt
[default1]:[2022-03-03 08:01:32,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 08:01:32,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default2]:[2022-03-03 08:01:32,471] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 08:01:32,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default4]:[2022-03-03 08:01:32,578] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt
[default7]:[2022-03-03 08:01:32,567] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt
[default7]:[2022-03-03 08:01:32,550] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt
[default4]:[2022-03-03 08:01:32,662] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt
[default4]:[2022-03-03 08:01:32,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default4]:[2022-03-03 08:01:32,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt
[default4]:[2022-03-03 08:01:32,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt
[default0]:[2022-03-03 08:01:32,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt
[default6]:[2022-03-03 08:01:32,705] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 08:01:32,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 08:01:32,714] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 08:01:32,730] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 08:01:32,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt
[default0]:[2022-03-03 08:01:32,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt
[default0]:[2022-03-03 08:01:32,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 08:01:32,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 08:01:32,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 08:01:32,727] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt
[default2]:[2022-03-03 08:01:32,709] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 08:01:32,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 08:01:32,764] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt
[default0]:[2022-03-03 08:01:32,758] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt
[default2]:[2022-03-03 08:01:32,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt
[default0]:[2022-03-03 08:01:32,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
[default5]:[2022-03-03 08:01:32,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt
[default6]:[2022-03-03 08:01:32,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt
[default6]:[2022-03-03 08:01:32,758] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt
[default2]:[2022-03-03 08:01:32,807] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default3]:[2022-03-03 08:01:32,842] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt
[default0]:[2022-03-03 08:01:32,800] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt
[default1]:[2022-03-03 08:01:32,823] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt
[default5]:[2022-03-03 08:01:32,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default4]:[2022-03-03 08:01:32,878] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt
[default7]:[2022-03-03 08:01:32,840] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 08:01:32,915] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default1]:[2022-03-03 08:01:32,957] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt
[default5]:[2022-03-03 08:01:32,938] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt
[default5]:[2022-03-03 08:01:32,957] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 08:01:32,924] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt
[default1]:[2022-03-03 08:01:33,024] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt
[default7]:[2022-03-03 08:01:32,974] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt
[default2]:[2022-03-03 08:01:32,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 08:01:33,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt
[default0]:[2022-03-03 08:01:33,045] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt
[default0]:[2022-03-03 08:01:33,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt
[default2]:[2022-03-03 08:01:33,171] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 08:01:33,150] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default0]:[2022-03-03 08:01:33,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt
[default2]:[2022-03-03 08:01:33,186] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default4]:[2022-03-03 08:01:33,245] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 08:01:33,388] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 08:01:33,443] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt
[default3]:[2022-03-03 08:01:33,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 08:01:33,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default3]:[2022-03-03 08:01:33,377] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default4]:[2022-03-03 08:01:33,400] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt
[default2]:[2022-03-03 08:01:33,420] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt
[default3]:[2022-03-03 08:01:33,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt
[default1]:[2022-03-03 08:01:33,476] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 08:01:33,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt
[default1]:[2022-03-03 08:01:33,514] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt
[default0]:[2022-03-03 08:01:33,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 08:01:33,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt
[default4]:[2022-03-03 08:01:33,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default7]:[2022-03-03 08:01:33,602] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt
[default1]:[2022-03-03 08:01:33,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default1]:[2022-03-03 08:01:33,604] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default6]:[2022-03-03 08:01:33,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt
[default6]:[2022-03-03 08:01:33,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt
[default3]:[2022-03-03 08:01:33,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default7]:[2022-03-03 08:01:33,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 08:01:33,613] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default2]:[2022-03-03 08:01:33,724] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 08:01:33,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 08:01:33,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 08:01:33,830] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt
[default1]:[2022-03-03 08:01:33,794] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 08:01:33,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt
[default6]:[2022-03-03 08:01:33,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 08:01:33,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 08:01:33,896] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt
[default1]:[2022-03-03 08:01:33,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default5]:[2022-03-03 08:01:33,912] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 08:01:33,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt
[default4]:[2022-03-03 08:01:33,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default6]:[2022-03-03 08:01:33,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 08:01:33,934] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default3]:[2022-03-03 08:01:33,983] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt
[default1]:[2022-03-03 08:01:33,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 08:01:34,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default6]:[2022-03-03 08:01:34,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 08:01:34,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt
[default2]:[2022-03-03 08:01:34,048] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default1]:[2022-03-03 08:01:34,082] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt
[default0]:[2022-03-03 08:01:34,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default4]:[2022-03-03 08:01:34,150] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default3]:[2022-03-03 08:01:34,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt
[default5]:[2022-03-03 08:01:34,245] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default1]:[2022-03-03 08:01:34,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default7]:[2022-03-03 08:01:34,223] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 08:01:34,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt
[default7]:[2022-03-03 08:01:34,202] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default3]:[2022-03-03 08:01:34,284] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default1]:[2022-03-03 08:01:34,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt
[default5]:[2022-03-03 08:01:34,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt
[default7]:[2022-03-03 08:01:34,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default4]:[2022-03-03 08:01:34,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt
[default1]:[2022-03-03 08:01:34,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default2]:[2022-03-03 08:01:34,366] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default0]:[2022-03-03 08:01:34,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 08:01:34,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt
[default0]:[2022-03-03 08:01:34,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt
[default0]:[2022-03-03 08:01:34,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt
[default3]:[2022-03-03 08:01:34,492] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt
[default1]:[2022-03-03 08:01:34,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default5]:[2022-03-03 08:01:34,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt
[default5]:[2022-03-03 08:01:34,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 08:01:34,544] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt
[default5]:[2022-03-03 08:01:34,505] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 08:01:34,486] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt
[default7]:[2022-03-03 08:01:34,568] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt
[default7]:[2022-03-03 08:01:34,562] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt
[default2]:[2022-03-03 08:01:34,591] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt
[default1]:[2022-03-03 08:01:34,558] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 08:01:34,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt
[default0]:[2022-03-03 08:01:34,601] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 08:01:34,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default5]:[2022-03-03 08:01:34,643] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default2]:[2022-03-03 08:01:34,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 08:01:34,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt
[default0]:[2022-03-03 08:01:34,750] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt
[default4]:[2022-03-03 08:01:34,700] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default2]:[2022-03-03 08:01:34,715] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 08:01:34,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default7]:[2022-03-03 08:01:34,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default5]:[2022-03-03 08:01:34,805] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default6]:[2022-03-03 08:01:34,868] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default7]:[2022-03-03 08:01:34,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default7]:[2022-03-03 08:01:34,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default5]:[2022-03-03 08:01:34,848] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt
[default3]:[2022-03-03 08:01:34,897] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 08:01:34,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default0]:[2022-03-03 08:01:34,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt
[default4]:[2022-03-03 08:01:34,913] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt
[default5]:[2022-03-03 08:01:34,960] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 08:01:34,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
[default1]:[2022-03-03 08:01:34,923] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt
[default1]:[2022-03-03 08:01:34,983] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt
[default5]:[2022-03-03 08:01:34,980] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 08:01:34,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
[default1]:[2022-03-03 08:01:34,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt
[default6]:[2022-03-03 08:01:35,036] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt
[default5]:[2022-03-03 08:01:35,069] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 08:01:35,049] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 08:01:35,058] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 08:01:35,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default1]:[2022-03-03 08:01:35,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 08:01:35,126] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt
[default3]:[2022-03-03 08:01:35,111] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt
[default2]:[2022-03-03 08:01:35,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default0]:[2022-03-03 08:01:35,096] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default0]:[2022-03-03 08:01:35,096] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 08:01:35,142] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default0]:[2022-03-03 08:01:35,106] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt
[default2]:[2022-03-03 08:01:35,138] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt
[default3]:[2022-03-03 08:01:35,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt
[default6]:[2022-03-03 08:01:35,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default6]:[2022-03-03 08:01:35,170] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default1]:[2022-03-03 08:01:35,157] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt
[default2]:[2022-03-03 08:01:35,147] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt
[default7]:[2022-03-03 08:01:35,164] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt
[default4]:[2022-03-03 08:01:35,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 08:01:35,181] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 08:01:35,118] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt
[default1]:[2022-03-03 08:01:35,163] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 08:01:35,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt
[default2]:[2022-03-03 08:01:35,213] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt
[default4]:[2022-03-03 08:01:35,170] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 08:01:35,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt
[default6]:[2022-03-03 08:01:35,150] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt
[default0]:[2022-03-03 08:01:35,195] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default3]:[2022-03-03 08:01:35,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt
[default6]:[2022-03-03 08:01:35,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt
[default6]:[2022-03-03 08:01:35,228] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt
[default4]:[2022-03-03 08:01:35,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default2]:[2022-03-03 08:01:35,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt
[default4]:[2022-03-03 08:01:35,271] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt
[default4]:[2022-03-03 08:01:35,205] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt
[default2]:[2022-03-03 08:01:35,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 08:01:35,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt
[default3]:[2022-03-03 08:01:35,355] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default5]:[2022-03-03 08:01:35,319] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 08:01:35,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt
[default0]:[2022-03-03 08:01:35,363] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt
[default2]:[2022-03-03 08:01:35,326] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt
[default0]:[2022-03-03 08:01:35,404] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt
[default6]:[2022-03-03 08:01:35,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt
[default4]:[2022-03-03 08:01:35,383] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt
[default1]:[2022-03-03 08:01:35,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default1]:[2022-03-03 08:01:35,412] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt
[default3]:[2022-03-03 08:01:35,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 08:01:35,460] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt
[default7]:[2022-03-03 08:01:35,541] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 08:01:35,607] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt
[default4]:[2022-03-03 08:01:35,606] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt
[default2]:[2022-03-03 08:01:35,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt
[default1]:[2022-03-03 08:01:35,592] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt
[default5]:[2022-03-03 08:01:35,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 08:01:35,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt
[default2]:[2022-03-03 08:01:35,619] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default2]:[2022-03-03 08:01:35,736] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt
[default4]:[2022-03-03 08:01:35,682] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 08:01:35,691] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt
[default1]:[2022-03-03 08:01:35,680] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 08:01:35,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default2]:[2022-03-03 08:01:35,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default7]:[2022-03-03 08:01:35,740] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt
[default7]:[2022-03-03 08:01:35,752] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt
[default6]:[2022-03-03 08:01:35,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt
[default0]:[2022-03-03 08:01:35,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt
[default7]:[2022-03-03 08:01:35,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt
[default7]:[2022-03-03 08:01:35,760] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt
[default5]:[2022-03-03 08:01:35,812] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt
[default5]:[2022-03-03 08:01:35,831] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt
[default4]:[2022-03-03 08:01:35,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 08:01:35,910] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default5]:[2022-03-03 08:01:35,898] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default5]:[2022-03-03 08:01:35,914] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 08:01:35,876] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 08:01:35,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default2]:[2022-03-03 08:01:35,903] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 08:01:35,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt
[default4]:[2022-03-03 08:01:35,976] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt
[default3]:[2022-03-03 08:01:35,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 08:01:35,928] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default7]:[2022-03-03 08:01:35,945] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 08:01:35,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default6]:[2022-03-03 08:01:35,924] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default5]:[2022-03-03 08:01:36,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default6]:[2022-03-03 08:01:36,064] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 08:01:36,159] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 08:01:36,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt
[default1]:[2022-03-03 08:01:36,163] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 08:01:36,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt
[default7]:[2022-03-03 08:01:36,340] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default4]:[2022-03-03 08:01:36,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default6]:[2022-03-03 08:01:36,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default4]:[2022-03-03 08:01:36,448] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt
[default3]:[2022-03-03 08:01:36,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 08:01:36,523] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 08:01:36,542] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 08:01:36,515] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt
[default3]:[2022-03-03 08:01:36,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default3]:[2022-03-03 08:01:36,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default2]:[2022-03-03 08:01:36,618] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt
[default3]:[2022-03-03 08:01:36,628] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default3]:[2022-03-03 08:01:36,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 08:01:36,706] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default5]:[2022-03-03 08:01:36,636] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 08:01:36,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default1]:[2022-03-03 08:01:36,742] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 08:01:36,729] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default0]:[2022-03-03 08:01:36,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt
[default0]:[2022-03-03 08:01:36,708] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default7]:[2022-03-03 08:01:36,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 08:01:36,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default2]:[2022-03-03 08:01:36,777] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default2]:[2022-03-03 08:01:36,804] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 08:01:36,744] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default4]:[2022-03-03 08:01:36,770] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default0]:[2022-03-03 08:01:36,806] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default2]:[2022-03-03 08:01:36,798] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default0]:[2022-03-03 08:01:36,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
[default4]:[2022-03-03 08:01:36,897] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default6]:[2022-03-03 08:01:36,916] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 08:01:36,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default7]:[2022-03-03 08:01:36,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default7]:[2022-03-03 08:01:36,922] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 08:01:36,976] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt
[default6]:[2022-03-03 08:01:37,017] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt
[default0]:[2022-03-03 08:01:36,971] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt
[default1]:[2022-03-03 08:01:36,952] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default2]:[2022-03-03 08:01:36,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt
[default2]:[2022-03-03 08:01:36,992] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt
[default5]:[2022-03-03 08:01:37,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt
[default1]:[2022-03-03 08:01:37,028] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default5]:[2022-03-03 08:01:37,069] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default7]:[2022-03-03 08:01:37,151] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt
[default6]:[2022-03-03 08:01:37,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default1]:[2022-03-03 08:01:37,282] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default6]:[2022-03-03 08:01:37,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt
[default2]:[2022-03-03 08:01:37,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default7]:[2022-03-03 08:01:37,394] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default0]:[2022-03-03 08:01:37,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default4]:[2022-03-03 08:01:37,577] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default2]:[2022-03-03 08:01:37,646] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default6]:[2022-03-03 08:01:37,627] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default5]:[2022-03-03 08:01:37,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default6]:[2022-03-03 08:01:37,796] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 08:01:37,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 08:01:37,804] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default4]:[2022-03-03 08:01:37,892] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default7]:[2022-03-03 08:01:37,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 08:01:38,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default3]:[2022-03-03 08:01:38,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default1]:[2022-03-03 08:01:38,175] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt
[default4]:[2022-03-03 08:01:38,177] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default5]:[2022-03-03 08:01:38,168] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 08:01:38,274] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt
[default6]:[2022-03-03 08:01:38,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt
[default0]:[2022-03-03 08:01:38,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt
[default1]:[2022-03-03 08:01:38,255] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default0]:[2022-03-03 08:01:38,318] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default7]:[2022-03-03 08:01:38,293] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default5]:[2022-03-03 08:01:38,290] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default7]:[2022-03-03 08:01:38,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default4]:[2022-03-03 08:01:38,362] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
[default6]:[2022-03-03 08:01:38,472] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default5]:[2022-03-03 08:01:38,587] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt
[default0]:[2022-03-03 08:01:38,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 08:01:38,604] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default5]:[2022-03-03 08:01:38,600] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default1]:[2022-03-03 08:01:38,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 08:01:38,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt
[default7]:[2022-03-03 08:01:38,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default4]:[2022-03-03 08:01:38,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 08:01:39,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 08:01:39,047] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt
[default6]:[2022-03-03 08:01:40,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 08:01:40,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt
[default4]:[2022-03-03 08:01:40,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default5]:[2022-03-03 08:01:41,456] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default2]:[2022-03-03 08:01:41,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 08:01:41,647] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default0]:  successfully saved checkpoint at iteration     500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:[2022-03-03 08:01:42,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default1]:[2022-03-03 08:01:42,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default7]:time (ms) | save-checkpoint: 67248.11
[default7]: iteration      501/  128728 | consumed samples:         8016 | consumed tokens:     16416768 | elapsed time per iteration (s): 82.50 | learning rate: 2.627E-06 | global batch size:    16 | lm loss: 8.488214E+00 | grad norm: 1.406 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.194 | TFLOPs: 1.48 |
[default7]: iteration      502/  128728 | consumed samples:         8032 | consumed tokens:     16449536 | elapsed time per iteration (s): 15.23 | learning rate: 2.632E-06 | global batch size:    16 | lm loss: 8.423536E+00 | grad norm: 2.178 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      503/  128728 | consumed samples:         8048 | consumed tokens:     16482304 | elapsed time per iteration (s): 15.22 | learning rate: 2.637E-06 | global batch size:    16 | lm loss: 8.185781E+00 | grad norm: 1.556 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      504/  128728 | consumed samples:         8064 | consumed tokens:     16515072 | elapsed time per iteration (s): 15.24 | learning rate: 2.642E-06 | global batch size:    16 | lm loss: 8.098662E+00 | grad norm: 2.003 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      505/  128728 | consumed samples:         8080 | consumed tokens:     16547840 | elapsed time per iteration (s): 15.26 | learning rate: 2.648E-06 | global batch size:    16 | lm loss: 8.311114E+00 | grad norm: 1.880 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      506/  128728 | consumed samples:         8096 | consumed tokens:     16580608 | elapsed time per iteration (s): 15.26 | learning rate: 2.653E-06 | global batch size:    16 | lm loss: 8.314884E+00 | grad norm: 1.417 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      507/  128728 | consumed samples:         8112 | consumed tokens:     16613376 | elapsed time per iteration (s): 15.21 | learning rate: 2.658E-06 | global batch size:    16 | lm loss: 8.274142E+00 | grad norm: 1.901 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      508/  128728 | consumed samples:         8128 | consumed tokens:     16646144 | elapsed time per iteration (s): 15.22 | learning rate: 2.663E-06 | global batch size:    16 | lm loss: 8.300067E+00 | grad norm: 1.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      509/  128728 | consumed samples:         8144 | consumed tokens:     16678912 | elapsed time per iteration (s): 15.24 | learning rate: 2.669E-06 | global batch size:    16 | lm loss: 8.125998E+00 | grad norm: 1.942 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      510/  128728 | consumed samples:         8160 | consumed tokens:     16711680 | elapsed time per iteration (s): 15.21 | learning rate: 2.674E-06 | global batch size:    16 | lm loss: 8.157375E+00 | grad norm: 1.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      511/  128728 | consumed samples:         8176 | consumed tokens:     16744448 | elapsed time per iteration (s): 15.26 | learning rate: 2.679E-06 | global batch size:    16 | lm loss: 8.114425E+00 | grad norm: 2.584 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      512/  128728 | consumed samples:         8192 | consumed tokens:     16777216 | elapsed time per iteration (s): 15.24 | learning rate: 2.684E-06 | global batch size:    16 | lm loss: 8.181797E+00 | grad norm: 1.527 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      513/  128728 | consumed samples:         8208 | consumed tokens:     16809984 | elapsed time per iteration (s): 15.19 | learning rate: 2.690E-06 | global batch size:    16 | lm loss: 8.276696E+00 | grad norm: 2.095 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      514/  128728 | consumed samples:         8224 | consumed tokens:     16842752 | elapsed time per iteration (s): 15.18 | learning rate: 2.695E-06 | global batch size:    16 | lm loss: 8.265854E+00 | grad norm: 2.106 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      515/  128728 | consumed samples:         8240 | consumed tokens:     16875520 | elapsed time per iteration (s): 15.18 | learning rate: 2.700E-06 | global batch size:    16 | lm loss: 8.100229E+00 | grad norm: 1.676 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      516/  128728 | consumed samples:         8256 | consumed tokens:     16908288 | elapsed time per iteration (s): 15.21 | learning rate: 2.705E-06 | global batch size:    16 | lm loss: 8.021216E+00 | grad norm: 1.580 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      517/  128728 | consumed samples:         8272 | consumed tokens:     16941056 | elapsed time per iteration (s): 15.20 | learning rate: 2.711E-06 | global batch size:    16 | lm loss: 8.086869E+00 | grad norm: 1.505 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      518/  128728 | consumed samples:         8288 | consumed tokens:     16973824 | elapsed time per iteration (s): 15.26 | learning rate: 2.716E-06 | global batch size:    16 | lm loss: 8.120964E+00 | grad norm: 2.182 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      519/  128728 | consumed samples:         8304 | consumed tokens:     17006592 | elapsed time per iteration (s): 15.25 | learning rate: 2.721E-06 | global batch size:    16 | lm loss: 8.232798E+00 | grad norm: 1.930 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      520/  128728 | consumed samples:         8320 | consumed tokens:     17039360 | elapsed time per iteration (s): 15.23 | learning rate: 2.726E-06 | global batch size:    16 | lm loss: 8.287365E+00 | grad norm: 1.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      521/  128728 | consumed samples:         8336 | consumed tokens:     17072128 | elapsed time per iteration (s): 15.23 | learning rate: 2.732E-06 | global batch size:    16 | lm loss: 8.058668E+00 | grad norm: 2.147 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      522/  128728 | consumed samples:         8352 | consumed tokens:     17104896 | elapsed time per iteration (s): 15.19 | learning rate: 2.737E-06 | global batch size:    16 | lm loss: 7.900158E+00 | grad norm: 2.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      523/  128728 | consumed samples:         8368 | consumed tokens:     17137664 | elapsed time per iteration (s): 15.24 | learning rate: 2.742E-06 | global batch size:    16 | lm loss: 8.412863E+00 | grad norm: 1.729 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      524/  128728 | consumed samples:         8384 | consumed tokens:     17170432 | elapsed time per iteration (s): 15.23 | learning rate: 2.747E-06 | global batch size:    16 | lm loss: 8.102924E+00 | grad norm: 2.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      525/  128728 | consumed samples:         8400 | consumed tokens:     17203200 | elapsed time per iteration (s): 15.24 | learning rate: 2.753E-06 | global batch size:    16 | lm loss: 7.951356E+00 | grad norm: 1.575 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      526/  128728 | consumed samples:         8416 | consumed tokens:     17235968 | elapsed time per iteration (s): 15.27 | learning rate: 2.758E-06 | global batch size:    16 | lm loss: 8.285418E+00 | grad norm: 2.349 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      527/  128728 | consumed samples:         8432 | consumed tokens:     17268736 | elapsed time per iteration (s): 15.21 | learning rate: 2.763E-06 | global batch size:    16 | lm loss: 8.269984E+00 | grad norm: 2.126 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      528/  128728 | consumed samples:         8448 | consumed tokens:     17301504 | elapsed time per iteration (s): 15.26 | learning rate: 2.768E-06 | global batch size:    16 | lm loss: 8.237260E+00 | grad norm: 2.085 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      529/  128728 | consumed samples:         8464 | consumed tokens:     17334272 | elapsed time per iteration (s): 15.25 | learning rate: 2.773E-06 | global batch size:    16 | lm loss: 8.148373E+00 | grad norm: 2.296 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      530/  128728 | consumed samples:         8480 | consumed tokens:     17367040 | elapsed time per iteration (s): 15.20 | learning rate: 2.779E-06 | global batch size:    16 | lm loss: 8.244123E+00 | grad norm: 1.654 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      531/  128728 | consumed samples:         8496 | consumed tokens:     17399808 | elapsed time per iteration (s): 15.16 | learning rate: 2.784E-06 | global batch size:    16 | lm loss: 8.061798E+00 | grad norm: 1.231 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration      532/  128728 | consumed samples:         8512 | consumed tokens:     17432576 | elapsed time per iteration (s): 15.19 | learning rate: 2.789E-06 | global batch size:    16 | lm loss: 8.042222E+00 | grad norm: 1.489 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      533/  128728 | consumed samples:         8528 | consumed tokens:     17465344 | elapsed time per iteration (s): 15.18 | learning rate: 2.794E-06 | global batch size:    16 | lm loss: 8.086902E+00 | grad norm: 1.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      534/  128728 | consumed samples:         8544 | consumed tokens:     17498112 | elapsed time per iteration (s): 15.23 | learning rate: 2.800E-06 | global batch size:    16 | lm loss: 8.083276E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      535/  128728 | consumed samples:         8560 | consumed tokens:     17530880 | elapsed time per iteration (s): 15.18 | learning rate: 2.805E-06 | global batch size:    16 | lm loss: 8.244881E+00 | grad norm: 1.885 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      536/  128728 | consumed samples:         8576 | consumed tokens:     17563648 | elapsed time per iteration (s): 15.18 | learning rate: 2.810E-06 | global batch size:    16 | lm loss: 8.199797E+00 | grad norm: 1.813 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      537/  128728 | consumed samples:         8592 | consumed tokens:     17596416 | elapsed time per iteration (s): 15.27 | learning rate: 2.815E-06 | global batch size:    16 | lm loss: 8.002762E+00 | grad norm: 1.044 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      538/  128728 | consumed samples:         8608 | consumed tokens:     17629184 | elapsed time per iteration (s): 15.28 | learning rate: 2.821E-06 | global batch size:    16 | lm loss: 8.290606E+00 | grad norm: 4.069 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration      539/  128728 | consumed samples:         8624 | consumed tokens:     17661952 | elapsed time per iteration (s): 15.26 | learning rate: 2.826E-06 | global batch size:    16 | lm loss: 7.995849E+00 | grad norm: 1.941 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      540/  128728 | consumed samples:         8640 | consumed tokens:     17694720 | elapsed time per iteration (s): 15.26 | learning rate: 2.831E-06 | global batch size:    16 | lm loss: 8.186256E+00 | grad norm: 1.338 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      541/  128728 | consumed samples:         8656 | consumed tokens:     17727488 | elapsed time per iteration (s): 15.18 | learning rate: 2.836E-06 | global batch size:    16 | lm loss: 8.296293E+00 | grad norm: 2.187 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      542/  128728 | consumed samples:         8672 | consumed tokens:     17760256 | elapsed time per iteration (s): 15.25 | learning rate: 2.842E-06 | global batch size:    16 | lm loss: 8.072968E+00 | grad norm: 2.974 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      543/  128728 | consumed samples:         8688 | consumed tokens:     17793024 | elapsed time per iteration (s): 15.20 | learning rate: 2.847E-06 | global batch size:    16 | lm loss: 8.082905E+00 | grad norm: 2.201 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      544/  128728 | consumed samples:         8704 | consumed tokens:     17825792 | elapsed time per iteration (s): 15.23 | learning rate: 2.852E-06 | global batch size:    16 | lm loss: 8.032642E+00 | grad norm: 2.560 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      545/  128728 | consumed samples:         8720 | consumed tokens:     17858560 | elapsed time per iteration (s): 15.22 | learning rate: 2.857E-06 | global batch size:    16 | lm loss: 8.391273E+00 | grad norm: 2.395 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      546/  128728 | consumed samples:         8736 | consumed tokens:     17891328 | elapsed time per iteration (s): 15.22 | learning rate: 2.863E-06 | global batch size:    16 | lm loss: 8.539359E+00 | grad norm: 4.169 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      547/  128728 | consumed samples:         8752 | consumed tokens:     17924096 | elapsed time per iteration (s): 15.23 | learning rate: 2.868E-06 | global batch size:    16 | lm loss: 8.038402E+00 | grad norm: 3.554 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      548/  128728 | consumed samples:         8768 | consumed tokens:     17956864 | elapsed time per iteration (s): 15.24 | learning rate: 2.873E-06 | global batch size:    16 | lm loss: 8.210316E+00 | grad norm: 1.668 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      549/  128728 | consumed samples:         8784 | consumed tokens:     17989632 | elapsed time per iteration (s): 15.27 | learning rate: 2.878E-06 | global batch size:    16 | lm loss: 8.174568E+00 | grad norm: 1.671 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      550/  128728 | consumed samples:         8800 | consumed tokens:     18022400 | elapsed time per iteration (s): 15.25 | learning rate: 2.884E-06 | global batch size:    16 | lm loss: 8.142506E+00 | grad norm: 2.204 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      551/  128728 | consumed samples:         8816 | consumed tokens:     18055168 | elapsed time per iteration (s): 15.23 | learning rate: 2.889E-06 | global batch size:    16 | lm loss: 8.139000E+00 | grad norm: 1.684 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      552/  128728 | consumed samples:         8832 | consumed tokens:     18087936 | elapsed time per iteration (s): 15.22 | learning rate: 2.894E-06 | global batch size:    16 | lm loss: 8.084474E+00 | grad norm: 1.485 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      553/  128728 | consumed samples:         8848 | consumed tokens:     18120704 | elapsed time per iteration (s): 15.27 | learning rate: 2.899E-06 | global batch size:    16 | lm loss: 8.098001E+00 | grad norm: 1.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      554/  128728 | consumed samples:         8864 | consumed tokens:     18153472 | elapsed time per iteration (s): 15.23 | learning rate: 2.905E-06 | global batch size:    16 | lm loss: 8.071024E+00 | grad norm: 1.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      555/  128728 | consumed samples:         8880 | consumed tokens:     18186240 | elapsed time per iteration (s): 15.23 | learning rate: 2.910E-06 | global batch size:    16 | lm loss: 8.011195E+00 | grad norm: 1.380 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      556/  128728 | consumed samples:         8896 | consumed tokens:     18219008 | elapsed time per iteration (s): 15.24 | learning rate: 2.915E-06 | global batch size:    16 | lm loss: 8.171795E+00 | grad norm: 1.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      557/  128728 | consumed samples:         8912 | consumed tokens:     18251776 | elapsed time per iteration (s): 15.24 | learning rate: 2.920E-06 | global batch size:    16 | lm loss: 8.022076E+00 | grad norm: 1.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      558/  128728 | consumed samples:         8928 | consumed tokens:     18284544 | elapsed time per iteration (s): 15.22 | learning rate: 2.926E-06 | global batch size:    16 | lm loss: 7.988214E+00 | grad norm: 1.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      559/  128728 | consumed samples:         8944 | consumed tokens:     18317312 | elapsed time per iteration (s): 15.23 | learning rate: 2.931E-06 | global batch size:    16 | lm loss: 7.990775E+00 | grad norm: 1.640 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      560/  128728 | consumed samples:         8960 | consumed tokens:     18350080 | elapsed time per iteration (s): 15.24 | learning rate: 2.936E-06 | global batch size:    16 | lm loss: 8.082418E+00 | grad norm: 1.921 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      561/  128728 | consumed samples:         8976 | consumed tokens:     18382848 | elapsed time per iteration (s): 15.24 | learning rate: 2.941E-06 | global batch size:    16 | lm loss: 8.083212E+00 | grad norm: 2.008 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      562/  128728 | consumed samples:         8992 | consumed tokens:     18415616 | elapsed time per iteration (s): 15.25 | learning rate: 2.947E-06 | global batch size:    16 | lm loss: 7.988510E+00 | grad norm: 1.481 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      563/  128728 | consumed samples:         9008 | consumed tokens:     18448384 | elapsed time per iteration (s): 15.26 | learning rate: 2.952E-06 | global batch size:    16 | lm loss: 8.018039E+00 | grad norm: 2.576 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      564/  128728 | consumed samples:         9024 | consumed tokens:     18481152 | elapsed time per iteration (s): 15.22 | learning rate: 2.957E-06 | global batch size:    16 | lm loss: 8.159368E+00 | grad norm: 1.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      565/  128728 | consumed samples:         9040 | consumed tokens:     18513920 | elapsed time per iteration (s): 15.24 | learning rate: 2.962E-06 | global batch size:    16 | lm loss: 8.076411E+00 | grad norm: 2.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      566/  128728 | consumed samples:         9056 | consumed tokens:     18546688 | elapsed time per iteration (s): 15.24 | learning rate: 2.967E-06 | global batch size:    16 | lm loss: 8.065808E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      567/  128728 | consumed samples:         9072 | consumed tokens:     18579456 | elapsed time per iteration (s): 15.24 | learning rate: 2.973E-06 | global batch size:    16 | lm loss: 8.268667E+00 | grad norm: 2.372 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      568/  128728 | consumed samples:         9088 | consumed tokens:     18612224 | elapsed time per iteration (s): 15.21 | learning rate: 2.978E-06 | global batch size:    16 | lm loss: 8.158611E+00 | grad norm: 2.448 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      569/  128728 | consumed samples:         9104 | consumed tokens:     18644992 | elapsed time per iteration (s): 15.19 | learning rate: 2.983E-06 | global batch size:    16 | lm loss: 8.178822E+00 | grad norm: 1.379 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration      570/  128728 | consumed samples:         9120 | consumed tokens:     18677760 | elapsed time per iteration (s): 15.23 | learning rate: 2.988E-06 | global batch size:    16 | lm loss: 8.175869E+00 | grad norm: 1.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      571/  128728 | consumed samples:         9136 | consumed tokens:     18710528 | elapsed time per iteration (s): 15.24 | learning rate: 2.994E-06 | global batch size:    16 | lm loss: 8.093798E+00 | grad norm: 1.806 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      572/  128728 | consumed samples:         9152 | consumed tokens:     18743296 | elapsed time per iteration (s): 15.22 | learning rate: 2.999E-06 | global batch size:    16 | lm loss: 8.181890E+00 | grad norm: 2.048 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      573/  128728 | consumed samples:         9168 | consumed tokens:     18776064 | elapsed time per iteration (s): 15.25 | learning rate: 3.004E-06 | global batch size:    16 | lm loss: 8.041045E+00 | grad norm: 2.266 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      574/  128728 | consumed samples:         9184 | consumed tokens:     18808832 | elapsed time per iteration (s): 15.23 | learning rate: 3.009E-06 | global batch size:    16 | lm loss: 8.138422E+00 | grad norm: 3.221 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      575/  128728 | consumed samples:         9200 | consumed tokens:     18841600 | elapsed time per iteration (s): 15.22 | learning rate: 3.015E-06 | global batch size:    16 | lm loss: 8.045207E+00 | grad norm: 2.058 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      576/  128728 | consumed samples:         9216 | consumed tokens:     18874368 | elapsed time per iteration (s): 15.22 | learning rate: 3.020E-06 | global batch size:    16 | lm loss: 7.972528E+00 | grad norm: 1.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      577/  128728 | consumed samples:         9232 | consumed tokens:     18907136 | elapsed time per iteration (s): 15.25 | learning rate: 3.025E-06 | global batch size:    16 | lm loss: 8.178508E+00 | grad norm: 1.988 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      578/  128728 | consumed samples:         9248 | consumed tokens:     18939904 | elapsed time per iteration (s): 15.24 | learning rate: 3.030E-06 | global batch size:    16 | lm loss: 7.980485E+00 | grad norm: 1.631 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      579/  128728 | consumed samples:         9264 | consumed tokens:     18972672 | elapsed time per iteration (s): 15.24 | learning rate: 3.036E-06 | global batch size:    16 | lm loss: 7.864195E+00 | grad norm: 1.538 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      580/  128728 | consumed samples:         9280 | consumed tokens:     19005440 | elapsed time per iteration (s): 15.25 | learning rate: 3.041E-06 | global batch size:    16 | lm loss: 8.087688E+00 | grad norm: 2.799 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      581/  128728 | consumed samples:         9296 | consumed tokens:     19038208 | elapsed time per iteration (s): 15.22 | learning rate: 3.046E-06 | global batch size:    16 | lm loss: 8.038260E+00 | grad norm: 1.989 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      582/  128728 | consumed samples:         9312 | consumed tokens:     19070976 | elapsed time per iteration (s): 15.25 | learning rate: 3.051E-06 | global batch size:    16 | lm loss: 7.954132E+00 | grad norm: 1.389 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      583/  128728 | consumed samples:         9328 | consumed tokens:     19103744 | elapsed time per iteration (s): 15.18 | learning rate: 3.057E-06 | global batch size:    16 | lm loss: 8.152493E+00 | grad norm: 2.484 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      584/  128728 | consumed samples:         9344 | consumed tokens:     19136512 | elapsed time per iteration (s): 15.19 | learning rate: 3.062E-06 | global batch size:    16 | lm loss: 8.236040E+00 | grad norm: 1.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      585/  128728 | consumed samples:         9360 | consumed tokens:     19169280 | elapsed time per iteration (s): 15.17 | learning rate: 3.067E-06 | global batch size:    16 | lm loss: 7.907086E+00 | grad norm: 2.350 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration      586/  128728 | consumed samples:         9376 | consumed tokens:     19202048 | elapsed time per iteration (s): 15.19 | learning rate: 3.072E-06 | global batch size:    16 | lm loss: 8.304672E+00 | grad norm: 1.986 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      587/  128728 | consumed samples:         9392 | consumed tokens:     19234816 | elapsed time per iteration (s): 15.24 | learning rate: 3.078E-06 | global batch size:    16 | lm loss: 8.053318E+00 | grad norm: 1.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      588/  128728 | consumed samples:         9408 | consumed tokens:     19267584 | elapsed time per iteration (s): 15.24 | learning rate: 3.083E-06 | global batch size:    16 | lm loss: 8.005896E+00 | grad norm: 1.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      589/  128728 | consumed samples:         9424 | consumed tokens:     19300352 | elapsed time per iteration (s): 15.26 | learning rate: 3.088E-06 | global batch size:    16 | lm loss: 7.824888E+00 | grad norm: 1.580 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      590/  128728 | consumed samples:         9440 | consumed tokens:     19333120 | elapsed time per iteration (s): 15.23 | learning rate: 3.093E-06 | global batch size:    16 | lm loss: 8.009818E+00 | grad norm: 1.547 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      591/  128728 | consumed samples:         9456 | consumed tokens:     19365888 | elapsed time per iteration (s): 15.22 | learning rate: 3.099E-06 | global batch size:    16 | lm loss: 7.998293E+00 | grad norm: 1.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      592/  128728 | consumed samples:         9472 | consumed tokens:     19398656 | elapsed time per iteration (s): 15.27 | learning rate: 3.104E-06 | global batch size:    16 | lm loss: 8.065947E+00 | grad norm: 1.816 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      593/  128728 | consumed samples:         9488 | consumed tokens:     19431424 | elapsed time per iteration (s): 15.27 | learning rate: 3.109E-06 | global batch size:    16 | lm loss: 7.924274E+00 | grad norm: 1.954 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      594/  128728 | consumed samples:         9504 | consumed tokens:     19464192 | elapsed time per iteration (s): 15.22 | learning rate: 3.114E-06 | global batch size:    16 | lm loss: 7.962350E+00 | grad norm: 1.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      595/  128728 | consumed samples:         9520 | consumed tokens:     19496960 | elapsed time per iteration (s): 15.23 | learning rate: 3.120E-06 | global batch size:    16 | lm loss: 8.032861E+00 | grad norm: 3.286 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      596/  128728 | consumed samples:         9536 | consumed tokens:     19529728 | elapsed time per iteration (s): 15.27 | learning rate: 3.125E-06 | global batch size:    16 | lm loss: 8.025990E+00 | grad norm: 3.459 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      597/  128728 | consumed samples:         9552 | consumed tokens:     19562496 | elapsed time per iteration (s): 15.21 | learning rate: 3.130E-06 | global batch size:    16 | lm loss: 8.163157E+00 | grad norm: 2.253 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      598/  128728 | consumed samples:         9568 | consumed tokens:     19595264 | elapsed time per iteration (s): 15.17 | learning rate: 3.135E-06 | global batch size:    16 | lm loss: 8.050694E+00 | grad norm: 1.486 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      599/  128728 | consumed samples:         9584 | consumed tokens:     19628032 | elapsed time per iteration (s): 15.20 | learning rate: 3.140E-06 | global batch size:    16 | lm loss: 8.062954E+00 | grad norm: 1.669 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      600/  128728 | consumed samples:         9600 | consumed tokens:     19660800 | elapsed time per iteration (s): 15.21 | learning rate: 3.146E-06 | global batch size:    16 | lm loss: 8.087465E+00 | grad norm: 1.461 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      601/  128728 | consumed samples:         9616 | consumed tokens:     19693568 | elapsed time per iteration (s): 15.18 | learning rate: 3.151E-06 | global batch size:    16 | lm loss: 8.023573E+00 | grad norm: 1.170 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      602/  128728 | consumed samples:         9632 | consumed tokens:     19726336 | elapsed time per iteration (s): 15.22 | learning rate: 3.156E-06 | global batch size:    16 | lm loss: 7.818781E+00 | grad norm: 1.654 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      603/  128728 | consumed samples:         9648 | consumed tokens:     19759104 | elapsed time per iteration (s): 15.22 | learning rate: 3.161E-06 | global batch size:    16 | lm loss: 8.015720E+00 | grad norm: 1.395 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      604/  128728 | consumed samples:         9664 | consumed tokens:     19791872 | elapsed time per iteration (s): 15.25 | learning rate: 3.167E-06 | global batch size:    16 | lm loss: 8.103092E+00 | grad norm: 2.917 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      605/  128728 | consumed samples:         9680 | consumed tokens:     19824640 | elapsed time per iteration (s): 15.24 | learning rate: 3.172E-06 | global batch size:    16 | lm loss: 7.763780E+00 | grad norm: 2.339 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      606/  128728 | consumed samples:         9696 | consumed tokens:     19857408 | elapsed time per iteration (s): 15.24 | learning rate: 3.177E-06 | global batch size:    16 | lm loss: 7.950778E+00 | grad norm: 1.569 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      607/  128728 | consumed samples:         9712 | consumed tokens:     19890176 | elapsed time per iteration (s): 15.26 | learning rate: 3.182E-06 | global batch size:    16 | lm loss: 7.931053E+00 | grad norm: 2.130 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      608/  128728 | consumed samples:         9728 | consumed tokens:     19922944 | elapsed time per iteration (s): 15.23 | learning rate: 3.188E-06 | global batch size:    16 | lm loss: 8.145065E+00 | grad norm: 1.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      609/  128728 | consumed samples:         9744 | consumed tokens:     19955712 | elapsed time per iteration (s): 15.26 | learning rate: 3.193E-06 | global batch size:    16 | lm loss: 7.848729E+00 | grad norm: 1.252 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      610/  128728 | consumed samples:         9760 | consumed tokens:     19988480 | elapsed time per iteration (s): 15.25 | learning rate: 3.198E-06 | global batch size:    16 | lm loss: 8.195064E+00 | grad norm: 2.007 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      611/  128728 | consumed samples:         9776 | consumed tokens:     20021248 | elapsed time per iteration (s): 15.17 | learning rate: 3.203E-06 | global batch size:    16 | lm loss: 8.242138E+00 | grad norm: 1.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      612/  128728 | consumed samples:         9792 | consumed tokens:     20054016 | elapsed time per iteration (s): 15.23 | learning rate: 3.209E-06 | global batch size:    16 | lm loss: 8.117524E+00 | grad norm: 1.384 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      613/  128728 | consumed samples:         9808 | consumed tokens:     20086784 | elapsed time per iteration (s): 15.25 | learning rate: 3.214E-06 | global batch size:    16 | lm loss: 7.918928E+00 | grad norm: 1.371 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      614/  128728 | consumed samples:         9824 | consumed tokens:     20119552 | elapsed time per iteration (s): 15.26 | learning rate: 3.219E-06 | global batch size:    16 | lm loss: 8.051600E+00 | grad norm: 2.007 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      615/  128728 | consumed samples:         9840 | consumed tokens:     20152320 | elapsed time per iteration (s): 15.20 | learning rate: 3.224E-06 | global batch size:    16 | lm loss: 7.935547E+00 | grad norm: 1.270 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      616/  128728 | consumed samples:         9856 | consumed tokens:     20185088 | elapsed time per iteration (s): 15.24 | learning rate: 3.230E-06 | global batch size:    16 | lm loss: 8.028803E+00 | grad norm: 1.547 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      617/  128728 | consumed samples:         9872 | consumed tokens:     20217856 | elapsed time per iteration (s): 15.22 | learning rate: 3.235E-06 | global batch size:    16 | lm loss: 7.955178E+00 | grad norm: 1.555 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      618/  128728 | consumed samples:         9888 | consumed tokens:     20250624 | elapsed time per iteration (s): 15.24 | learning rate: 3.240E-06 | global batch size:    16 | lm loss: 7.941856E+00 | grad norm: 2.057 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      619/  128728 | consumed samples:         9904 | consumed tokens:     20283392 | elapsed time per iteration (s): 15.21 | learning rate: 3.245E-06 | global batch size:    16 | lm loss: 8.104746E+00 | grad norm: 1.580 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      620/  128728 | consumed samples:         9920 | consumed tokens:     20316160 | elapsed time per iteration (s): 15.18 | learning rate: 3.251E-06 | global batch size:    16 | lm loss: 7.847284E+00 | grad norm: 1.397 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      621/  128728 | consumed samples:         9936 | consumed tokens:     20348928 | elapsed time per iteration (s): 15.23 | learning rate: 3.256E-06 | global batch size:    16 | lm loss: 8.035351E+00 | grad norm: 1.532 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      622/  128728 | consumed samples:         9952 | consumed tokens:     20381696 | elapsed time per iteration (s): 15.29 | learning rate: 3.261E-06 | global batch size:    16 | lm loss: 7.982013E+00 | grad norm: 1.415 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration      623/  128728 | consumed samples:         9968 | consumed tokens:     20414464 | elapsed time per iteration (s): 15.17 | learning rate: 3.266E-06 | global batch size:    16 | lm loss: 7.936229E+00 | grad norm: 2.187 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration      624/  128728 | consumed samples:         9984 | consumed tokens:     20447232 | elapsed time per iteration (s): 15.24 | learning rate: 3.272E-06 | global batch size:    16 | lm loss: 7.954924E+00 | grad norm: 1.624 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      625/  128728 | consumed samples:        10000 | consumed tokens:     20480000 | elapsed time per iteration (s): 15.25 | learning rate: 3.277E-06 | global batch size:    16 | lm loss: 7.793859E+00 | grad norm: 1.168 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      626/  128728 | consumed samples:        10016 | consumed tokens:     20512768 | elapsed time per iteration (s): 15.26 | learning rate: 3.282E-06 | global batch size:    16 | lm loss: 8.114607E+00 | grad norm: 1.913 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      627/  128728 | consumed samples:        10032 | consumed tokens:     20545536 | elapsed time per iteration (s): 15.24 | learning rate: 3.287E-06 | global batch size:    16 | lm loss: 7.914503E+00 | grad norm: 1.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      628/  128728 | consumed samples:        10048 | consumed tokens:     20578304 | elapsed time per iteration (s): 15.27 | learning rate: 3.293E-06 | global batch size:    16 | lm loss: 8.035368E+00 | grad norm: 1.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      629/  128728 | consumed samples:        10064 | consumed tokens:     20611072 | elapsed time per iteration (s): 15.17 | learning rate: 3.298E-06 | global batch size:    16 | lm loss: 7.947924E+00 | grad norm: 1.470 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      630/  128728 | consumed samples:        10080 | consumed tokens:     20643840 | elapsed time per iteration (s): 15.23 | learning rate: 3.303E-06 | global batch size:    16 | lm loss: 7.966818E+00 | grad norm: 1.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      631/  128728 | consumed samples:        10096 | consumed tokens:     20676608 | elapsed time per iteration (s): 15.24 | learning rate: 3.308E-06 | global batch size:    16 | lm loss: 7.870564E+00 | grad norm: 1.144 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      632/  128728 | consumed samples:        10112 | consumed tokens:     20709376 | elapsed time per iteration (s): 15.22 | learning rate: 3.314E-06 | global batch size:    16 | lm loss: 7.987050E+00 | grad norm: 1.295 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      633/  128728 | consumed samples:        10128 | consumed tokens:     20742144 | elapsed time per iteration (s): 15.18 | learning rate: 3.319E-06 | global batch size:    16 | lm loss: 7.923104E+00 | grad norm: 1.387 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      634/  128728 | consumed samples:        10144 | consumed tokens:     20774912 | elapsed time per iteration (s): 15.22 | learning rate: 3.324E-06 | global batch size:    16 | lm loss: 7.981370E+00 | grad norm: 3.554 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      635/  128728 | consumed samples:        10160 | consumed tokens:     20807680 | elapsed time per iteration (s): 15.22 | learning rate: 3.329E-06 | global batch size:    16 | lm loss: 7.451349E+00 | grad norm: 2.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      636/  128728 | consumed samples:        10176 | consumed tokens:     20840448 | elapsed time per iteration (s): 15.28 | learning rate: 3.334E-06 | global batch size:    16 | lm loss: 7.894579E+00 | grad norm: 1.334 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration      637/  128728 | consumed samples:        10192 | consumed tokens:     20873216 | elapsed time per iteration (s): 15.24 | learning rate: 3.340E-06 | global batch size:    16 | lm loss: 7.836030E+00 | grad norm: 1.562 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      638/  128728 | consumed samples:        10208 | consumed tokens:     20905984 | elapsed time per iteration (s): 15.17 | learning rate: 3.345E-06 | global batch size:    16 | lm loss: 7.979243E+00 | grad norm: 1.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      639/  128728 | consumed samples:        10224 | consumed tokens:     20938752 | elapsed time per iteration (s): 15.21 | learning rate: 3.350E-06 | global batch size:    16 | lm loss: 7.720065E+00 | grad norm: 1.240 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      640/  128728 | consumed samples:        10240 | consumed tokens:     20971520 | elapsed time per iteration (s): 15.24 | learning rate: 3.355E-06 | global batch size:    16 | lm loss: 7.699399E+00 | grad norm: 1.935 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      641/  128728 | consumed samples:        10256 | consumed tokens:     21004288 | elapsed time per iteration (s): 15.22 | learning rate: 3.361E-06 | global batch size:    16 | lm loss: 8.025188E+00 | grad norm: 2.391 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      642/  128728 | consumed samples:        10272 | consumed tokens:     21037056 | elapsed time per iteration (s): 15.26 | learning rate: 3.366E-06 | global batch size:    16 | lm loss: 7.736159E+00 | grad norm: 1.592 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      643/  128728 | consumed samples:        10288 | consumed tokens:     21069824 | elapsed time per iteration (s): 15.22 | learning rate: 3.371E-06 | global batch size:    16 | lm loss: 7.719475E+00 | grad norm: 2.159 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      644/  128728 | consumed samples:        10304 | consumed tokens:     21102592 | elapsed time per iteration (s): 15.25 | learning rate: 3.376E-06 | global batch size:    16 | lm loss: 7.865746E+00 | grad norm: 1.433 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      645/  128728 | consumed samples:        10320 | consumed tokens:     21135360 | elapsed time per iteration (s): 15.24 | learning rate: 3.382E-06 | global batch size:    16 | lm loss: 8.016085E+00 | grad norm: 2.456 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      646/  128728 | consumed samples:        10336 | consumed tokens:     21168128 | elapsed time per iteration (s): 15.21 | learning rate: 3.387E-06 | global batch size:    16 | lm loss: 7.879150E+00 | grad norm: 1.573 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      647/  128728 | consumed samples:        10352 | consumed tokens:     21200896 | elapsed time per iteration (s): 15.27 | learning rate: 3.392E-06 | global batch size:    16 | lm loss: 7.871262E+00 | grad norm: 2.059 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      648/  128728 | consumed samples:        10368 | consumed tokens:     21233664 | elapsed time per iteration (s): 15.24 | learning rate: 3.397E-06 | global batch size:    16 | lm loss: 8.009554E+00 | grad norm: 1.466 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      649/  128728 | consumed samples:        10384 | consumed tokens:     21266432 | elapsed time per iteration (s): 15.29 | learning rate: 3.403E-06 | global batch size:    16 | lm loss: 7.901595E+00 | grad norm: 1.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration      650/  128728 | consumed samples:        10400 | consumed tokens:     21299200 | elapsed time per iteration (s): 15.24 | learning rate: 3.408E-06 | global batch size:    16 | lm loss: 7.781230E+00 | grad norm: 1.553 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      651/  128728 | consumed samples:        10416 | consumed tokens:     21331968 | elapsed time per iteration (s): 15.24 | learning rate: 3.413E-06 | global batch size:    16 | lm loss: 7.900571E+00 | grad norm: 1.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      652/  128728 | consumed samples:        10432 | consumed tokens:     21364736 | elapsed time per iteration (s): 15.27 | learning rate: 3.418E-06 | global batch size:    16 | lm loss: 8.001532E+00 | grad norm: 1.577 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      653/  128728 | consumed samples:        10448 | consumed tokens:     21397504 | elapsed time per iteration (s): 15.22 | learning rate: 3.424E-06 | global batch size:    16 | lm loss: 7.724453E+00 | grad norm: 1.467 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      654/  128728 | consumed samples:        10464 | consumed tokens:     21430272 | elapsed time per iteration (s): 15.22 | learning rate: 3.429E-06 | global batch size:    16 | lm loss: 7.786034E+00 | grad norm: 2.194 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      655/  128728 | consumed samples:        10480 | consumed tokens:     21463040 | elapsed time per iteration (s): 15.21 | learning rate: 3.434E-06 | global batch size:    16 | lm loss: 8.125753E+00 | grad norm: 2.500 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      656/  128728 | consumed samples:        10496 | consumed tokens:     21495808 | elapsed time per iteration (s): 15.18 | learning rate: 3.439E-06 | global batch size:    16 | lm loss: 7.898974E+00 | grad norm: 1.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      657/  128728 | consumed samples:        10512 | consumed tokens:     21528576 | elapsed time per iteration (s): 15.25 | learning rate: 3.445E-06 | global batch size:    16 | lm loss: 7.543049E+00 | grad norm: 2.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      658/  128728 | consumed samples:        10528 | consumed tokens:     21561344 | elapsed time per iteration (s): 15.24 | learning rate: 3.450E-06 | global batch size:    16 | lm loss: 7.893567E+00 | grad norm: 1.285 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      659/  128728 | consumed samples:        10544 | consumed tokens:     21594112 | elapsed time per iteration (s): 15.24 | learning rate: 3.455E-06 | global batch size:    16 | lm loss: 7.778220E+00 | grad norm: 1.663 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      660/  128728 | consumed samples:        10560 | consumed tokens:     21626880 | elapsed time per iteration (s): 15.24 | learning rate: 3.460E-06 | global batch size:    16 | lm loss: 7.709709E+00 | grad norm: 1.215 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      661/  128728 | consumed samples:        10576 | consumed tokens:     21659648 | elapsed time per iteration (s): 15.26 | learning rate: 3.466E-06 | global batch size:    16 | lm loss: 7.854992E+00 | grad norm: 1.256 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      662/  128728 | consumed samples:        10592 | consumed tokens:     21692416 | elapsed time per iteration (s): 15.23 | learning rate: 3.471E-06 | global batch size:    16 | lm loss: 7.712576E+00 | grad norm: 1.222 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      663/  128728 | consumed samples:        10608 | consumed tokens:     21725184 | elapsed time per iteration (s): 15.24 | learning rate: 3.476E-06 | global batch size:    16 | lm loss: 7.989018E+00 | grad norm: 1.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      664/  128728 | consumed samples:        10624 | consumed tokens:     21757952 | elapsed time per iteration (s): 15.21 | learning rate: 3.481E-06 | global batch size:    16 | lm loss: 7.714592E+00 | grad norm: 2.321 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      665/  128728 | consumed samples:        10640 | consumed tokens:     21790720 | elapsed time per iteration (s): 15.24 | learning rate: 3.487E-06 | global batch size:    16 | lm loss: 7.794542E+00 | grad norm: 1.438 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      666/  128728 | consumed samples:        10656 | consumed tokens:     21823488 | elapsed time per iteration (s): 15.26 | learning rate: 3.492E-06 | global batch size:    16 | lm loss: 7.864296E+00 | grad norm: 1.936 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      667/  128728 | consumed samples:        10672 | consumed tokens:     21856256 | elapsed time per iteration (s): 15.27 | learning rate: 3.497E-06 | global batch size:    16 | lm loss: 7.764245E+00 | grad norm: 1.307 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      668/  128728 | consumed samples:        10688 | consumed tokens:     21889024 | elapsed time per iteration (s): 15.25 | learning rate: 3.502E-06 | global batch size:    16 | lm loss: 7.735165E+00 | grad norm: 1.385 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      669/  128728 | consumed samples:        10704 | consumed tokens:     21921792 | elapsed time per iteration (s): 15.26 | learning rate: 3.507E-06 | global batch size:    16 | lm loss: 8.010805E+00 | grad norm: 1.539 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      670/  128728 | consumed samples:        10720 | consumed tokens:     21954560 | elapsed time per iteration (s): 15.21 | learning rate: 3.513E-06 | global batch size:    16 | lm loss: 8.047853E+00 | grad norm: 1.605 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      671/  128728 | consumed samples:        10736 | consumed tokens:     21987328 | elapsed time per iteration (s): 15.17 | learning rate: 3.518E-06 | global batch size:    16 | lm loss: 7.816953E+00 | grad norm: 1.629 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      672/  128728 | consumed samples:        10752 | consumed tokens:     22020096 | elapsed time per iteration (s): 15.23 | learning rate: 3.523E-06 | global batch size:    16 | lm loss: 7.711550E+00 | grad norm: 1.454 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      673/  128728 | consumed samples:        10768 | consumed tokens:     22052864 | elapsed time per iteration (s): 15.25 | learning rate: 3.528E-06 | global batch size:    16 | lm loss: 7.875598E+00 | grad norm: 1.664 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      674/  128728 | consumed samples:        10784 | consumed tokens:     22085632 | elapsed time per iteration (s): 15.26 | learning rate: 3.534E-06 | global batch size:    16 | lm loss: 8.126424E+00 | grad norm: 1.873 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      675/  128728 | consumed samples:        10800 | consumed tokens:     22118400 | elapsed time per iteration (s): 15.22 | learning rate: 3.539E-06 | global batch size:    16 | lm loss: 7.853518E+00 | grad norm: 1.294 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      676/  128728 | consumed samples:        10816 | consumed tokens:     22151168 | elapsed time per iteration (s): 15.27 | learning rate: 3.544E-06 | global batch size:    16 | lm loss: 7.600890E+00 | grad norm: 2.141 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      677/  128728 | consumed samples:        10832 | consumed tokens:     22183936 | elapsed time per iteration (s): 15.18 | learning rate: 3.549E-06 | global batch size:    16 | lm loss: 8.051760E+00 | grad norm: 1.994 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      678/  128728 | consumed samples:        10848 | consumed tokens:     22216704 | elapsed time per iteration (s): 15.24 | learning rate: 3.555E-06 | global batch size:    16 | lm loss: 7.808934E+00 | grad norm: 1.232 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      679/  128728 | consumed samples:        10864 | consumed tokens:     22249472 | elapsed time per iteration (s): 15.22 | learning rate: 3.560E-06 | global batch size:    16 | lm loss: 7.776139E+00 | grad norm: 1.302 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      680/  128728 | consumed samples:        10880 | consumed tokens:     22282240 | elapsed time per iteration (s): 15.20 | learning rate: 3.565E-06 | global batch size:    16 | lm loss: 7.821063E+00 | grad norm: 1.292 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      681/  128728 | consumed samples:        10896 | consumed tokens:     22315008 | elapsed time per iteration (s): 15.26 | learning rate: 3.570E-06 | global batch size:    16 | lm loss: 8.087934E+00 | grad norm: 2.480 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      682/  128728 | consumed samples:        10912 | consumed tokens:     22347776 | elapsed time per iteration (s): 15.21 | learning rate: 3.576E-06 | global batch size:    16 | lm loss: 7.812382E+00 | grad norm: 1.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      683/  128728 | consumed samples:        10928 | consumed tokens:     22380544 | elapsed time per iteration (s): 15.23 | learning rate: 3.581E-06 | global batch size:    16 | lm loss: 7.646718E+00 | grad norm: 1.346 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      684/  128728 | consumed samples:        10944 | consumed tokens:     22413312 | elapsed time per iteration (s): 15.21 | learning rate: 3.586E-06 | global batch size:    16 | lm loss: 7.798770E+00 | grad norm: 1.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      685/  128728 | consumed samples:        10960 | consumed tokens:     22446080 | elapsed time per iteration (s): 15.26 | learning rate: 3.591E-06 | global batch size:    16 | lm loss: 7.723039E+00 | grad norm: 1.094 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      686/  128728 | consumed samples:        10976 | consumed tokens:     22478848 | elapsed time per iteration (s): 15.22 | learning rate: 3.597E-06 | global batch size:    16 | lm loss: 7.886545E+00 | grad norm: 1.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      687/  128728 | consumed samples:        10992 | consumed tokens:     22511616 | elapsed time per iteration (s): 15.24 | learning rate: 3.602E-06 | global batch size:    16 | lm loss: 7.877650E+00 | grad norm: 1.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      688/  128728 | consumed samples:        11008 | consumed tokens:     22544384 | elapsed time per iteration (s): 15.24 | learning rate: 3.607E-06 | global batch size:    16 | lm loss: 7.980981E+00 | grad norm: 1.588 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      689/  128728 | consumed samples:        11024 | consumed tokens:     22577152 | elapsed time per iteration (s): 15.27 | learning rate: 3.612E-06 | global batch size:    16 | lm loss: 7.830743E+00 | grad norm: 1.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      690/  128728 | consumed samples:        11040 | consumed tokens:     22609920 | elapsed time per iteration (s): 15.27 | learning rate: 3.618E-06 | global batch size:    16 | lm loss: 7.749845E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      691/  128728 | consumed samples:        11056 | consumed tokens:     22642688 | elapsed time per iteration (s): 15.22 | learning rate: 3.623E-06 | global batch size:    16 | lm loss: 7.584512E+00 | grad norm: 1.227 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      692/  128728 | consumed samples:        11072 | consumed tokens:     22675456 | elapsed time per iteration (s): 15.19 | learning rate: 3.628E-06 | global batch size:    16 | lm loss: 7.821875E+00 | grad norm: 1.403 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      693/  128728 | consumed samples:        11088 | consumed tokens:     22708224 | elapsed time per iteration (s): 15.21 | learning rate: 3.633E-06 | global batch size:    16 | lm loss: 7.693274E+00 | grad norm: 1.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      694/  128728 | consumed samples:        11104 | consumed tokens:     22740992 | elapsed time per iteration (s): 15.25 | learning rate: 3.639E-06 | global batch size:    16 | lm loss: 7.663749E+00 | grad norm: 1.613 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      695/  128728 | consumed samples:        11120 | consumed tokens:     22773760 | elapsed time per iteration (s): 15.20 | learning rate: 3.644E-06 | global batch size:    16 | lm loss: 7.937167E+00 | grad norm: 2.037 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      696/  128728 | consumed samples:        11136 | consumed tokens:     22806528 | elapsed time per iteration (s): 15.23 | learning rate: 3.649E-06 | global batch size:    16 | lm loss: 7.848682E+00 | grad norm: 2.415 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      697/  128728 | consumed samples:        11152 | consumed tokens:     22839296 | elapsed time per iteration (s): 15.22 | learning rate: 3.654E-06 | global batch size:    16 | lm loss: 7.809802E+00 | grad norm: 1.698 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      698/  128728 | consumed samples:        11168 | consumed tokens:     22872064 | elapsed time per iteration (s): 15.22 | learning rate: 3.660E-06 | global batch size:    16 | lm loss: 7.940400E+00 | grad norm: 2.311 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      699/  128728 | consumed samples:        11184 | consumed tokens:     22904832 | elapsed time per iteration (s): 15.19 | learning rate: 3.665E-06 | global batch size:    16 | lm loss: 7.481762E+00 | grad norm: 1.795 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration      700/  128728 | consumed samples:        11200 | consumed tokens:     22937600 | elapsed time per iteration (s): 15.24 | learning rate: 3.670E-06 | global batch size:    16 | lm loss: 7.774322E+00 | grad norm: 1.613 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      701/  128728 | consumed samples:        11216 | consumed tokens:     22970368 | elapsed time per iteration (s): 15.21 | learning rate: 3.675E-06 | global batch size:    16 | lm loss: 7.873240E+00 | grad norm: 1.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      702/  128728 | consumed samples:        11232 | consumed tokens:     23003136 | elapsed time per iteration (s): 15.22 | learning rate: 3.681E-06 | global batch size:    16 | lm loss: 7.976169E+00 | grad norm: 2.095 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      703/  128728 | consumed samples:        11248 | consumed tokens:     23035904 | elapsed time per iteration (s): 15.24 | learning rate: 3.686E-06 | global batch size:    16 | lm loss: 7.603686E+00 | grad norm: 1.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      704/  128728 | consumed samples:        11264 | consumed tokens:     23068672 | elapsed time per iteration (s): 15.20 | learning rate: 3.691E-06 | global batch size:    16 | lm loss: 7.877244E+00 | grad norm: 1.471 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      705/  128728 | consumed samples:        11280 | consumed tokens:     23101440 | elapsed time per iteration (s): 15.23 | learning rate: 3.696E-06 | global batch size:    16 | lm loss: 7.863098E+00 | grad norm: 2.257 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      706/  128728 | consumed samples:        11296 | consumed tokens:     23134208 | elapsed time per iteration (s): 15.26 | learning rate: 3.701E-06 | global batch size:    16 | lm loss: 7.755744E+00 | grad norm: 1.835 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      707/  128728 | consumed samples:        11312 | consumed tokens:     23166976 | elapsed time per iteration (s): 15.19 | learning rate: 3.707E-06 | global batch size:    16 | lm loss: 7.782665E+00 | grad norm: 1.655 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      708/  128728 | consumed samples:        11328 | consumed tokens:     23199744 | elapsed time per iteration (s): 15.22 | learning rate: 3.712E-06 | global batch size:    16 | lm loss: 7.712109E+00 | grad norm: 2.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      709/  128728 | consumed samples:        11344 | consumed tokens:     23232512 | elapsed time per iteration (s): 15.23 | learning rate: 3.717E-06 | global batch size:    16 | lm loss: 7.964561E+00 | grad norm: 1.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      710/  128728 | consumed samples:        11360 | consumed tokens:     23265280 | elapsed time per iteration (s): 15.23 | learning rate: 3.722E-06 | global batch size:    16 | lm loss: 7.780368E+00 | grad norm: 1.226 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      711/  128728 | consumed samples:        11376 | consumed tokens:     23298048 | elapsed time per iteration (s): 15.25 | learning rate: 3.728E-06 | global batch size:    16 | lm loss: 7.641358E+00 | grad norm: 1.478 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      712/  128728 | consumed samples:        11392 | consumed tokens:     23330816 | elapsed time per iteration (s): 15.22 | learning rate: 3.733E-06 | global batch size:    16 | lm loss: 7.781178E+00 | grad norm: 1.565 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      713/  128728 | consumed samples:        11408 | consumed tokens:     23363584 | elapsed time per iteration (s): 15.22 | learning rate: 3.738E-06 | global batch size:    16 | lm loss: 7.840047E+00 | grad norm: 1.440 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      714/  128728 | consumed samples:        11424 | consumed tokens:     23396352 | elapsed time per iteration (s): 15.21 | learning rate: 3.743E-06 | global batch size:    16 | lm loss: 8.027559E+00 | grad norm: 2.249 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      715/  128728 | consumed samples:        11440 | consumed tokens:     23429120 | elapsed time per iteration (s): 15.27 | learning rate: 3.749E-06 | global batch size:    16 | lm loss: 7.754458E+00 | grad norm: 1.347 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      716/  128728 | consumed samples:        11456 | consumed tokens:     23461888 | elapsed time per iteration (s): 15.22 | learning rate: 3.754E-06 | global batch size:    16 | lm loss: 7.990946E+00 | grad norm: 2.277 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      717/  128728 | consumed samples:        11472 | consumed tokens:     23494656 | elapsed time per iteration (s): 15.21 | learning rate: 3.759E-06 | global batch size:    16 | lm loss: 7.646182E+00 | grad norm: 1.425 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      718/  128728 | consumed samples:        11488 | consumed tokens:     23527424 | elapsed time per iteration (s): 15.21 | learning rate: 3.764E-06 | global batch size:    16 | lm loss: 7.488737E+00 | grad norm: 1.417 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      719/  128728 | consumed samples:        11504 | consumed tokens:     23560192 | elapsed time per iteration (s): 15.24 | learning rate: 3.770E-06 | global batch size:    16 | lm loss: 8.027329E+00 | grad norm: 2.163 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      720/  128728 | consumed samples:        11520 | consumed tokens:     23592960 | elapsed time per iteration (s): 15.22 | learning rate: 3.775E-06 | global batch size:    16 | lm loss: 7.625739E+00 | grad norm: 1.588 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      721/  128728 | consumed samples:        11536 | consumed tokens:     23625728 | elapsed time per iteration (s): 15.21 | learning rate: 3.780E-06 | global batch size:    16 | lm loss: 7.805804E+00 | grad norm: 1.370 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      722/  128728 | consumed samples:        11552 | consumed tokens:     23658496 | elapsed time per iteration (s): 15.21 | learning rate: 3.785E-06 | global batch size:    16 | lm loss: 7.652366E+00 | grad norm: 2.256 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      723/  128728 | consumed samples:        11568 | consumed tokens:     23691264 | elapsed time per iteration (s): 15.22 | learning rate: 3.791E-06 | global batch size:    16 | lm loss: 7.692093E+00 | grad norm: 2.114 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      724/  128728 | consumed samples:        11584 | consumed tokens:     23724032 | elapsed time per iteration (s): 15.19 | learning rate: 3.796E-06 | global batch size:    16 | lm loss: 8.017685E+00 | grad norm: 1.663 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      725/  128728 | consumed samples:        11600 | consumed tokens:     23756800 | elapsed time per iteration (s): 15.22 | learning rate: 3.801E-06 | global batch size:    16 | lm loss: 7.816594E+00 | grad norm: 2.280 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      726/  128728 | consumed samples:        11616 | consumed tokens:     23789568 | elapsed time per iteration (s): 15.27 | learning rate: 3.806E-06 | global batch size:    16 | lm loss: 7.751608E+00 | grad norm: 1.395 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      727/  128728 | consumed samples:        11632 | consumed tokens:     23822336 | elapsed time per iteration (s): 15.22 | learning rate: 3.812E-06 | global batch size:    16 | lm loss: 7.849316E+00 | grad norm: 2.128 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      728/  128728 | consumed samples:        11648 | consumed tokens:     23855104 | elapsed time per iteration (s): 15.24 | learning rate: 3.817E-06 | global batch size:    16 | lm loss: 7.822732E+00 | grad norm: 2.368 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      729/  128728 | consumed samples:        11664 | consumed tokens:     23887872 | elapsed time per iteration (s): 15.25 | learning rate: 3.822E-06 | global batch size:    16 | lm loss: 7.754844E+00 | grad norm: 1.827 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      730/  128728 | consumed samples:        11680 | consumed tokens:     23920640 | elapsed time per iteration (s): 15.21 | learning rate: 3.827E-06 | global batch size:    16 | lm loss: 7.941967E+00 | grad norm: 2.185 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      731/  128728 | consumed samples:        11696 | consumed tokens:     23953408 | elapsed time per iteration (s): 15.24 | learning rate: 3.833E-06 | global batch size:    16 | lm loss: 7.668742E+00 | grad norm: 2.175 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      732/  128728 | consumed samples:        11712 | consumed tokens:     23986176 | elapsed time per iteration (s): 15.21 | learning rate: 3.838E-06 | global batch size:    16 | lm loss: 7.611368E+00 | grad norm: 1.977 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      733/  128728 | consumed samples:        11728 | consumed tokens:     24018944 | elapsed time per iteration (s): 15.23 | learning rate: 3.843E-06 | global batch size:    16 | lm loss: 7.738802E+00 | grad norm: 1.438 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      734/  128728 | consumed samples:        11744 | consumed tokens:     24051712 | elapsed time per iteration (s): 15.26 | learning rate: 3.848E-06 | global batch size:    16 | lm loss: 8.192348E+00 | grad norm: 2.927 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      735/  128728 | consumed samples:        11760 | consumed tokens:     24084480 | elapsed time per iteration (s): 15.25 | learning rate: 3.854E-06 | global batch size:    16 | lm loss: 7.825410E+00 | grad norm: 1.279 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      736/  128728 | consumed samples:        11776 | consumed tokens:     24117248 | elapsed time per iteration (s): 15.23 | learning rate: 3.859E-06 | global batch size:    16 | lm loss: 7.624114E+00 | grad norm: 1.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      737/  128728 | consumed samples:        11792 | consumed tokens:     24150016 | elapsed time per iteration (s): 15.21 | learning rate: 3.864E-06 | global batch size:    16 | lm loss: 7.622774E+00 | grad norm: 1.564 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      738/  128728 | consumed samples:        11808 | consumed tokens:     24182784 | elapsed time per iteration (s): 15.20 | learning rate: 3.869E-06 | global batch size:    16 | lm loss: 7.684864E+00 | grad norm: 1.594 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      739/  128728 | consumed samples:        11824 | consumed tokens:     24215552 | elapsed time per iteration (s): 15.21 | learning rate: 3.874E-06 | global batch size:    16 | lm loss: 7.810888E+00 | grad norm: 1.345 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      740/  128728 | consumed samples:        11840 | consumed tokens:     24248320 | elapsed time per iteration (s): 15.22 | learning rate: 3.880E-06 | global batch size:    16 | lm loss: 7.660820E+00 | grad norm: 1.321 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      741/  128728 | consumed samples:        11856 | consumed tokens:     24281088 | elapsed time per iteration (s): 15.25 | learning rate: 3.885E-06 | global batch size:    16 | lm loss: 7.710549E+00 | grad norm: 1.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      742/  128728 | consumed samples:        11872 | consumed tokens:     24313856 | elapsed time per iteration (s): 15.26 | learning rate: 3.890E-06 | global batch size:    16 | lm loss: 7.604763E+00 | grad norm: 1.633 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      743/  128728 | consumed samples:        11888 | consumed tokens:     24346624 | elapsed time per iteration (s): 15.24 | learning rate: 3.895E-06 | global batch size:    16 | lm loss: 7.866817E+00 | grad norm: 1.443 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      744/  128728 | consumed samples:        11904 | consumed tokens:     24379392 | elapsed time per iteration (s): 15.20 | learning rate: 3.901E-06 | global batch size:    16 | lm loss: 7.738415E+00 | grad norm: 1.214 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      745/  128728 | consumed samples:        11920 | consumed tokens:     24412160 | elapsed time per iteration (s): 15.26 | learning rate: 3.906E-06 | global batch size:    16 | lm loss: 7.592759E+00 | grad norm: 1.214 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      746/  128728 | consumed samples:        11936 | consumed tokens:     24444928 | elapsed time per iteration (s): 15.25 | learning rate: 3.911E-06 | global batch size:    16 | lm loss: 7.961129E+00 | grad norm: 2.540 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration      747/  128728 | consumed samples:        11952 | consumed tokens:     24477696 | elapsed time per iteration (s): 15.24 | learning rate: 3.916E-06 | global batch size:    16 | lm loss: 7.821071E+00 | grad norm: 2.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      748/  128728 | consumed samples:        11968 | consumed tokens:     24510464 | elapsed time per iteration (s): 15.24 | learning rate: 3.922E-06 | global batch size:    16 | lm loss: 7.662223E+00 | grad norm: 2.196 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      749/  128728 | consumed samples:        11984 | consumed tokens:     24543232 | elapsed time per iteration (s): 15.26 | learning rate: 3.927E-06 | global batch size:    16 | lm loss: 7.541708E+00 | grad norm: 1.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      750/  128728 | consumed samples:        12000 | consumed tokens:     24576000 | elapsed time per iteration (s): 15.26 | learning rate: 3.932E-06 | global batch size:    16 | lm loss: 7.680205E+00 | grad norm: 2.320 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      751/  128728 | consumed samples:        12016 | consumed tokens:     24608768 | elapsed time per iteration (s): 15.21 | learning rate: 3.937E-06 | global batch size:    16 | lm loss: 7.828065E+00 | grad norm: 2.392 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      752/  128728 | consumed samples:        12032 | consumed tokens:     24641536 | elapsed time per iteration (s): 15.18 | learning rate: 3.943E-06 | global batch size:    16 | lm loss: 7.748836E+00 | grad norm: 1.936 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      753/  128728 | consumed samples:        12048 | consumed tokens:     24674304 | elapsed time per iteration (s): 15.22 | learning rate: 3.948E-06 | global batch size:    16 | lm loss: 7.542229E+00 | grad norm: 1.632 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      754/  128728 | consumed samples:        12064 | consumed tokens:     24707072 | elapsed time per iteration (s): 15.27 | learning rate: 3.953E-06 | global batch size:    16 | lm loss: 7.863952E+00 | grad norm: 1.509 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      755/  128728 | consumed samples:        12080 | consumed tokens:     24739840 | elapsed time per iteration (s): 15.22 | learning rate: 3.958E-06 | global batch size:    16 | lm loss: 7.741620E+00 | grad norm: 1.665 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      756/  128728 | consumed samples:        12096 | consumed tokens:     24772608 | elapsed time per iteration (s): 15.19 | learning rate: 3.964E-06 | global batch size:    16 | lm loss: 7.805386E+00 | grad norm: 1.967 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      757/  128728 | consumed samples:        12112 | consumed tokens:     24805376 | elapsed time per iteration (s): 15.24 | learning rate: 3.969E-06 | global batch size:    16 | lm loss: 7.797132E+00 | grad norm: 1.614 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      758/  128728 | consumed samples:        12128 | consumed tokens:     24838144 | elapsed time per iteration (s): 15.21 | learning rate: 3.974E-06 | global batch size:    16 | lm loss: 7.813857E+00 | grad norm: 1.976 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      759/  128728 | consumed samples:        12144 | consumed tokens:     24870912 | elapsed time per iteration (s): 15.26 | learning rate: 3.979E-06 | global batch size:    16 | lm loss: 7.982421E+00 | grad norm: 1.430 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      760/  128728 | consumed samples:        12160 | consumed tokens:     24903680 | elapsed time per iteration (s): 15.22 | learning rate: 3.985E-06 | global batch size:    16 | lm loss: 7.646877E+00 | grad norm: 1.612 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      761/  128728 | consumed samples:        12176 | consumed tokens:     24936448 | elapsed time per iteration (s): 15.27 | learning rate: 3.990E-06 | global batch size:    16 | lm loss: 7.730046E+00 | grad norm: 1.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      762/  128728 | consumed samples:        12192 | consumed tokens:     24969216 | elapsed time per iteration (s): 15.23 | learning rate: 3.995E-06 | global batch size:    16 | lm loss: 7.657454E+00 | grad norm: 1.866 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      763/  128728 | consumed samples:        12208 | consumed tokens:     25001984 | elapsed time per iteration (s): 15.24 | learning rate: 4.000E-06 | global batch size:    16 | lm loss: 7.702982E+00 | grad norm: 2.398 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      764/  128728 | consumed samples:        12224 | consumed tokens:     25034752 | elapsed time per iteration (s): 15.21 | learning rate: 4.006E-06 | global batch size:    16 | lm loss: 7.730881E+00 | grad norm: 1.939 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      765/  128728 | consumed samples:        12240 | consumed tokens:     25067520 | elapsed time per iteration (s): 15.23 | learning rate: 4.011E-06 | global batch size:    16 | lm loss: 7.581010E+00 | grad norm: 1.566 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      766/  128728 | consumed samples:        12256 | consumed tokens:     25100288 | elapsed time per iteration (s): 15.26 | learning rate: 4.016E-06 | global batch size:    16 | lm loss: 7.640491E+00 | grad norm: 1.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      767/  128728 | consumed samples:        12272 | consumed tokens:     25133056 | elapsed time per iteration (s): 15.20 | learning rate: 4.021E-06 | global batch size:    16 | lm loss: 7.722907E+00 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      768/  128728 | consumed samples:        12288 | consumed tokens:     25165824 | elapsed time per iteration (s): 15.24 | learning rate: 4.027E-06 | global batch size:    16 | lm loss: 7.654436E+00 | grad norm: 1.597 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      769/  128728 | consumed samples:        12304 | consumed tokens:     25198592 | elapsed time per iteration (s): 15.83 | learning rate: 4.032E-06 | global batch size:    16 | lm loss: 7.346375E+00 | grad norm: 1.381 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.011 | TFLOPs: 7.74 |
[default7]: iteration      770/  128728 | consumed samples:        12320 | consumed tokens:     25231360 | elapsed time per iteration (s): 15.11 | learning rate: 4.037E-06 | global batch size:    16 | lm loss: 7.761549E+00 | grad norm: 1.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.059 | TFLOPs: 8.11 |
[default7]: iteration      771/  128728 | consumed samples:        12336 | consumed tokens:     25264128 | elapsed time per iteration (s): 15.03 | learning rate: 4.042E-06 | global batch size:    16 | lm loss: 7.792974E+00 | grad norm: 1.842 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.065 | TFLOPs: 8.15 |
[default7]: iteration      772/  128728 | consumed samples:        12352 | consumed tokens:     25296896 | elapsed time per iteration (s): 15.04 | learning rate: 4.048E-06 | global batch size:    16 | lm loss: 7.731169E+00 | grad norm: 1.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.064 | TFLOPs: 8.14 |
[default7]: iteration      773/  128728 | consumed samples:        12368 | consumed tokens:     25329664 | elapsed time per iteration (s): 15.05 | learning rate: 4.053E-06 | global batch size:    16 | lm loss: 7.725765E+00 | grad norm: 1.304 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.063 | TFLOPs: 8.14 |
[default7]: iteration      774/  128728 | consumed samples:        12384 | consumed tokens:     25362432 | elapsed time per iteration (s): 15.11 | learning rate: 4.058E-06 | global batch size:    16 | lm loss: 7.766714E+00 | grad norm: 1.359 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.059 | TFLOPs: 8.11 |
[default7]: iteration      775/  128728 | consumed samples:        12400 | consumed tokens:     25395200 | elapsed time per iteration (s): 15.10 | learning rate: 4.063E-06 | global batch size:    16 | lm loss: 7.545648E+00 | grad norm: 1.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.060 | TFLOPs: 8.11 |
[default7]: iteration      776/  128728 | consumed samples:        12416 | consumed tokens:     25427968 | elapsed time per iteration (s): 15.06 | learning rate: 4.068E-06 | global batch size:    16 | lm loss: 7.570961E+00 | grad norm: 1.884 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.062 | TFLOPs: 8.13 |
[default7]: iteration      777/  128728 | consumed samples:        12432 | consumed tokens:     25460736 | elapsed time per iteration (s): 15.12 | learning rate: 4.074E-06 | global batch size:    16 | lm loss: 7.759934E+00 | grad norm: 1.397 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.058 | TFLOPs: 8.10 |
[default7]: iteration      778/  128728 | consumed samples:        12448 | consumed tokens:     25493504 | elapsed time per iteration (s): 15.07 | learning rate: 4.079E-06 | global batch size:    16 | lm loss: 7.718737E+00 | grad norm: 1.466 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.062 | TFLOPs: 8.13 |
[default7]: iteration      779/  128728 | consumed samples:        12464 | consumed tokens:     25526272 | elapsed time per iteration (s): 15.05 | learning rate: 4.084E-06 | global batch size:    16 | lm loss: 7.721785E+00 | grad norm: 1.368 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.063 | TFLOPs: 8.14 |
[default7]: iteration      780/  128728 | consumed samples:        12480 | consumed tokens:     25559040 | elapsed time per iteration (s): 15.07 | learning rate: 4.089E-06 | global batch size:    16 | lm loss: 7.713555E+00 | grad norm: 1.278 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.061 | TFLOPs: 8.13 |
[default7]: iteration      781/  128728 | consumed samples:        12496 | consumed tokens:     25591808 | elapsed time per iteration (s): 14.98 | learning rate: 4.095E-06 | global batch size:    16 | lm loss: 7.670259E+00 | grad norm: 1.086 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.068 | TFLOPs: 8.18 |
[default7]: iteration      782/  128728 | consumed samples:        12512 | consumed tokens:     25624576 | elapsed time per iteration (s): 15.03 | learning rate: 4.100E-06 | global batch size:    16 | lm loss: 7.595325E+00 | grad norm: 1.156 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.064 | TFLOPs: 8.15 |
[default7]: iteration      783/  128728 | consumed samples:        12528 | consumed tokens:     25657344 | elapsed time per iteration (s): 15.13 | learning rate: 4.105E-06 | global batch size:    16 | lm loss: 7.531812E+00 | grad norm: 1.057 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.058 | TFLOPs: 8.10 |
[default7]: iteration      784/  128728 | consumed samples:        12544 | consumed tokens:     25690112 | elapsed time per iteration (s): 15.07 | learning rate: 4.110E-06 | global batch size:    16 | lm loss: 7.455743E+00 | grad norm: 1.269 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.062 | TFLOPs: 8.13 |
[default7]: iteration      785/  128728 | consumed samples:        12560 | consumed tokens:     25722880 | elapsed time per iteration (s): 15.05 | learning rate: 4.116E-06 | global batch size:    16 | lm loss: 7.543957E+00 | grad norm: 1.343 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.063 | TFLOPs: 8.14 |
[default7]: iteration      786/  128728 | consumed samples:        12576 | consumed tokens:     25755648 | elapsed time per iteration (s): 15.07 | learning rate: 4.121E-06 | global batch size:    16 | lm loss: 7.538573E+00 | grad norm: 1.470 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.062 | TFLOPs: 8.13 |
[default7]: iteration      787/  128728 | consumed samples:        12592 | consumed tokens:     25788416 | elapsed time per iteration (s): 15.05 | learning rate: 4.126E-06 | global batch size:    16 | lm loss: 7.473088E+00 | grad norm: 1.309 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.063 | TFLOPs: 8.14 |
[default7]: iteration      788/  128728 | consumed samples:        12608 | consumed tokens:     25821184 | elapsed time per iteration (s): 15.04 | learning rate: 4.131E-06 | global batch size:    16 | lm loss: 7.735221E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.064 | TFLOPs: 8.15 |
[default7]: iteration      789/  128728 | consumed samples:        12624 | consumed tokens:     25853952 | elapsed time per iteration (s): 15.08 | learning rate: 4.137E-06 | global batch size:    16 | lm loss: 7.478712E+00 | grad norm: 1.605 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.061 | TFLOPs: 8.12 |
[default7]: iteration      790/  128728 | consumed samples:        12640 | consumed tokens:     25886720 | elapsed time per iteration (s): 15.02 | learning rate: 4.142E-06 | global batch size:    16 | lm loss: 7.608325E+00 | grad norm: 1.282 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.065 | TFLOPs: 8.16 |
[default7]: iteration      791/  128728 | consumed samples:        12656 | consumed tokens:     25919488 | elapsed time per iteration (s): 15.06 | learning rate: 4.147E-06 | global batch size:    16 | lm loss: 7.450841E+00 | grad norm: 1.401 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.062 | TFLOPs: 8.13 |
[default7]: iteration      792/  128728 | consumed samples:        12672 | consumed tokens:     25952256 | elapsed time per iteration (s): 15.03 | learning rate: 4.152E-06 | global batch size:    16 | lm loss: 7.622550E+00 | grad norm: 1.494 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.065 | TFLOPs: 8.15 |
[default7]: iteration      793/  128728 | consumed samples:        12688 | consumed tokens:     25985024 | elapsed time per iteration (s): 15.13 | learning rate: 4.158E-06 | global batch size:    16 | lm loss: 7.475448E+00 | grad norm: 1.436 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.058 | TFLOPs: 8.10 |
[default7]: iteration      794/  128728 | consumed samples:        12704 | consumed tokens:     26017792 | elapsed time per iteration (s): 15.07 | learning rate: 4.163E-06 | global batch size:    16 | lm loss: 7.738382E+00 | grad norm: 1.183 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.062 | TFLOPs: 8.13 |
[default7]: iteration      795/  128728 | consumed samples:        12720 | consumed tokens:     26050560 | elapsed time per iteration (s): 14.97 | learning rate: 4.168E-06 | global batch size:    16 | lm loss: 7.791917E+00 | grad norm: 1.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.069 | TFLOPs: 8.18 |
[default7]: iteration      796/  128728 | consumed samples:        12736 | consumed tokens:     26083328 | elapsed time per iteration (s): 14.84 | learning rate: 4.173E-06 | global batch size:    16 | lm loss: 7.620174E+00 | grad norm: 1.641 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.078 | TFLOPs: 8.25 |
[default7]: iteration      797/  128728 | consumed samples:        12752 | consumed tokens:     26116096 | elapsed time per iteration (s): 15.25 | learning rate: 4.179E-06 | global batch size:    16 | lm loss: 7.478716E+00 | grad norm: 1.076 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      798/  128728 | consumed samples:        12768 | consumed tokens:     26148864 | elapsed time per iteration (s): 15.23 | learning rate: 4.184E-06 | global batch size:    16 | lm loss: 7.694183E+00 | grad norm: 2.498 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      799/  128728 | consumed samples:        12784 | consumed tokens:     26181632 | elapsed time per iteration (s): 15.22 | learning rate: 4.189E-06 | global batch size:    16 | lm loss: 7.490611E+00 | grad norm: 1.571 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      800/  128728 | consumed samples:        12800 | consumed tokens:     26214400 | elapsed time per iteration (s): 15.24 | learning rate: 4.194E-06 | global batch size:    16 | lm loss: 7.733963E+00 | grad norm: 2.020 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      801/  128728 | consumed samples:        12816 | consumed tokens:     26247168 | elapsed time per iteration (s): 15.26 | learning rate: 4.200E-06 | global batch size:    16 | lm loss: 7.516152E+00 | grad norm: 1.338 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      802/  128728 | consumed samples:        12832 | consumed tokens:     26279936 | elapsed time per iteration (s): 15.23 | learning rate: 4.205E-06 | global batch size:    16 | lm loss: 7.613828E+00 | grad norm: 2.668 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      803/  128728 | consumed samples:        12848 | consumed tokens:     26312704 | elapsed time per iteration (s): 15.24 | learning rate: 4.210E-06 | global batch size:    16 | lm loss: 7.903152E+00 | grad norm: 1.109 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      804/  128728 | consumed samples:        12864 | consumed tokens:     26345472 | elapsed time per iteration (s): 15.30 | learning rate: 4.215E-06 | global batch size:    16 | lm loss: 7.665509E+00 | grad norm: 1.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration      805/  128728 | consumed samples:        12880 | consumed tokens:     26378240 | elapsed time per iteration (s): 15.25 | learning rate: 4.221E-06 | global batch size:    16 | lm loss: 7.686241E+00 | grad norm: 1.219 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      806/  128728 | consumed samples:        12896 | consumed tokens:     26411008 | elapsed time per iteration (s): 15.24 | learning rate: 4.226E-06 | global batch size:    16 | lm loss: 7.861027E+00 | grad norm: 1.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      807/  128728 | consumed samples:        12912 | consumed tokens:     26443776 | elapsed time per iteration (s): 15.24 | learning rate: 4.231E-06 | global batch size:    16 | lm loss: 7.592918E+00 | grad norm: 1.554 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      808/  128728 | consumed samples:        12928 | consumed tokens:     26476544 | elapsed time per iteration (s): 15.21 | learning rate: 4.236E-06 | global batch size:    16 | lm loss: 7.650827E+00 | grad norm: 1.592 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      809/  128728 | consumed samples:        12944 | consumed tokens:     26509312 | elapsed time per iteration (s): 15.22 | learning rate: 4.242E-06 | global batch size:    16 | lm loss: 7.584604E+00 | grad norm: 1.414 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      810/  128728 | consumed samples:        12960 | consumed tokens:     26542080 | elapsed time per iteration (s): 15.23 | learning rate: 4.247E-06 | global batch size:    16 | lm loss: 7.401367E+00 | grad norm: 1.258 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      811/  128728 | consumed samples:        12976 | consumed tokens:     26574848 | elapsed time per iteration (s): 15.27 | learning rate: 4.252E-06 | global batch size:    16 | lm loss: 7.733647E+00 | grad norm: 1.621 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      812/  128728 | consumed samples:        12992 | consumed tokens:     26607616 | elapsed time per iteration (s): 15.25 | learning rate: 4.257E-06 | global batch size:    16 | lm loss: 7.667072E+00 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      813/  128728 | consumed samples:        13008 | consumed tokens:     26640384 | elapsed time per iteration (s): 15.23 | learning rate: 4.262E-06 | global batch size:    16 | lm loss: 7.803669E+00 | grad norm: 1.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      814/  128728 | consumed samples:        13024 | consumed tokens:     26673152 | elapsed time per iteration (s): 15.20 | learning rate: 4.268E-06 | global batch size:    16 | lm loss: 7.590942E+00 | grad norm: 1.495 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      815/  128728 | consumed samples:        13040 | consumed tokens:     26705920 | elapsed time per iteration (s): 15.18 | learning rate: 4.273E-06 | global batch size:    16 | lm loss: 7.517165E+00 | grad norm: 1.203 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      816/  128728 | consumed samples:        13056 | consumed tokens:     26738688 | elapsed time per iteration (s): 15.21 | learning rate: 4.278E-06 | global batch size:    16 | lm loss: 7.709677E+00 | grad norm: 1.526 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      817/  128728 | consumed samples:        13072 | consumed tokens:     26771456 | elapsed time per iteration (s): 15.23 | learning rate: 4.283E-06 | global batch size:    16 | lm loss: 7.403444E+00 | grad norm: 1.416 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      818/  128728 | consumed samples:        13088 | consumed tokens:     26804224 | elapsed time per iteration (s): 15.17 | learning rate: 4.289E-06 | global batch size:    16 | lm loss: 8.024632E+00 | grad norm: 1.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration      819/  128728 | consumed samples:        13104 | consumed tokens:     26836992 | elapsed time per iteration (s): 15.20 | learning rate: 4.294E-06 | global batch size:    16 | lm loss: 7.400269E+00 | grad norm: 2.521 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      820/  128728 | consumed samples:        13120 | consumed tokens:     26869760 | elapsed time per iteration (s): 15.23 | learning rate: 4.299E-06 | global batch size:    16 | lm loss: 7.701864E+00 | grad norm: 1.221 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      821/  128728 | consumed samples:        13136 | consumed tokens:     26902528 | elapsed time per iteration (s): 15.24 | learning rate: 4.304E-06 | global batch size:    16 | lm loss: 7.637981E+00 | grad norm: 1.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      822/  128728 | consumed samples:        13152 | consumed tokens:     26935296 | elapsed time per iteration (s): 15.23 | learning rate: 4.310E-06 | global batch size:    16 | lm loss: 7.643306E+00 | grad norm: 2.043 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      823/  128728 | consumed samples:        13168 | consumed tokens:     26968064 | elapsed time per iteration (s): 15.21 | learning rate: 4.315E-06 | global batch size:    16 | lm loss: 7.696064E+00 | grad norm: 2.010 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      824/  128728 | consumed samples:        13184 | consumed tokens:     27000832 | elapsed time per iteration (s): 15.21 | learning rate: 4.320E-06 | global batch size:    16 | lm loss: 7.558313E+00 | grad norm: 1.300 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      825/  128728 | consumed samples:        13200 | consumed tokens:     27033600 | elapsed time per iteration (s): 15.24 | learning rate: 4.325E-06 | global batch size:    16 | lm loss: 7.669714E+00 | grad norm: 2.262 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      826/  128728 | consumed samples:        13216 | consumed tokens:     27066368 | elapsed time per iteration (s): 15.22 | learning rate: 4.331E-06 | global batch size:    16 | lm loss: 7.444411E+00 | grad norm: 1.357 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      827/  128728 | consumed samples:        13232 | consumed tokens:     27099136 | elapsed time per iteration (s): 15.22 | learning rate: 4.336E-06 | global batch size:    16 | lm loss: 7.498442E+00 | grad norm: 1.447 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      828/  128728 | consumed samples:        13248 | consumed tokens:     27131904 | elapsed time per iteration (s): 15.22 | learning rate: 4.341E-06 | global batch size:    16 | lm loss: 7.616692E+00 | grad norm: 1.572 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      829/  128728 | consumed samples:        13264 | consumed tokens:     27164672 | elapsed time per iteration (s): 15.16 | learning rate: 4.346E-06 | global batch size:    16 | lm loss: 7.807779E+00 | grad norm: 1.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      830/  128728 | consumed samples:        13280 | consumed tokens:     27197440 | elapsed time per iteration (s): 15.17 | learning rate: 4.352E-06 | global batch size:    16 | lm loss: 7.562619E+00 | grad norm: 1.181 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration      831/  128728 | consumed samples:        13296 | consumed tokens:     27230208 | elapsed time per iteration (s): 15.24 | learning rate: 4.357E-06 | global batch size:    16 | lm loss: 7.482844E+00 | grad norm: 3.129 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      832/  128728 | consumed samples:        13312 | consumed tokens:     27262976 | elapsed time per iteration (s): 15.26 | learning rate: 4.362E-06 | global batch size:    16 | lm loss: 7.744891E+00 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      833/  128728 | consumed samples:        13328 | consumed tokens:     27295744 | elapsed time per iteration (s): 15.25 | learning rate: 4.367E-06 | global batch size:    16 | lm loss: 7.532849E+00 | grad norm: 1.263 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      834/  128728 | consumed samples:        13344 | consumed tokens:     27328512 | elapsed time per iteration (s): 15.24 | learning rate: 4.373E-06 | global batch size:    16 | lm loss: 7.463634E+00 | grad norm: 1.458 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      835/  128728 | consumed samples:        13360 | consumed tokens:     27361280 | elapsed time per iteration (s): 15.22 | learning rate: 4.378E-06 | global batch size:    16 | lm loss: 7.629139E+00 | grad norm: 1.162 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      836/  128728 | consumed samples:        13376 | consumed tokens:     27394048 | elapsed time per iteration (s): 15.23 | learning rate: 4.383E-06 | global batch size:    16 | lm loss: 7.463190E+00 | grad norm: 1.208 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      837/  128728 | consumed samples:        13392 | consumed tokens:     27426816 | elapsed time per iteration (s): 15.27 | learning rate: 4.388E-06 | global batch size:    16 | lm loss: 7.357310E+00 | grad norm: 1.178 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      838/  128728 | consumed samples:        13408 | consumed tokens:     27459584 | elapsed time per iteration (s): 15.27 | learning rate: 4.394E-06 | global batch size:    16 | lm loss: 7.757633E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      839/  128728 | consumed samples:        13424 | consumed tokens:     27492352 | elapsed time per iteration (s): 15.27 | learning rate: 4.399E-06 | global batch size:    16 | lm loss: 7.545015E+00 | grad norm: 1.190 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      840/  128728 | consumed samples:        13440 | consumed tokens:     27525120 | elapsed time per iteration (s): 15.25 | learning rate: 4.404E-06 | global batch size:    16 | lm loss: 7.411932E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      841/  128728 | consumed samples:        13456 | consumed tokens:     27557888 | elapsed time per iteration (s): 15.25 | learning rate: 4.409E-06 | global batch size:    16 | lm loss: 7.422668E+00 | grad norm: 1.191 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      842/  128728 | consumed samples:        13472 | consumed tokens:     27590656 | elapsed time per iteration (s): 15.18 | learning rate: 4.415E-06 | global batch size:    16 | lm loss: 7.665534E+00 | grad norm: 1.146 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      843/  128728 | consumed samples:        13488 | consumed tokens:     27623424 | elapsed time per iteration (s): 15.25 | learning rate: 4.420E-06 | global batch size:    16 | lm loss: 7.618068E+00 | grad norm: 1.068 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      844/  128728 | consumed samples:        13504 | consumed tokens:     27656192 | elapsed time per iteration (s): 15.18 | learning rate: 4.425E-06 | global batch size:    16 | lm loss: 7.596480E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      845/  128728 | consumed samples:        13520 | consumed tokens:     27688960 | elapsed time per iteration (s): 15.17 | learning rate: 4.430E-06 | global batch size:    16 | lm loss: 7.562824E+00 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      846/  128728 | consumed samples:        13536 | consumed tokens:     27721728 | elapsed time per iteration (s): 15.26 | learning rate: 4.435E-06 | global batch size:    16 | lm loss: 7.561560E+00 | grad norm: 1.276 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      847/  128728 | consumed samples:        13552 | consumed tokens:     27754496 | elapsed time per iteration (s): 15.22 | learning rate: 4.441E-06 | global batch size:    16 | lm loss: 7.958152E+00 | grad norm: 1.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      848/  128728 | consumed samples:        13568 | consumed tokens:     27787264 | elapsed time per iteration (s): 15.18 | learning rate: 4.446E-06 | global batch size:    16 | lm loss: 7.501763E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      849/  128728 | consumed samples:        13584 | consumed tokens:     27820032 | elapsed time per iteration (s): 15.26 | learning rate: 4.451E-06 | global batch size:    16 | lm loss: 7.435274E+00 | grad norm: 1.131 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      850/  128728 | consumed samples:        13600 | consumed tokens:     27852800 | elapsed time per iteration (s): 15.26 | learning rate: 4.456E-06 | global batch size:    16 | lm loss: 7.425239E+00 | grad norm: 1.256 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      851/  128728 | consumed samples:        13616 | consumed tokens:     27885568 | elapsed time per iteration (s): 15.26 | learning rate: 4.462E-06 | global batch size:    16 | lm loss: 7.559560E+00 | grad norm: 1.333 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      852/  128728 | consumed samples:        13632 | consumed tokens:     27918336 | elapsed time per iteration (s): 15.24 | learning rate: 4.467E-06 | global batch size:    16 | lm loss: 7.470264E+00 | grad norm: 1.636 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      853/  128728 | consumed samples:        13648 | consumed tokens:     27951104 | elapsed time per iteration (s): 15.18 | learning rate: 4.472E-06 | global batch size:    16 | lm loss: 7.504191E+00 | grad norm: 1.602 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      854/  128728 | consumed samples:        13664 | consumed tokens:     27983872 | elapsed time per iteration (s): 15.23 | learning rate: 4.477E-06 | global batch size:    16 | lm loss: 7.452326E+00 | grad norm: 1.309 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      855/  128728 | consumed samples:        13680 | consumed tokens:     28016640 | elapsed time per iteration (s): 15.18 | learning rate: 4.483E-06 | global batch size:    16 | lm loss: 7.583494E+00 | grad norm: 1.291 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      856/  128728 | consumed samples:        13696 | consumed tokens:     28049408 | elapsed time per iteration (s): 15.25 | learning rate: 4.488E-06 | global batch size:    16 | lm loss: 7.333179E+00 | grad norm: 1.127 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      857/  128728 | consumed samples:        13712 | consumed tokens:     28082176 | elapsed time per iteration (s): 15.24 | learning rate: 4.493E-06 | global batch size:    16 | lm loss: 7.519557E+00 | grad norm: 1.062 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      858/  128728 | consumed samples:        13728 | consumed tokens:     28114944 | elapsed time per iteration (s): 15.25 | learning rate: 4.498E-06 | global batch size:    16 | lm loss: 7.641896E+00 | grad norm: 1.367 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      859/  128728 | consumed samples:        13744 | consumed tokens:     28147712 | elapsed time per iteration (s): 15.17 | learning rate: 4.504E-06 | global batch size:    16 | lm loss: 7.602086E+00 | grad norm: 1.272 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      860/  128728 | consumed samples:        13760 | consumed tokens:     28180480 | elapsed time per iteration (s): 15.19 | learning rate: 4.509E-06 | global batch size:    16 | lm loss: 7.520714E+00 | grad norm: 1.215 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration      861/  128728 | consumed samples:        13776 | consumed tokens:     28213248 | elapsed time per iteration (s): 15.22 | learning rate: 4.514E-06 | global batch size:    16 | lm loss: 7.511874E+00 | grad norm: 1.317 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      862/  128728 | consumed samples:        13792 | consumed tokens:     28246016 | elapsed time per iteration (s): 15.16 | learning rate: 4.519E-06 | global batch size:    16 | lm loss: 7.545038E+00 | grad norm: 1.074 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      863/  128728 | consumed samples:        13808 | consumed tokens:     28278784 | elapsed time per iteration (s): 15.21 | learning rate: 4.525E-06 | global batch size:    16 | lm loss: 7.392710E+00 | grad norm: 1.944 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      864/  128728 | consumed samples:        13824 | consumed tokens:     28311552 | elapsed time per iteration (s): 15.26 | learning rate: 4.530E-06 | global batch size:    16 | lm loss: 7.715175E+00 | grad norm: 1.471 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      865/  128728 | consumed samples:        13840 | consumed tokens:     28344320 | elapsed time per iteration (s): 15.22 | learning rate: 4.535E-06 | global batch size:    16 | lm loss: 7.498834E+00 | grad norm: 1.192 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      866/  128728 | consumed samples:        13856 | consumed tokens:     28377088 | elapsed time per iteration (s): 15.25 | learning rate: 4.540E-06 | global batch size:    16 | lm loss: 7.556900E+00 | grad norm: 1.534 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      867/  128728 | consumed samples:        13872 | consumed tokens:     28409856 | elapsed time per iteration (s): 15.21 | learning rate: 4.546E-06 | global batch size:    16 | lm loss: 7.598176E+00 | grad norm: 1.360 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      868/  128728 | consumed samples:        13888 | consumed tokens:     28442624 | elapsed time per iteration (s): 15.24 | learning rate: 4.551E-06 | global batch size:    16 | lm loss: 7.491490E+00 | grad norm: 1.580 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      869/  128728 | consumed samples:        13904 | consumed tokens:     28475392 | elapsed time per iteration (s): 15.26 | learning rate: 4.556E-06 | global batch size:    16 | lm loss: 7.520513E+00 | grad norm: 1.223 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      870/  128728 | consumed samples:        13920 | consumed tokens:     28508160 | elapsed time per iteration (s): 15.23 | learning rate: 4.561E-06 | global batch size:    16 | lm loss: 7.169995E+00 | grad norm: 1.862 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      871/  128728 | consumed samples:        13936 | consumed tokens:     28540928 | elapsed time per iteration (s): 15.20 | learning rate: 4.567E-06 | global batch size:    16 | lm loss: 7.613565E+00 | grad norm: 1.412 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      872/  128728 | consumed samples:        13952 | consumed tokens:     28573696 | elapsed time per iteration (s): 15.23 | learning rate: 4.572E-06 | global batch size:    16 | lm loss: 7.603791E+00 | grad norm: 1.376 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      873/  128728 | consumed samples:        13968 | consumed tokens:     28606464 | elapsed time per iteration (s): 15.26 | learning rate: 4.577E-06 | global batch size:    16 | lm loss: 7.504703E+00 | grad norm: 2.425 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      874/  128728 | consumed samples:        13984 | consumed tokens:     28639232 | elapsed time per iteration (s): 15.25 | learning rate: 4.582E-06 | global batch size:    16 | lm loss: 7.594444E+00 | grad norm: 1.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      875/  128728 | consumed samples:        14000 | consumed tokens:     28672000 | elapsed time per iteration (s): 15.19 | learning rate: 4.588E-06 | global batch size:    16 | lm loss: 7.600210E+00 | grad norm: 1.639 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      876/  128728 | consumed samples:        14016 | consumed tokens:     28704768 | elapsed time per iteration (s): 15.22 | learning rate: 4.593E-06 | global batch size:    16 | lm loss: 7.522717E+00 | grad norm: 1.620 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      877/  128728 | consumed samples:        14032 | consumed tokens:     28737536 | elapsed time per iteration (s): 15.26 | learning rate: 4.598E-06 | global batch size:    16 | lm loss: 7.450993E+00 | grad norm: 1.517 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      878/  128728 | consumed samples:        14048 | consumed tokens:     28770304 | elapsed time per iteration (s): 15.26 | learning rate: 4.603E-06 | global batch size:    16 | lm loss: 7.297291E+00 | grad norm: 1.619 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      879/  128728 | consumed samples:        14064 | consumed tokens:     28803072 | elapsed time per iteration (s): 15.22 | learning rate: 4.609E-06 | global batch size:    16 | lm loss: 7.489501E+00 | grad norm: 2.042 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      880/  128728 | consumed samples:        14080 | consumed tokens:     28835840 | elapsed time per iteration (s): 15.24 | learning rate: 4.614E-06 | global batch size:    16 | lm loss: 7.403663E+00 | grad norm: 1.527 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      881/  128728 | consumed samples:        14096 | consumed tokens:     28868608 | elapsed time per iteration (s): 15.23 | learning rate: 4.619E-06 | global batch size:    16 | lm loss: 7.537346E+00 | grad norm: 1.324 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      882/  128728 | consumed samples:        14112 | consumed tokens:     28901376 | elapsed time per iteration (s): 15.23 | learning rate: 4.624E-06 | global batch size:    16 | lm loss: 7.363647E+00 | grad norm: 1.856 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      883/  128728 | consumed samples:        14128 | consumed tokens:     28934144 | elapsed time per iteration (s): 15.24 | learning rate: 4.629E-06 | global batch size:    16 | lm loss: 7.634407E+00 | grad norm: 1.268 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      884/  128728 | consumed samples:        14144 | consumed tokens:     28966912 | elapsed time per iteration (s): 15.22 | learning rate: 4.635E-06 | global batch size:    16 | lm loss: 7.377182E+00 | grad norm: 1.882 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      885/  128728 | consumed samples:        14160 | consumed tokens:     28999680 | elapsed time per iteration (s): 15.24 | learning rate: 4.640E-06 | global batch size:    16 | lm loss: 7.484207E+00 | grad norm: 1.435 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      886/  128728 | consumed samples:        14176 | consumed tokens:     29032448 | elapsed time per iteration (s): 15.23 | learning rate: 4.645E-06 | global batch size:    16 | lm loss: 7.508356E+00 | grad norm: 1.357 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      887/  128728 | consumed samples:        14192 | consumed tokens:     29065216 | elapsed time per iteration (s): 15.24 | learning rate: 4.650E-06 | global batch size:    16 | lm loss: 7.583908E+00 | grad norm: 1.316 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      888/  128728 | consumed samples:        14208 | consumed tokens:     29097984 | elapsed time per iteration (s): 15.25 | learning rate: 4.656E-06 | global batch size:    16 | lm loss: 7.400177E+00 | grad norm: 1.628 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      889/  128728 | consumed samples:        14224 | consumed tokens:     29130752 | elapsed time per iteration (s): 15.23 | learning rate: 4.661E-06 | global batch size:    16 | lm loss: 7.434398E+00 | grad norm: 1.261 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      890/  128728 | consumed samples:        14240 | consumed tokens:     29163520 | elapsed time per iteration (s): 15.23 | learning rate: 4.666E-06 | global batch size:    16 | lm loss: 7.919844E+00 | grad norm: 1.676 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      891/  128728 | consumed samples:        14256 | consumed tokens:     29196288 | elapsed time per iteration (s): 15.19 | learning rate: 4.671E-06 | global batch size:    16 | lm loss: 7.375011E+00 | grad norm: 2.644 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration      892/  128728 | consumed samples:        14272 | consumed tokens:     29229056 | elapsed time per iteration (s): 15.30 | learning rate: 4.677E-06 | global batch size:    16 | lm loss: 7.455361E+00 | grad norm: 1.290 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration      893/  128728 | consumed samples:        14288 | consumed tokens:     29261824 | elapsed time per iteration (s): 15.25 | learning rate: 4.682E-06 | global batch size:    16 | lm loss: 7.363049E+00 | grad norm: 1.175 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      894/  128728 | consumed samples:        14304 | consumed tokens:     29294592 | elapsed time per iteration (s): 15.22 | learning rate: 4.687E-06 | global batch size:    16 | lm loss: 7.459336E+00 | grad norm: 1.293 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      895/  128728 | consumed samples:        14320 | consumed tokens:     29327360 | elapsed time per iteration (s): 15.26 | learning rate: 4.692E-06 | global batch size:    16 | lm loss: 7.505486E+00 | grad norm: 1.498 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      896/  128728 | consumed samples:        14336 | consumed tokens:     29360128 | elapsed time per iteration (s): 15.21 | learning rate: 4.698E-06 | global batch size:    16 | lm loss: 7.412171E+00 | grad norm: 1.231 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      897/  128728 | consumed samples:        14352 | consumed tokens:     29392896 | elapsed time per iteration (s): 15.29 | learning rate: 4.703E-06 | global batch size:    16 | lm loss: 7.677485E+00 | grad norm: 2.570 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration      898/  128728 | consumed samples:        14368 | consumed tokens:     29425664 | elapsed time per iteration (s): 15.31 | learning rate: 4.708E-06 | global batch size:    16 | lm loss: 7.416935E+00 | grad norm: 1.291 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.045 | TFLOPs: 8.00 |
[default7]: iteration      899/  128728 | consumed samples:        14384 | consumed tokens:     29458432 | elapsed time per iteration (s): 15.17 | learning rate: 4.713E-06 | global batch size:    16 | lm loss: 7.279807E+00 | grad norm: 2.479 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      900/  128728 | consumed samples:        14400 | consumed tokens:     29491200 | elapsed time per iteration (s): 15.22 | learning rate: 4.719E-06 | global batch size:    16 | lm loss: 7.462852E+00 | grad norm: 1.378 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      901/  128728 | consumed samples:        14416 | consumed tokens:     29523968 | elapsed time per iteration (s): 15.30 | learning rate: 4.724E-06 | global batch size:    16 | lm loss: 7.639120E+00 | grad norm: 1.333 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration      902/  128728 | consumed samples:        14432 | consumed tokens:     29556736 | elapsed time per iteration (s): 15.25 | learning rate: 4.729E-06 | global batch size:    16 | lm loss: 7.405077E+00 | grad norm: 1.354 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      903/  128728 | consumed samples:        14448 | consumed tokens:     29589504 | elapsed time per iteration (s): 15.28 | learning rate: 4.734E-06 | global batch size:    16 | lm loss: 7.423763E+00 | grad norm: 1.243 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration      904/  128728 | consumed samples:        14464 | consumed tokens:     29622272 | elapsed time per iteration (s): 15.26 | learning rate: 4.740E-06 | global batch size:    16 | lm loss: 7.548100E+00 | grad norm: 2.421 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      905/  128728 | consumed samples:        14480 | consumed tokens:     29655040 | elapsed time per iteration (s): 15.22 | learning rate: 4.745E-06 | global batch size:    16 | lm loss: 7.505497E+00 | grad norm: 1.492 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      906/  128728 | consumed samples:        14496 | consumed tokens:     29687808 | elapsed time per iteration (s): 15.22 | learning rate: 4.750E-06 | global batch size:    16 | lm loss: 7.657626E+00 | grad norm: 1.663 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      907/  128728 | consumed samples:        14512 | consumed tokens:     29720576 | elapsed time per iteration (s): 15.27 | learning rate: 4.755E-06 | global batch size:    16 | lm loss: 7.370772E+00 | grad norm: 1.471 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      908/  128728 | consumed samples:        14528 | consumed tokens:     29753344 | elapsed time per iteration (s): 15.24 | learning rate: 4.761E-06 | global batch size:    16 | lm loss: 7.308388E+00 | grad norm: 1.584 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      909/  128728 | consumed samples:        14544 | consumed tokens:     29786112 | elapsed time per iteration (s): 15.18 | learning rate: 4.766E-06 | global batch size:    16 | lm loss: 7.730386E+00 | grad norm: 1.707 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      910/  128728 | consumed samples:        14560 | consumed tokens:     29818880 | elapsed time per iteration (s): 15.27 | learning rate: 4.771E-06 | global batch size:    16 | lm loss: 7.448133E+00 | grad norm: 1.317 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      911/  128728 | consumed samples:        14576 | consumed tokens:     29851648 | elapsed time per iteration (s): 15.25 | learning rate: 4.776E-06 | global batch size:    16 | lm loss: 7.687496E+00 | grad norm: 1.577 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      912/  128728 | consumed samples:        14592 | consumed tokens:     29884416 | elapsed time per iteration (s): 15.22 | learning rate: 4.782E-06 | global batch size:    16 | lm loss: 7.360633E+00 | grad norm: 1.229 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      913/  128728 | consumed samples:        14608 | consumed tokens:     29917184 | elapsed time per iteration (s): 15.25 | learning rate: 4.787E-06 | global batch size:    16 | lm loss: 7.608915E+00 | grad norm: 1.138 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration      914/  128728 | consumed samples:        14624 | consumed tokens:     29949952 | elapsed time per iteration (s): 15.23 | learning rate: 4.792E-06 | global batch size:    16 | lm loss: 7.448811E+00 | grad norm: 1.322 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      915/  128728 | consumed samples:        14640 | consumed tokens:     29982720 | elapsed time per iteration (s): 15.21 | learning rate: 4.797E-06 | global batch size:    16 | lm loss: 7.706942E+00 | grad norm: 1.295 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      916/  128728 | consumed samples:        14656 | consumed tokens:     30015488 | elapsed time per iteration (s): 15.24 | learning rate: 4.802E-06 | global batch size:    16 | lm loss: 7.413746E+00 | grad norm: 1.522 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      917/  128728 | consumed samples:        14672 | consumed tokens:     30048256 | elapsed time per iteration (s): 15.22 | learning rate: 4.808E-06 | global batch size:    16 | lm loss: 7.521213E+00 | grad norm: 1.249 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      918/  128728 | consumed samples:        14688 | consumed tokens:     30081024 | elapsed time per iteration (s): 15.26 | learning rate: 4.813E-06 | global batch size:    16 | lm loss: 7.561061E+00 | grad norm: 1.997 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      919/  128728 | consumed samples:        14704 | consumed tokens:     30113792 | elapsed time per iteration (s): 15.22 | learning rate: 4.818E-06 | global batch size:    16 | lm loss: 7.206147E+00 | grad norm: 1.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      920/  128728 | consumed samples:        14720 | consumed tokens:     30146560 | elapsed time per iteration (s): 15.23 | learning rate: 4.823E-06 | global batch size:    16 | lm loss: 7.452460E+00 | grad norm: 1.208 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      921/  128728 | consumed samples:        14736 | consumed tokens:     30179328 | elapsed time per iteration (s): 15.27 | learning rate: 4.829E-06 | global batch size:    16 | lm loss: 7.200177E+00 | grad norm: 1.998 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      922/  128728 | consumed samples:        14752 | consumed tokens:     30212096 | elapsed time per iteration (s): 15.24 | learning rate: 4.834E-06 | global batch size:    16 | lm loss: 7.294092E+00 | grad norm: 1.499 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      923/  128728 | consumed samples:        14768 | consumed tokens:     30244864 | elapsed time per iteration (s): 15.28 | learning rate: 4.839E-06 | global batch size:    16 | lm loss: 7.780398E+00 | grad norm: 1.805 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration      924/  128728 | consumed samples:        14784 | consumed tokens:     30277632 | elapsed time per iteration (s): 15.24 | learning rate: 4.844E-06 | global batch size:    16 | lm loss: 7.456621E+00 | grad norm: 2.455 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      925/  128728 | consumed samples:        14800 | consumed tokens:     30310400 | elapsed time per iteration (s): 15.25 | learning rate: 4.850E-06 | global batch size:    16 | lm loss: 7.646140E+00 | grad norm: 1.438 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      926/  128728 | consumed samples:        14816 | consumed tokens:     30343168 | elapsed time per iteration (s): 15.24 | learning rate: 4.855E-06 | global batch size:    16 | lm loss: 7.587268E+00 | grad norm: 1.168 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      927/  128728 | consumed samples:        14832 | consumed tokens:     30375936 | elapsed time per iteration (s): 15.27 | learning rate: 4.860E-06 | global batch size:    16 | lm loss: 7.366327E+00 | grad norm: 1.278 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      928/  128728 | consumed samples:        14848 | consumed tokens:     30408704 | elapsed time per iteration (s): 15.20 | learning rate: 4.865E-06 | global batch size:    16 | lm loss: 7.437315E+00 | grad norm: 1.486 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      929/  128728 | consumed samples:        14864 | consumed tokens:     30441472 | elapsed time per iteration (s): 15.18 | learning rate: 4.871E-06 | global batch size:    16 | lm loss: 7.467528E+00 | grad norm: 1.623 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      930/  128728 | consumed samples:        14880 | consumed tokens:     30474240 | elapsed time per iteration (s): 15.21 | learning rate: 4.876E-06 | global batch size:    16 | lm loss: 7.356944E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      931/  128728 | consumed samples:        14896 | consumed tokens:     30507008 | elapsed time per iteration (s): 15.24 | learning rate: 4.881E-06 | global batch size:    16 | lm loss: 7.382359E+00 | grad norm: 1.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      932/  128728 | consumed samples:        14912 | consumed tokens:     30539776 | elapsed time per iteration (s): 15.24 | learning rate: 4.886E-06 | global batch size:    16 | lm loss: 7.406995E+00 | grad norm: 1.849 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      933/  128728 | consumed samples:        14928 | consumed tokens:     30572544 | elapsed time per iteration (s): 15.20 | learning rate: 4.892E-06 | global batch size:    16 | lm loss: 7.376684E+00 | grad norm: 1.453 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      934/  128728 | consumed samples:        14944 | consumed tokens:     30605312 | elapsed time per iteration (s): 15.24 | learning rate: 4.897E-06 | global batch size:    16 | lm loss: 7.531736E+00 | grad norm: 1.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      935/  128728 | consumed samples:        14960 | consumed tokens:     30638080 | elapsed time per iteration (s): 15.23 | learning rate: 4.902E-06 | global batch size:    16 | lm loss: 7.509977E+00 | grad norm: 1.438 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      936/  128728 | consumed samples:        14976 | consumed tokens:     30670848 | elapsed time per iteration (s): 15.24 | learning rate: 4.907E-06 | global batch size:    16 | lm loss: 7.370396E+00 | grad norm: 1.157 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      937/  128728 | consumed samples:        14992 | consumed tokens:     30703616 | elapsed time per iteration (s): 15.16 | learning rate: 4.913E-06 | global batch size:    16 | lm loss: 7.500789E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      938/  128728 | consumed samples:        15008 | consumed tokens:     30736384 | elapsed time per iteration (s): 15.21 | learning rate: 4.918E-06 | global batch size:    16 | lm loss: 7.531604E+00 | grad norm: 1.082 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      939/  128728 | consumed samples:        15024 | consumed tokens:     30769152 | elapsed time per iteration (s): 15.25 | learning rate: 4.923E-06 | global batch size:    16 | lm loss: 7.307188E+00 | grad norm: 1.447 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      940/  128728 | consumed samples:        15040 | consumed tokens:     30801920 | elapsed time per iteration (s): 15.18 | learning rate: 4.928E-06 | global batch size:    16 | lm loss: 7.548573E+00 | grad norm: 1.044 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      941/  128728 | consumed samples:        15056 | consumed tokens:     30834688 | elapsed time per iteration (s): 15.15 | learning rate: 4.934E-06 | global batch size:    16 | lm loss: 7.376065E+00 | grad norm: 1.133 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration      942/  128728 | consumed samples:        15072 | consumed tokens:     30867456 | elapsed time per iteration (s): 15.26 | learning rate: 4.939E-06 | global batch size:    16 | lm loss: 7.403994E+00 | grad norm: 0.935 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      943/  128728 | consumed samples:        15088 | consumed tokens:     30900224 | elapsed time per iteration (s): 15.19 | learning rate: 4.944E-06 | global batch size:    16 | lm loss: 7.430916E+00 | grad norm: 1.234 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      944/  128728 | consumed samples:        15104 | consumed tokens:     30932992 | elapsed time per iteration (s): 15.23 | learning rate: 4.949E-06 | global batch size:    16 | lm loss: 7.367596E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      945/  128728 | consumed samples:        15120 | consumed tokens:     30965760 | elapsed time per iteration (s): 15.22 | learning rate: 4.955E-06 | global batch size:    16 | lm loss: 7.401025E+00 | grad norm: 1.141 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      946/  128728 | consumed samples:        15136 | consumed tokens:     30998528 | elapsed time per iteration (s): 15.22 | learning rate: 4.960E-06 | global batch size:    16 | lm loss: 7.536839E+00 | grad norm: 1.355 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      947/  128728 | consumed samples:        15152 | consumed tokens:     31031296 | elapsed time per iteration (s): 15.25 | learning rate: 4.965E-06 | global batch size:    16 | lm loss: 7.108221E+00 | grad norm: 1.170 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      948/  128728 | consumed samples:        15168 | consumed tokens:     31064064 | elapsed time per iteration (s): 15.23 | learning rate: 4.970E-06 | global batch size:    16 | lm loss: 7.302841E+00 | grad norm: 1.602 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      949/  128728 | consumed samples:        15184 | consumed tokens:     31096832 | elapsed time per iteration (s): 15.27 | learning rate: 4.976E-06 | global batch size:    16 | lm loss: 7.204376E+00 | grad norm: 1.067 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      950/  128728 | consumed samples:        15200 | consumed tokens:     31129600 | elapsed time per iteration (s): 15.22 | learning rate: 4.981E-06 | global batch size:    16 | lm loss: 7.323405E+00 | grad norm: 1.476 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      951/  128728 | consumed samples:        15216 | consumed tokens:     31162368 | elapsed time per iteration (s): 15.27 | learning rate: 4.986E-06 | global batch size:    16 | lm loss: 7.413459E+00 | grad norm: 1.044 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      952/  128728 | consumed samples:        15232 | consumed tokens:     31195136 | elapsed time per iteration (s): 15.21 | learning rate: 4.991E-06 | global batch size:    16 | lm loss: 7.621178E+00 | grad norm: 1.750 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      953/  128728 | consumed samples:        15248 | consumed tokens:     31227904 | elapsed time per iteration (s): 15.20 | learning rate: 4.996E-06 | global batch size:    16 | lm loss: 7.608077E+00 | grad norm: 1.428 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      954/  128728 | consumed samples:        15264 | consumed tokens:     31260672 | elapsed time per iteration (s): 15.26 | learning rate: 5.002E-06 | global batch size:    16 | lm loss: 7.327553E+00 | grad norm: 1.315 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      955/  128728 | consumed samples:        15280 | consumed tokens:     31293440 | elapsed time per iteration (s): 15.17 | learning rate: 5.007E-06 | global batch size:    16 | lm loss: 7.498928E+00 | grad norm: 1.282 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      956/  128728 | consumed samples:        15296 | consumed tokens:     31326208 | elapsed time per iteration (s): 15.25 | learning rate: 5.012E-06 | global batch size:    16 | lm loss: 7.481583E+00 | grad norm: 1.408 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      957/  128728 | consumed samples:        15312 | consumed tokens:     31358976 | elapsed time per iteration (s): 15.19 | learning rate: 5.017E-06 | global batch size:    16 | lm loss: 7.372598E+00 | grad norm: 1.637 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      958/  128728 | consumed samples:        15328 | consumed tokens:     31391744 | elapsed time per iteration (s): 15.18 | learning rate: 5.023E-06 | global batch size:    16 | lm loss: 7.266788E+00 | grad norm: 0.984 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      959/  128728 | consumed samples:        15344 | consumed tokens:     31424512 | elapsed time per iteration (s): 15.21 | learning rate: 5.028E-06 | global batch size:    16 | lm loss: 7.610543E+00 | grad norm: 1.214 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      960/  128728 | consumed samples:        15360 | consumed tokens:     31457280 | elapsed time per iteration (s): 15.19 | learning rate: 5.033E-06 | global batch size:    16 | lm loss: 7.411926E+00 | grad norm: 1.393 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      961/  128728 | consumed samples:        15376 | consumed tokens:     31490048 | elapsed time per iteration (s): 15.17 | learning rate: 5.038E-06 | global batch size:    16 | lm loss: 7.298542E+00 | grad norm: 1.096 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      962/  128728 | consumed samples:        15392 | consumed tokens:     31522816 | elapsed time per iteration (s): 15.24 | learning rate: 5.044E-06 | global batch size:    16 | lm loss: 7.530574E+00 | grad norm: 1.634 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      963/  128728 | consumed samples:        15408 | consumed tokens:     31555584 | elapsed time per iteration (s): 15.25 | learning rate: 5.049E-06 | global batch size:    16 | lm loss: 7.191813E+00 | grad norm: 1.394 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      964/  128728 | consumed samples:        15424 | consumed tokens:     31588352 | elapsed time per iteration (s): 15.29 | learning rate: 5.054E-06 | global batch size:    16 | lm loss: 7.466516E+00 | grad norm: 1.555 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration      965/  128728 | consumed samples:        15440 | consumed tokens:     31621120 | elapsed time per iteration (s): 15.23 | learning rate: 5.059E-06 | global batch size:    16 | lm loss: 7.481571E+00 | grad norm: 1.539 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      966/  128728 | consumed samples:        15456 | consumed tokens:     31653888 | elapsed time per iteration (s): 15.25 | learning rate: 5.065E-06 | global batch size:    16 | lm loss: 7.445633E+00 | grad norm: 1.134 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      967/  128728 | consumed samples:        15472 | consumed tokens:     31686656 | elapsed time per iteration (s): 15.26 | learning rate: 5.070E-06 | global batch size:    16 | lm loss: 7.634816E+00 | grad norm: 1.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      968/  128728 | consumed samples:        15488 | consumed tokens:     31719424 | elapsed time per iteration (s): 15.26 | learning rate: 5.075E-06 | global batch size:    16 | lm loss: 7.474030E+00 | grad norm: 1.973 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      969/  128728 | consumed samples:        15504 | consumed tokens:     31752192 | elapsed time per iteration (s): 15.25 | learning rate: 5.080E-06 | global batch size:    16 | lm loss: 7.217330E+00 | grad norm: 1.419 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      970/  128728 | consumed samples:        15520 | consumed tokens:     31784960 | elapsed time per iteration (s): 15.25 | learning rate: 5.086E-06 | global batch size:    16 | lm loss: 7.412174E+00 | grad norm: 1.102 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      971/  128728 | consumed samples:        15536 | consumed tokens:     31817728 | elapsed time per iteration (s): 15.19 | learning rate: 5.091E-06 | global batch size:    16 | lm loss: 7.506372E+00 | grad norm: 1.399 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      972/  128728 | consumed samples:        15552 | consumed tokens:     31850496 | elapsed time per iteration (s): 15.17 | learning rate: 5.096E-06 | global batch size:    16 | lm loss: 7.401738E+00 | grad norm: 1.298 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      973/  128728 | consumed samples:        15568 | consumed tokens:     31883264 | elapsed time per iteration (s): 15.16 | learning rate: 5.101E-06 | global batch size:    16 | lm loss: 7.248646E+00 | grad norm: 1.126 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      974/  128728 | consumed samples:        15584 | consumed tokens:     31916032 | elapsed time per iteration (s): 15.25 | learning rate: 5.107E-06 | global batch size:    16 | lm loss: 7.523051E+00 | grad norm: 1.707 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      975/  128728 | consumed samples:        15600 | consumed tokens:     31948800 | elapsed time per iteration (s): 15.21 | learning rate: 5.112E-06 | global batch size:    16 | lm loss: 7.623046E+00 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration      976/  128728 | consumed samples:        15616 | consumed tokens:     31981568 | elapsed time per iteration (s): 15.19 | learning rate: 5.117E-06 | global batch size:    16 | lm loss: 7.583755E+00 | grad norm: 1.307 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration      977/  128728 | consumed samples:        15632 | consumed tokens:     32014336 | elapsed time per iteration (s): 15.26 | learning rate: 5.122E-06 | global batch size:    16 | lm loss: 7.316653E+00 | grad norm: 1.253 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      978/  128728 | consumed samples:        15648 | consumed tokens:     32047104 | elapsed time per iteration (s): 15.26 | learning rate: 5.128E-06 | global batch size:    16 | lm loss: 7.298987E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration      979/  128728 | consumed samples:        15664 | consumed tokens:     32079872 | elapsed time per iteration (s): 15.25 | learning rate: 5.133E-06 | global batch size:    16 | lm loss: 7.467144E+00 | grad norm: 1.544 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      980/  128728 | consumed samples:        15680 | consumed tokens:     32112640 | elapsed time per iteration (s): 15.22 | learning rate: 5.138E-06 | global batch size:    16 | lm loss: 7.399050E+00 | grad norm: 1.285 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      981/  128728 | consumed samples:        15696 | consumed tokens:     32145408 | elapsed time per iteration (s): 15.23 | learning rate: 5.143E-06 | global batch size:    16 | lm loss: 7.307127E+00 | grad norm: 1.364 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      982/  128728 | consumed samples:        15712 | consumed tokens:     32178176 | elapsed time per iteration (s): 15.21 | learning rate: 5.149E-06 | global batch size:    16 | lm loss: 7.372665E+00 | grad norm: 1.068 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      983/  128728 | consumed samples:        15728 | consumed tokens:     32210944 | elapsed time per iteration (s): 15.17 | learning rate: 5.154E-06 | global batch size:    16 | lm loss: 7.395346E+00 | grad norm: 1.203 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration      984/  128728 | consumed samples:        15744 | consumed tokens:     32243712 | elapsed time per iteration (s): 15.26 | learning rate: 5.159E-06 | global batch size:    16 | lm loss: 7.418610E+00 | grad norm: 1.037 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      985/  128728 | consumed samples:        15760 | consumed tokens:     32276480 | elapsed time per iteration (s): 15.22 | learning rate: 5.164E-06 | global batch size:    16 | lm loss: 7.631675E+00 | grad norm: 1.184 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      986/  128728 | consumed samples:        15776 | consumed tokens:     32309248 | elapsed time per iteration (s): 15.24 | learning rate: 5.169E-06 | global batch size:    16 | lm loss: 7.382019E+00 | grad norm: 1.287 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      987/  128728 | consumed samples:        15792 | consumed tokens:     32342016 | elapsed time per iteration (s): 15.25 | learning rate: 5.175E-06 | global batch size:    16 | lm loss: 7.357999E+00 | grad norm: 1.141 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration      988/  128728 | consumed samples:        15808 | consumed tokens:     32374784 | elapsed time per iteration (s): 15.18 | learning rate: 5.180E-06 | global batch size:    16 | lm loss: 7.538756E+00 | grad norm: 1.027 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration      989/  128728 | consumed samples:        15824 | consumed tokens:     32407552 | elapsed time per iteration (s): 15.23 | learning rate: 5.185E-06 | global batch size:    16 | lm loss: 7.230034E+00 | grad norm: 1.265 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      990/  128728 | consumed samples:        15840 | consumed tokens:     32440320 | elapsed time per iteration (s): 15.23 | learning rate: 5.190E-06 | global batch size:    16 | lm loss: 7.380984E+00 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration      991/  128728 | consumed samples:        15856 | consumed tokens:     32473088 | elapsed time per iteration (s): 15.23 | learning rate: 5.196E-06 | global batch size:    16 | lm loss: 7.412922E+00 | grad norm: 1.216 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      992/  128728 | consumed samples:        15872 | consumed tokens:     32505856 | elapsed time per iteration (s): 15.23 | learning rate: 5.201E-06 | global batch size:    16 | lm loss: 7.293040E+00 | grad norm: 1.339 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration      993/  128728 | consumed samples:        15888 | consumed tokens:     32538624 | elapsed time per iteration (s): 15.14 | learning rate: 5.206E-06 | global batch size:    16 | lm loss: 7.172251E+00 | grad norm: 1.524 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration      994/  128728 | consumed samples:        15904 | consumed tokens:     32571392 | elapsed time per iteration (s): 15.27 | learning rate: 5.211E-06 | global batch size:    16 | lm loss: 7.383713E+00 | grad norm: 1.159 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration      995/  128728 | consumed samples:        15920 | consumed tokens:     32604160 | elapsed time per iteration (s): 15.22 | learning rate: 5.217E-06 | global batch size:    16 | lm loss: 7.343609E+00 | grad norm: 0.969 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      996/  128728 | consumed samples:        15936 | consumed tokens:     32636928 | elapsed time per iteration (s): 15.21 | learning rate: 5.222E-06 | global batch size:    16 | lm loss: 7.478510E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration      997/  128728 | consumed samples:        15952 | consumed tokens:     32669696 | elapsed time per iteration (s): 15.28 | learning rate: 5.227E-06 | global batch size:    16 | lm loss: 7.494905E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration      998/  128728 | consumed samples:        15968 | consumed tokens:     32702464 | elapsed time per iteration (s): 15.22 | learning rate: 5.232E-06 | global batch size:    16 | lm loss: 7.248654E+00 | grad norm: 1.117 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration      999/  128728 | consumed samples:        15984 | consumed tokens:     32735232 | elapsed time per iteration (s): 15.25 | learning rate: 5.238E-06 | global batch size:    16 | lm loss: 7.334100E+00 | grad norm: 1.284 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1000/  128728 | consumed samples:        16000 | consumed tokens:     32768000 | elapsed time per iteration (s): 15.23 | learning rate: 5.243E-06 | global batch size:    16 | lm loss: 7.241666E+00 | grad norm: 1.150 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]:------------------------------------------------------------------------------------------
[default7]:valid loss at iteration 1000 | lm loss value: 7.702314E+00 | lm loss PPL: 2.213464E+03 | 
[default7]:------------------------------------------------------------------------------------------
[default0]:saving checkpoint at iteration    1000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:[2022-03-03 10:08:59,597] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/mp_rank_00_model_states.pt
[default1]:[2022-03-03 10:08:59,711] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/mp_rank_01_model_states.pt
[default4]:[2022-03-03 10:09:11,547] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 10:09:11,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default0]:[2022-03-03 10:09:11,926] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 10:09:11,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt
[default1]:[2022-03-03 10:09:12,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt
[default4]:[2022-03-03 10:09:12,253] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt
[default5]:[2022-03-03 10:09:12,319] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt
[default6]:[2022-03-03 10:09:12,289] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt
[default2]:[2022-03-03 10:09:12,265] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt
[default5]:[2022-03-03 10:09:12,620] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 10:09:12,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default5]:[2022-03-03 10:09:12,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default6]:[2022-03-03 10:09:12,691] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default2]:[2022-03-03 10:09:12,815] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 10:09:12,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt
[default1]:[2022-03-03 10:09:12,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 10:09:12,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default7]:[2022-03-03 10:09:12,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt
[default0]:[2022-03-03 10:09:13,012] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default4]:[2022-03-03 10:09:12,962] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt
[default6]:[2022-03-03 10:09:13,085] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default7]:[2022-03-03 10:09:13,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt
[default2]:[2022-03-03 10:09:13,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default0]:[2022-03-03 10:09:13,332] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt
[default6]:[2022-03-03 10:09:13,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 10:09:13,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 10:09:13,415] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt
[default0]:[2022-03-03 10:09:13,450] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt
[default4]:[2022-03-03 10:09:13,550] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 10:09:13,610] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 10:09:13,558] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 10:09:13,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt
[default5]:[2022-03-03 10:09:13,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default1]:[2022-03-03 10:09:13,663] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt
[default5]:[2022-03-03 10:09:13,637] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt
[default3]:[2022-03-03 10:09:13,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt
[default7]:[2022-03-03 10:09:13,800] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default1]:[2022-03-03 10:09:13,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 10:09:13,916] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 10:09:13,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt
[default2]:[2022-03-03 10:09:14,004] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default5]:[2022-03-03 10:09:14,025] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt
[default3]:[2022-03-03 10:09:14,078] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default5]:[2022-03-03 10:09:14,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 10:09:14,242] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default1]:[2022-03-03 10:09:14,270] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 10:09:14,591] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 10:09:14,636] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 10:09:14,663] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt
[default6]:[2022-03-03 10:09:14,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 10:09:14,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 10:09:14,879] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt
[default4]:[2022-03-03 10:09:14,941] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt
[default0]:[2022-03-03 10:09:14,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 10:09:15,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt
[default1]:[2022-03-03 10:09:15,042] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default5]:[2022-03-03 10:09:15,204] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt
[default2]:[2022-03-03 10:09:15,234] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt
[default7]:[2022-03-03 10:09:15,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt
[default5]:[2022-03-03 10:09:15,269] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt
[default2]:[2022-03-03 10:09:15,367] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 10:09:15,492] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 10:09:15,517] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt
[default2]:[2022-03-03 10:09:15,601] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt
[default3]:[2022-03-03 10:09:15,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt
[default1]:[2022-03-03 10:09:15,629] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 10:09:15,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt
[default2]:[2022-03-03 10:09:15,625] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt
[default4]:[2022-03-03 10:09:15,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt
[default0]:[2022-03-03 10:09:15,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt
[default5]:[2022-03-03 10:09:15,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 10:09:15,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 10:09:15,728] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt
[default6]:[2022-03-03 10:09:15,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt
[default6]:[2022-03-03 10:09:15,976] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt
[default4]:[2022-03-03 10:09:16,005] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt
[default3]:[2022-03-03 10:09:16,009] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 10:09:16,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt
[default6]:[2022-03-03 10:09:16,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 10:09:16,094] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt
[default5]:[2022-03-03 10:09:16,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 10:09:16,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt
[default6]:[2022-03-03 10:09:16,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 10:09:16,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt
[default3]:[2022-03-03 10:09:16,292] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default6]:[2022-03-03 10:09:16,334] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 10:09:16,405] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt
[default4]:[2022-03-03 10:09:16,348] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt
[default4]:[2022-03-03 10:09:16,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt
[default6]:[2022-03-03 10:09:16,508] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt
[default5]:[2022-03-03 10:09:16,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 10:09:16,539] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 10:09:16,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default2]:[2022-03-03 10:09:16,594] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 10:09:16,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt
[default5]:[2022-03-03 10:09:16,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt
[default3]:[2022-03-03 10:09:16,743] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt
[default6]:[2022-03-03 10:09:16,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 10:09:16,897] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 10:09:17,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt
[default7]:[2022-03-03 10:09:17,102] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt
[default0]:[2022-03-03 10:09:17,105] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default4]:[2022-03-03 10:09:17,064] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default6]:[2022-03-03 10:09:17,141] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt
[default3]:[2022-03-03 10:09:17,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt
[default0]:[2022-03-03 10:09:17,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 10:09:17,317] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default7]:[2022-03-03 10:09:17,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 10:09:17,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default1]:[2022-03-03 10:09:17,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 10:09:17,398] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt
[default5]:[2022-03-03 10:09:17,448] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 10:09:17,484] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt
[default1]:[2022-03-03 10:09:17,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt
[default0]:[2022-03-03 10:09:17,555] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default0]:[2022-03-03 10:09:17,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt
[default7]:[2022-03-03 10:09:17,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt
[default7]:[2022-03-03 10:09:17,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 10:09:17,677] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default4]:[2022-03-03 10:09:17,618] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 10:09:17,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt
[default4]:[2022-03-03 10:09:17,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt
[default4]:[2022-03-03 10:09:17,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 10:09:17,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt
[default1]:[2022-03-03 10:09:17,828] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default6]:[2022-03-03 10:09:17,803] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt
[default0]:[2022-03-03 10:09:17,796] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt
[default2]:[2022-03-03 10:09:17,797] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default5]:[2022-03-03 10:09:17,887] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt
[default6]:[2022-03-03 10:09:17,894] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 10:09:17,913] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default4]:[2022-03-03 10:09:17,843] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
[default0]:[2022-03-03 10:09:17,902] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt
[default1]:[2022-03-03 10:09:17,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 10:09:17,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
[default1]:[2022-03-03 10:09:17,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 10:09:18,046] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt
[default4]:[2022-03-03 10:09:18,251] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 10:09:18,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default0]:[2022-03-03 10:09:18,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default5]:[2022-03-03 10:09:18,280] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt
[default3]:[2022-03-03 10:09:18,397] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default2]:[2022-03-03 10:09:18,337] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default0]:[2022-03-03 10:09:18,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt
[default7]:[2022-03-03 10:09:18,371] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt
[default6]:[2022-03-03 10:09:18,442] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt
[default6]:[2022-03-03 10:09:18,411] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 10:09:18,389] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default1]:[2022-03-03 10:09:18,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt
[default7]:[2022-03-03 10:09:18,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default3]:[2022-03-03 10:09:18,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt
[default0]:[2022-03-03 10:09:18,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 10:09:18,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 10:09:18,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default6]:[2022-03-03 10:09:18,500] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default2]:[2022-03-03 10:09:18,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default6]:[2022-03-03 10:09:18,505] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default1]:[2022-03-03 10:09:18,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 10:09:18,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default7]:[2022-03-03 10:09:18,609] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 10:09:18,676] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt
[default5]:[2022-03-03 10:09:18,622] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default0]:[2022-03-03 10:09:18,651] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt
[default2]:[2022-03-03 10:09:18,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt
[default2]:[2022-03-03 10:09:18,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 10:09:18,697] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 10:09:18,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default5]:[2022-03-03 10:09:18,771] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 10:09:18,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt
[default5]:[2022-03-03 10:09:18,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 10:09:18,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default0]:[2022-03-03 10:09:18,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default3]:[2022-03-03 10:09:18,860] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default2]:[2022-03-03 10:09:18,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 10:09:19,029] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default1]:[2022-03-03 10:09:18,996] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt
[default5]:[2022-03-03 10:09:18,998] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default6]:[2022-03-03 10:09:19,047] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default2]:[2022-03-03 10:09:19,086] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt
[default6]:[2022-03-03 10:09:19,076] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt
[default1]:[2022-03-03 10:09:19,098] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default0]:[2022-03-03 10:09:19,066] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 10:09:19,105] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt
[default5]:[2022-03-03 10:09:19,133] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt
[default4]:[2022-03-03 10:09:19,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default7]:[2022-03-03 10:09:19,171] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt
[default5]:[2022-03-03 10:09:19,231] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default2]:[2022-03-03 10:09:19,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt
[default7]:[2022-03-03 10:09:19,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default1]:[2022-03-03 10:09:19,340] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 10:09:19,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt
[default1]:[2022-03-03 10:09:19,253] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt
[default6]:[2022-03-03 10:09:19,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default1]:[2022-03-03 10:09:19,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 10:09:19,470] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt
[default4]:[2022-03-03 10:09:19,436] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default1]:[2022-03-03 10:09:19,421] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default2]:[2022-03-03 10:09:19,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt
[default0]:[2022-03-03 10:09:19,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
[default6]:[2022-03-03 10:09:19,548] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 10:09:19,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 10:09:19,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt
[default3]:[2022-03-03 10:09:19,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt
[default3]:[2022-03-03 10:09:19,635] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt
[default0]:[2022-03-03 10:09:19,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 10:09:19,682] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 10:09:19,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default6]:[2022-03-03 10:09:19,680] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default0]:[2022-03-03 10:09:19,754] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default0]:[2022-03-03 10:09:19,734] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 10:09:19,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 10:09:19,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 10:09:19,743] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default1]:[2022-03-03 10:09:19,799] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 10:09:19,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default0]:[2022-03-03 10:09:19,824] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default0]:[2022-03-03 10:09:20,058] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 10:09:20,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt
[default1]:[2022-03-03 10:09:20,101] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 10:09:20,117] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt
[default3]:[2022-03-03 10:09:20,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt
[default4]:[2022-03-03 10:09:20,152] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 10:09:20,230] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default7]:[2022-03-03 10:09:20,240] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default5]:[2022-03-03 10:09:20,269] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 10:09:20,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default7]:[2022-03-03 10:09:20,355] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 10:09:20,302] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default5]:[2022-03-03 10:09:20,269] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt
[default0]:[2022-03-03 10:09:20,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 10:09:20,566] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default5]:[2022-03-03 10:09:20,623] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 10:09:20,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt
[default3]:[2022-03-03 10:09:20,811] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt
[default4]:[2022-03-03 10:09:20,811] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 10:09:20,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default6]:[2022-03-03 10:09:20,887] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt
[default2]:[2022-03-03 10:09:20,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default4]:[2022-03-03 10:09:20,818] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt
[default6]:[2022-03-03 10:09:20,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default0]:[2022-03-03 10:09:20,920] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 10:09:20,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 10:09:20,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 10:09:20,935] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default2]:[2022-03-03 10:09:21,024] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 10:09:21,170] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default7]:[2022-03-03 10:09:21,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default5]:[2022-03-03 10:09:21,145] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 10:09:21,214] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 10:09:21,212] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt
[default7]:[2022-03-03 10:09:21,251] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt
[default0]:[2022-03-03 10:09:21,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 10:09:21,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 10:09:21,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt
[default7]:[2022-03-03 10:09:21,398] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 10:09:21,317] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt
[default4]:[2022-03-03 10:09:21,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 10:09:21,534] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 10:09:21,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default3]:[2022-03-03 10:09:21,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt
[default5]:[2022-03-03 10:09:21,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default1]:[2022-03-03 10:09:21,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt
[default5]:[2022-03-03 10:09:21,554] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt
[default6]:[2022-03-03 10:09:21,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default2]:[2022-03-03 10:09:21,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 10:09:21,635] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt
[default1]:[2022-03-03 10:09:21,662] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt
[default2]:[2022-03-03 10:09:21,663] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default6]:[2022-03-03 10:09:21,625] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default2]:[2022-03-03 10:09:21,655] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt
[default5]:[2022-03-03 10:09:21,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 10:09:21,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt
[default7]:[2022-03-03 10:09:21,707] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default4]:[2022-03-03 10:09:21,699] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default6]:[2022-03-03 10:09:21,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default3]:[2022-03-03 10:09:21,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 10:09:21,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 10:09:21,800] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default4]:[2022-03-03 10:09:21,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 10:09:21,860] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt
[default2]:[2022-03-03 10:09:21,986] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt
[default7]:[2022-03-03 10:09:22,015] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default1]:[2022-03-03 10:09:22,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt
[default0]:[2022-03-03 10:09:22,016] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt
[default3]:[2022-03-03 10:09:21,988] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default2]:[2022-03-03 10:09:21,993] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt
[default1]:[2022-03-03 10:09:22,023] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt
[default6]:[2022-03-03 10:09:22,045] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default4]:[2022-03-03 10:09:22,134] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default5]:[2022-03-03 10:09:22,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 10:09:22,234] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
[default5]:[2022-03-03 10:09:22,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt
[default0]:[2022-03-03 10:09:22,238] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt
[default5]:[2022-03-03 10:09:22,309] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default3]:[2022-03-03 10:09:22,251] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default3]:[2022-03-03 10:09:22,287] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 10:09:22,260] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 10:09:22,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt
[default1]:[2022-03-03 10:09:22,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default6]:[2022-03-03 10:09:22,566] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 10:09:22,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt
[default0]:[2022-03-03 10:09:22,576] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt
[default0]:[2022-03-03 10:09:22,523] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 10:09:22,574] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt
[default0]:[2022-03-03 10:09:22,627] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default7]:[2022-03-03 10:09:22,694] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default6]:[2022-03-03 10:09:22,660] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 10:09:22,896] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default3]:[2022-03-03 10:09:22,921] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 10:09:22,964] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default3]:[2022-03-03 10:09:22,990] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 10:09:23,028] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt
[default3]:[2022-03-03 10:09:23,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt
[default1]:[2022-03-03 10:09:23,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 10:09:23,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 10:09:23,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default2]:[2022-03-03 10:09:23,109] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default0]:[2022-03-03 10:09:23,167] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default3]:[2022-03-03 10:09:23,247] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default2]:[2022-03-03 10:09:23,256] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default1]:[2022-03-03 10:09:23,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default5]:[2022-03-03 10:09:23,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt
[default3]:[2022-03-03 10:09:23,569] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 10:09:23,575] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 10:09:23,607] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt
[default3]:[2022-03-03 10:09:23,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt
[default1]:[2022-03-03 10:09:23,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 10:09:23,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default2]:[2022-03-03 10:09:23,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt
[default0]:[2022-03-03 10:09:23,669] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt
[default1]:[2022-03-03 10:09:23,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default2]:[2022-03-03 10:09:23,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt
[default0]:[2022-03-03 10:09:23,746] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 10:09:23,782] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default3]:[2022-03-03 10:09:23,782] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 10:09:23,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default7]:[2022-03-03 10:09:23,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 10:09:24,052] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default6]:[2022-03-03 10:09:24,056] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 10:09:24,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 10:09:24,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default1]:[2022-03-03 10:09:24,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 10:09:24,354] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt
[default3]:[2022-03-03 10:09:24,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 10:09:24,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 10:09:24,787] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 10:09:24,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt
[default5]:[2022-03-03 10:09:24,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 10:09:24,862] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt
[default2]:[2022-03-03 10:09:24,970] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default7]:[2022-03-03 10:09:25,068] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 10:09:25,020] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default6]:[2022-03-03 10:09:25,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 10:09:25,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default6]:[2022-03-03 10:09:25,158] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 10:09:25,164] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 10:09:25,259] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default3]:[2022-03-03 10:09:25,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default4]:[2022-03-03 10:09:25,488] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 10:09:25,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default4]:[2022-03-03 10:09:25,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default4]:[2022-03-03 10:09:25,492] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 10:09:25,529] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default5]:[2022-03-03 10:09:25,538] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default7]:[2022-03-03 10:09:25,574] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default3]:[2022-03-03 10:09:25,551] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 10:09:25,640] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default1]:[2022-03-03 10:09:25,777] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default1]:[2022-03-03 10:09:26,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 10:09:26,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default5]:[2022-03-03 10:09:26,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 10:09:26,338] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default4]:[2022-03-03 10:09:26,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt
[default7]:[2022-03-03 10:09:26,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 10:09:26,497] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt
[default6]:[2022-03-03 10:09:26,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 10:09:26,896] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default1]:[2022-03-03 10:09:26,901] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 10:09:26,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default0]:[2022-03-03 10:09:27,557] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default6]:[2022-03-03 10:09:27,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt
[default7]:[2022-03-03 10:09:27,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default1]:[2022-03-03 10:09:28,279] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt
[default0]:[2022-03-03 10:09:28,249] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt
[default7]:time (ms) | save-checkpoint: 35998.51
[default0]:  successfully saved checkpoint at iteration    1000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default7]: iteration     1001/  128728 | consumed samples:        16016 | consumed tokens:     32800768 | elapsed time per iteration (s): 70.83 | learning rate: 5.248E-06 | global batch size:    16 | lm loss: 7.257627E+00 | grad norm: 1.325 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.226 | TFLOPs: 1.73 |
[default7]: iteration     1002/  128728 | consumed samples:        16032 | consumed tokens:     32833536 | elapsed time per iteration (s): 15.28 | learning rate: 5.253E-06 | global batch size:    16 | lm loss: 7.265201E+00 | grad norm: 1.279 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1003/  128728 | consumed samples:        16048 | consumed tokens:     32866304 | elapsed time per iteration (s): 15.25 | learning rate: 5.259E-06 | global batch size:    16 | lm loss: 7.525159E+00 | grad norm: 1.773 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1004/  128728 | consumed samples:        16064 | consumed tokens:     32899072 | elapsed time per iteration (s): 15.25 | learning rate: 5.264E-06 | global batch size:    16 | lm loss: 7.367915E+00 | grad norm: 1.173 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1005/  128728 | consumed samples:        16080 | consumed tokens:     32931840 | elapsed time per iteration (s): 15.26 | learning rate: 5.269E-06 | global batch size:    16 | lm loss: 7.435073E+00 | grad norm: 1.129 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1006/  128728 | consumed samples:        16096 | consumed tokens:     32964608 | elapsed time per iteration (s): 15.24 | learning rate: 5.274E-06 | global batch size:    16 | lm loss: 7.265368E+00 | grad norm: 1.547 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1007/  128728 | consumed samples:        16112 | consumed tokens:     32997376 | elapsed time per iteration (s): 15.24 | learning rate: 5.280E-06 | global batch size:    16 | lm loss: 7.300901E+00 | grad norm: 0.991 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1008/  128728 | consumed samples:        16128 | consumed tokens:     33030144 | elapsed time per iteration (s): 15.24 | learning rate: 5.285E-06 | global batch size:    16 | lm loss: 7.472819E+00 | grad norm: 1.289 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1009/  128728 | consumed samples:        16144 | consumed tokens:     33062912 | elapsed time per iteration (s): 15.25 | learning rate: 5.290E-06 | global batch size:    16 | lm loss: 7.227314E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1010/  128728 | consumed samples:        16160 | consumed tokens:     33095680 | elapsed time per iteration (s): 15.22 | learning rate: 5.295E-06 | global batch size:    16 | lm loss: 7.344738E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1011/  128728 | consumed samples:        16176 | consumed tokens:     33128448 | elapsed time per iteration (s): 15.26 | learning rate: 5.301E-06 | global batch size:    16 | lm loss: 7.324342E+00 | grad norm: 1.128 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1012/  128728 | consumed samples:        16192 | consumed tokens:     33161216 | elapsed time per iteration (s): 15.24 | learning rate: 5.306E-06 | global batch size:    16 | lm loss: 7.071029E+00 | grad norm: 1.354 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1013/  128728 | consumed samples:        16208 | consumed tokens:     33193984 | elapsed time per iteration (s): 15.25 | learning rate: 5.311E-06 | global batch size:    16 | lm loss: 7.107207E+00 | grad norm: 0.975 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1014/  128728 | consumed samples:        16224 | consumed tokens:     33226752 | elapsed time per iteration (s): 15.26 | learning rate: 5.316E-06 | global batch size:    16 | lm loss: 7.222437E+00 | grad norm: 1.220 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1015/  128728 | consumed samples:        16240 | consumed tokens:     33259520 | elapsed time per iteration (s): 15.25 | learning rate: 5.322E-06 | global batch size:    16 | lm loss: 7.451645E+00 | grad norm: 2.238 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1016/  128728 | consumed samples:        16256 | consumed tokens:     33292288 | elapsed time per iteration (s): 15.19 | learning rate: 5.327E-06 | global batch size:    16 | lm loss: 7.183714E+00 | grad norm: 1.511 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1017/  128728 | consumed samples:        16272 | consumed tokens:     33325056 | elapsed time per iteration (s): 15.26 | learning rate: 5.332E-06 | global batch size:    16 | lm loss: 7.206068E+00 | grad norm: 1.397 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1018/  128728 | consumed samples:        16288 | consumed tokens:     33357824 | elapsed time per iteration (s): 15.25 | learning rate: 5.337E-06 | global batch size:    16 | lm loss: 7.339333E+00 | grad norm: 1.075 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1019/  128728 | consumed samples:        16304 | consumed tokens:     33390592 | elapsed time per iteration (s): 15.24 | learning rate: 5.343E-06 | global batch size:    16 | lm loss: 7.346642E+00 | grad norm: 1.131 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1020/  128728 | consumed samples:        16320 | consumed tokens:     33423360 | elapsed time per iteration (s): 15.26 | learning rate: 5.348E-06 | global batch size:    16 | lm loss: 7.557926E+00 | grad norm: 1.374 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1021/  128728 | consumed samples:        16336 | consumed tokens:     33456128 | elapsed time per iteration (s): 15.20 | learning rate: 5.353E-06 | global batch size:    16 | lm loss: 7.477837E+00 | grad norm: 1.650 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1022/  128728 | consumed samples:        16352 | consumed tokens:     33488896 | elapsed time per iteration (s): 15.21 | learning rate: 5.358E-06 | global batch size:    16 | lm loss: 7.073501E+00 | grad norm: 1.186 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1023/  128728 | consumed samples:        16368 | consumed tokens:     33521664 | elapsed time per iteration (s): 15.15 | learning rate: 5.363E-06 | global batch size:    16 | lm loss: 7.267119E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     1024/  128728 | consumed samples:        16384 | consumed tokens:     33554432 | elapsed time per iteration (s): 15.23 | learning rate: 5.369E-06 | global batch size:    16 | lm loss: 7.294874E+00 | grad norm: 1.535 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1025/  128728 | consumed samples:        16400 | consumed tokens:     33587200 | elapsed time per iteration (s): 15.25 | learning rate: 5.374E-06 | global batch size:    16 | lm loss: 7.133692E+00 | grad norm: 1.165 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1026/  128728 | consumed samples:        16416 | consumed tokens:     33619968 | elapsed time per iteration (s): 15.22 | learning rate: 5.379E-06 | global batch size:    16 | lm loss: 7.371020E+00 | grad norm: 1.414 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1027/  128728 | consumed samples:        16432 | consumed tokens:     33652736 | elapsed time per iteration (s): 15.24 | learning rate: 5.384E-06 | global batch size:    16 | lm loss: 7.288789E+00 | grad norm: 1.520 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1028/  128728 | consumed samples:        16448 | consumed tokens:     33685504 | elapsed time per iteration (s): 15.30 | learning rate: 5.390E-06 | global batch size:    16 | lm loss: 7.304897E+00 | grad norm: 1.235 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     1029/  128728 | consumed samples:        16464 | consumed tokens:     33718272 | elapsed time per iteration (s): 15.26 | learning rate: 5.395E-06 | global batch size:    16 | lm loss: 7.384569E+00 | grad norm: 1.106 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1030/  128728 | consumed samples:        16480 | consumed tokens:     33751040 | elapsed time per iteration (s): 15.26 | learning rate: 5.400E-06 | global batch size:    16 | lm loss: 7.309175E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1031/  128728 | consumed samples:        16496 | consumed tokens:     33783808 | elapsed time per iteration (s): 15.24 | learning rate: 5.405E-06 | global batch size:    16 | lm loss: 7.343480E+00 | grad norm: 1.259 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1032/  128728 | consumed samples:        16512 | consumed tokens:     33816576 | elapsed time per iteration (s): 15.24 | learning rate: 5.411E-06 | global batch size:    16 | lm loss: 7.319173E+00 | grad norm: 1.307 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1033/  128728 | consumed samples:        16528 | consumed tokens:     33849344 | elapsed time per iteration (s): 15.23 | learning rate: 5.416E-06 | global batch size:    16 | lm loss: 7.423133E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1034/  128728 | consumed samples:        16544 | consumed tokens:     33882112 | elapsed time per iteration (s): 15.21 | learning rate: 5.421E-06 | global batch size:    16 | lm loss: 7.386244E+00 | grad norm: 1.324 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1035/  128728 | consumed samples:        16560 | consumed tokens:     33914880 | elapsed time per iteration (s): 15.25 | learning rate: 5.426E-06 | global batch size:    16 | lm loss: 7.329965E+00 | grad norm: 1.246 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1036/  128728 | consumed samples:        16576 | consumed tokens:     33947648 | elapsed time per iteration (s): 15.26 | learning rate: 5.432E-06 | global batch size:    16 | lm loss: 7.282664E+00 | grad norm: 1.586 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1037/  128728 | consumed samples:        16592 | consumed tokens:     33980416 | elapsed time per iteration (s): 15.25 | learning rate: 5.437E-06 | global batch size:    16 | lm loss: 7.157454E+00 | grad norm: 1.644 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1038/  128728 | consumed samples:        16608 | consumed tokens:     34013184 | elapsed time per iteration (s): 15.23 | learning rate: 5.442E-06 | global batch size:    16 | lm loss: 7.269532E+00 | grad norm: 1.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1039/  128728 | consumed samples:        16624 | consumed tokens:     34045952 | elapsed time per iteration (s): 15.26 | learning rate: 5.447E-06 | global batch size:    16 | lm loss: 7.390067E+00 | grad norm: 1.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1040/  128728 | consumed samples:        16640 | consumed tokens:     34078720 | elapsed time per iteration (s): 15.20 | learning rate: 5.453E-06 | global batch size:    16 | lm loss: 7.319128E+00 | grad norm: 1.051 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1041/  128728 | consumed samples:        16656 | consumed tokens:     34111488 | elapsed time per iteration (s): 15.25 | learning rate: 5.458E-06 | global batch size:    16 | lm loss: 7.343173E+00 | grad norm: 1.067 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1042/  128728 | consumed samples:        16672 | consumed tokens:     34144256 | elapsed time per iteration (s): 15.23 | learning rate: 5.463E-06 | global batch size:    16 | lm loss: 7.418891E+00 | grad norm: 1.234 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1043/  128728 | consumed samples:        16688 | consumed tokens:     34177024 | elapsed time per iteration (s): 15.23 | learning rate: 5.468E-06 | global batch size:    16 | lm loss: 7.088163E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1044/  128728 | consumed samples:        16704 | consumed tokens:     34209792 | elapsed time per iteration (s): 15.28 | learning rate: 5.474E-06 | global batch size:    16 | lm loss: 7.283275E+00 | grad norm: 1.184 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1045/  128728 | consumed samples:        16720 | consumed tokens:     34242560 | elapsed time per iteration (s): 15.23 | learning rate: 5.479E-06 | global batch size:    16 | lm loss: 7.177429E+00 | grad norm: 1.163 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1046/  128728 | consumed samples:        16736 | consumed tokens:     34275328 | elapsed time per iteration (s): 15.24 | learning rate: 5.484E-06 | global batch size:    16 | lm loss: 7.403968E+00 | grad norm: 1.264 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1047/  128728 | consumed samples:        16752 | consumed tokens:     34308096 | elapsed time per iteration (s): 15.25 | learning rate: 5.489E-06 | global batch size:    16 | lm loss: 7.409142E+00 | grad norm: 1.580 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     1048/  128728 | consumed samples:        16768 | consumed tokens:     34340864 | elapsed time per iteration (s): 15.21 | learning rate: 5.495E-06 | global batch size:    16 | lm loss: 7.269386E+00 | grad norm: 1.087 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1049/  128728 | consumed samples:        16784 | consumed tokens:     34373632 | elapsed time per iteration (s): 15.17 | learning rate: 5.500E-06 | global batch size:    16 | lm loss: 7.443803E+00 | grad norm: 0.967 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1050/  128728 | consumed samples:        16800 | consumed tokens:     34406400 | elapsed time per iteration (s): 15.25 | learning rate: 5.505E-06 | global batch size:    16 | lm loss: 7.035776E+00 | grad norm: 1.203 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1051/  128728 | consumed samples:        16816 | consumed tokens:     34439168 | elapsed time per iteration (s): 15.28 | learning rate: 5.510E-06 | global batch size:    16 | lm loss: 7.198908E+00 | grad norm: 1.057 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1052/  128728 | consumed samples:        16832 | consumed tokens:     34471936 | elapsed time per iteration (s): 15.24 | learning rate: 5.516E-06 | global batch size:    16 | lm loss: 7.287247E+00 | grad norm: 1.212 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1053/  128728 | consumed samples:        16848 | consumed tokens:     34504704 | elapsed time per iteration (s): 15.22 | learning rate: 5.521E-06 | global batch size:    16 | lm loss: 7.180941E+00 | grad norm: 0.943 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1054/  128728 | consumed samples:        16864 | consumed tokens:     34537472 | elapsed time per iteration (s): 15.25 | learning rate: 5.526E-06 | global batch size:    16 | lm loss: 7.035480E+00 | grad norm: 1.118 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1055/  128728 | consumed samples:        16880 | consumed tokens:     34570240 | elapsed time per iteration (s): 15.25 | learning rate: 5.531E-06 | global batch size:    16 | lm loss: 7.411442E+00 | grad norm: 1.504 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1056/  128728 | consumed samples:        16896 | consumed tokens:     34603008 | elapsed time per iteration (s): 15.24 | learning rate: 5.536E-06 | global batch size:    16 | lm loss: 7.284391E+00 | grad norm: 1.081 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1057/  128728 | consumed samples:        16912 | consumed tokens:     34635776 | elapsed time per iteration (s): 15.24 | learning rate: 5.542E-06 | global batch size:    16 | lm loss: 7.234114E+00 | grad norm: 1.008 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1058/  128728 | consumed samples:        16928 | consumed tokens:     34668544 | elapsed time per iteration (s): 15.29 | learning rate: 5.547E-06 | global batch size:    16 | lm loss: 7.331013E+00 | grad norm: 1.068 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     1059/  128728 | consumed samples:        16944 | consumed tokens:     34701312 | elapsed time per iteration (s): 15.23 | learning rate: 5.552E-06 | global batch size:    16 | lm loss: 7.221325E+00 | grad norm: 1.413 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1060/  128728 | consumed samples:        16960 | consumed tokens:     34734080 | elapsed time per iteration (s): 15.23 | learning rate: 5.557E-06 | global batch size:    16 | lm loss: 7.175035E+00 | grad norm: 1.219 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1061/  128728 | consumed samples:        16976 | consumed tokens:     34766848 | elapsed time per iteration (s): 15.26 | learning rate: 5.563E-06 | global batch size:    16 | lm loss: 7.444801E+00 | grad norm: 1.268 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1062/  128728 | consumed samples:        16992 | consumed tokens:     34799616 | elapsed time per iteration (s): 15.26 | learning rate: 5.568E-06 | global batch size:    16 | lm loss: 7.480289E+00 | grad norm: 1.083 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1063/  128728 | consumed samples:        17008 | consumed tokens:     34832384 | elapsed time per iteration (s): 15.23 | learning rate: 5.573E-06 | global batch size:    16 | lm loss: 7.148155E+00 | grad norm: 0.926 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1064/  128728 | consumed samples:        17024 | consumed tokens:     34865152 | elapsed time per iteration (s): 15.14 | learning rate: 5.578E-06 | global batch size:    16 | lm loss: 7.344573E+00 | grad norm: 1.057 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     1065/  128728 | consumed samples:        17040 | consumed tokens:     34897920 | elapsed time per iteration (s): 15.25 | learning rate: 5.584E-06 | global batch size:    16 | lm loss: 7.196020E+00 | grad norm: 1.599 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1066/  128728 | consumed samples:        17056 | consumed tokens:     34930688 | elapsed time per iteration (s): 15.26 | learning rate: 5.589E-06 | global batch size:    16 | lm loss: 7.104638E+00 | grad norm: 1.020 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1067/  128728 | consumed samples:        17072 | consumed tokens:     34963456 | elapsed time per iteration (s): 15.24 | learning rate: 5.594E-06 | global batch size:    16 | lm loss: 7.402941E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1068/  128728 | consumed samples:        17088 | consumed tokens:     34996224 | elapsed time per iteration (s): 15.27 | learning rate: 5.599E-06 | global batch size:    16 | lm loss: 7.603527E+00 | grad norm: 1.426 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1069/  128728 | consumed samples:        17104 | consumed tokens:     35028992 | elapsed time per iteration (s): 15.24 | learning rate: 5.605E-06 | global batch size:    16 | lm loss: 7.494851E+00 | grad norm: 1.085 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1070/  128728 | consumed samples:        17120 | consumed tokens:     35061760 | elapsed time per iteration (s): 15.23 | learning rate: 5.610E-06 | global batch size:    16 | lm loss: 7.395302E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1071/  128728 | consumed samples:        17136 | consumed tokens:     35094528 | elapsed time per iteration (s): 15.29 | learning rate: 5.615E-06 | global batch size:    16 | lm loss: 7.198095E+00 | grad norm: 1.182 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration     1072/  128728 | consumed samples:        17152 | consumed tokens:     35127296 | elapsed time per iteration (s): 15.27 | learning rate: 5.620E-06 | global batch size:    16 | lm loss: 7.297481E+00 | grad norm: 1.358 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1073/  128728 | consumed samples:        17168 | consumed tokens:     35160064 | elapsed time per iteration (s): 15.27 | learning rate: 5.626E-06 | global batch size:    16 | lm loss: 7.169433E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1074/  128728 | consumed samples:        17184 | consumed tokens:     35192832 | elapsed time per iteration (s): 15.26 | learning rate: 5.631E-06 | global batch size:    16 | lm loss: 7.143753E+00 | grad norm: 1.150 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1075/  128728 | consumed samples:        17200 | consumed tokens:     35225600 | elapsed time per iteration (s): 15.24 | learning rate: 5.636E-06 | global batch size:    16 | lm loss: 7.086334E+00 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1076/  128728 | consumed samples:        17216 | consumed tokens:     35258368 | elapsed time per iteration (s): 15.28 | learning rate: 5.641E-06 | global batch size:    16 | lm loss: 7.248414E+00 | grad norm: 1.153 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration     1077/  128728 | consumed samples:        17232 | consumed tokens:     35291136 | elapsed time per iteration (s): 15.27 | learning rate: 5.647E-06 | global batch size:    16 | lm loss: 7.515269E+00 | grad norm: 2.106 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1078/  128728 | consumed samples:        17248 | consumed tokens:     35323904 | elapsed time per iteration (s): 15.25 | learning rate: 5.652E-06 | global batch size:    16 | lm loss: 7.372351E+00 | grad norm: 2.026 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1079/  128728 | consumed samples:        17264 | consumed tokens:     35356672 | elapsed time per iteration (s): 15.23 | learning rate: 5.657E-06 | global batch size:    16 | lm loss: 7.441353E+00 | grad norm: 2.469 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1080/  128728 | consumed samples:        17280 | consumed tokens:     35389440 | elapsed time per iteration (s): 15.25 | learning rate: 5.662E-06 | global batch size:    16 | lm loss: 7.178278E+00 | grad norm: 1.792 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1081/  128728 | consumed samples:        17296 | consumed tokens:     35422208 | elapsed time per iteration (s): 15.19 | learning rate: 5.668E-06 | global batch size:    16 | lm loss: 7.478823E+00 | grad norm: 1.421 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1082/  128728 | consumed samples:        17312 | consumed tokens:     35454976 | elapsed time per iteration (s): 15.26 | learning rate: 5.673E-06 | global batch size:    16 | lm loss: 7.295471E+00 | grad norm: 1.048 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1083/  128728 | consumed samples:        17328 | consumed tokens:     35487744 | elapsed time per iteration (s): 15.22 | learning rate: 5.678E-06 | global batch size:    16 | lm loss: 7.328071E+00 | grad norm: 1.106 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1084/  128728 | consumed samples:        17344 | consumed tokens:     35520512 | elapsed time per iteration (s): 15.24 | learning rate: 5.683E-06 | global batch size:    16 | lm loss: 7.163485E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1085/  128728 | consumed samples:        17360 | consumed tokens:     35553280 | elapsed time per iteration (s): 15.28 | learning rate: 5.689E-06 | global batch size:    16 | lm loss: 7.288455E+00 | grad norm: 0.989 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1086/  128728 | consumed samples:        17376 | consumed tokens:     35586048 | elapsed time per iteration (s): 15.24 | learning rate: 5.694E-06 | global batch size:    16 | lm loss: 7.212840E+00 | grad norm: 1.046 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1087/  128728 | consumed samples:        17392 | consumed tokens:     35618816 | elapsed time per iteration (s): 15.24 | learning rate: 5.699E-06 | global batch size:    16 | lm loss: 7.166890E+00 | grad norm: 1.003 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1088/  128728 | consumed samples:        17408 | consumed tokens:     35651584 | elapsed time per iteration (s): 15.24 | learning rate: 5.704E-06 | global batch size:    16 | lm loss: 7.437174E+00 | grad norm: 1.190 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1089/  128728 | consumed samples:        17424 | consumed tokens:     35684352 | elapsed time per iteration (s): 15.27 | learning rate: 5.710E-06 | global batch size:    16 | lm loss: 7.178500E+00 | grad norm: 1.345 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1090/  128728 | consumed samples:        17440 | consumed tokens:     35717120 | elapsed time per iteration (s): 15.24 | learning rate: 5.715E-06 | global batch size:    16 | lm loss: 7.343741E+00 | grad norm: 1.133 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1091/  128728 | consumed samples:        17456 | consumed tokens:     35749888 | elapsed time per iteration (s): 15.26 | learning rate: 5.720E-06 | global batch size:    16 | lm loss: 7.443361E+00 | grad norm: 1.581 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1092/  128728 | consumed samples:        17472 | consumed tokens:     35782656 | elapsed time per iteration (s): 15.23 | learning rate: 5.725E-06 | global batch size:    16 | lm loss: 7.196815E+00 | grad norm: 1.375 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1093/  128728 | consumed samples:        17488 | consumed tokens:     35815424 | elapsed time per iteration (s): 15.19 | learning rate: 5.730E-06 | global batch size:    16 | lm loss: 7.417691E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1094/  128728 | consumed samples:        17504 | consumed tokens:     35848192 | elapsed time per iteration (s): 15.23 | learning rate: 5.736E-06 | global batch size:    16 | lm loss: 7.217441E+00 | grad norm: 1.279 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1095/  128728 | consumed samples:        17520 | consumed tokens:     35880960 | elapsed time per iteration (s): 15.26 | learning rate: 5.741E-06 | global batch size:    16 | lm loss: 7.141168E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1096/  128728 | consumed samples:        17536 | consumed tokens:     35913728 | elapsed time per iteration (s): 15.23 | learning rate: 5.746E-06 | global batch size:    16 | lm loss: 7.413390E+00 | grad norm: 1.202 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1097/  128728 | consumed samples:        17552 | consumed tokens:     35946496 | elapsed time per iteration (s): 15.23 | learning rate: 5.751E-06 | global batch size:    16 | lm loss: 7.284686E+00 | grad norm: 1.047 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1098/  128728 | consumed samples:        17568 | consumed tokens:     35979264 | elapsed time per iteration (s): 15.25 | learning rate: 5.757E-06 | global batch size:    16 | lm loss: 7.118299E+00 | grad norm: 1.404 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1099/  128728 | consumed samples:        17584 | consumed tokens:     36012032 | elapsed time per iteration (s): 15.26 | learning rate: 5.762E-06 | global batch size:    16 | lm loss: 7.185723E+00 | grad norm: 1.040 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1100/  128728 | consumed samples:        17600 | consumed tokens:     36044800 | elapsed time per iteration (s): 15.24 | learning rate: 5.767E-06 | global batch size:    16 | lm loss: 7.335216E+00 | grad norm: 1.337 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1101/  128728 | consumed samples:        17616 | consumed tokens:     36077568 | elapsed time per iteration (s): 15.21 | learning rate: 5.772E-06 | global batch size:    16 | lm loss: 7.115668E+00 | grad norm: 1.106 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1102/  128728 | consumed samples:        17632 | consumed tokens:     36110336 | elapsed time per iteration (s): 15.23 | learning rate: 5.778E-06 | global batch size:    16 | lm loss: 7.229290E+00 | grad norm: 1.643 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1103/  128728 | consumed samples:        17648 | consumed tokens:     36143104 | elapsed time per iteration (s): 15.21 | learning rate: 5.783E-06 | global batch size:    16 | lm loss: 7.195288E+00 | grad norm: 0.993 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1104/  128728 | consumed samples:        17664 | consumed tokens:     36175872 | elapsed time per iteration (s): 15.24 | learning rate: 5.788E-06 | global batch size:    16 | lm loss: 7.160654E+00 | grad norm: 1.414 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1105/  128728 | consumed samples:        17680 | consumed tokens:     36208640 | elapsed time per iteration (s): 15.26 | learning rate: 5.793E-06 | global batch size:    16 | lm loss: 7.244509E+00 | grad norm: 0.998 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1106/  128728 | consumed samples:        17696 | consumed tokens:     36241408 | elapsed time per iteration (s): 15.25 | learning rate: 5.799E-06 | global batch size:    16 | lm loss: 7.244285E+00 | grad norm: 1.281 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1107/  128728 | consumed samples:        17712 | consumed tokens:     36274176 | elapsed time per iteration (s): 15.25 | learning rate: 5.804E-06 | global batch size:    16 | lm loss: 7.213204E+00 | grad norm: 0.970 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1108/  128728 | consumed samples:        17728 | consumed tokens:     36306944 | elapsed time per iteration (s): 15.25 | learning rate: 5.809E-06 | global batch size:    16 | lm loss: 7.250452E+00 | grad norm: 1.345 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1109/  128728 | consumed samples:        17744 | consumed tokens:     36339712 | elapsed time per iteration (s): 15.22 | learning rate: 5.814E-06 | global batch size:    16 | lm loss: 7.291537E+00 | grad norm: 0.972 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1110/  128728 | consumed samples:        17760 | consumed tokens:     36372480 | elapsed time per iteration (s): 15.21 | learning rate: 5.820E-06 | global batch size:    16 | lm loss: 7.145199E+00 | grad norm: 1.012 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1111/  128728 | consumed samples:        17776 | consumed tokens:     36405248 | elapsed time per iteration (s): 15.25 | learning rate: 5.825E-06 | global batch size:    16 | lm loss: 7.345960E+00 | grad norm: 1.036 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1112/  128728 | consumed samples:        17792 | consumed tokens:     36438016 | elapsed time per iteration (s): 15.22 | learning rate: 5.830E-06 | global batch size:    16 | lm loss: 7.107178E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1113/  128728 | consumed samples:        17808 | consumed tokens:     36470784 | elapsed time per iteration (s): 15.23 | learning rate: 5.835E-06 | global batch size:    16 | lm loss: 6.999576E+00 | grad norm: 0.989 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1114/  128728 | consumed samples:        17824 | consumed tokens:     36503552 | elapsed time per iteration (s): 15.23 | learning rate: 5.841E-06 | global batch size:    16 | lm loss: 7.287607E+00 | grad norm: 1.070 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1115/  128728 | consumed samples:        17840 | consumed tokens:     36536320 | elapsed time per iteration (s): 15.24 | learning rate: 5.846E-06 | global batch size:    16 | lm loss: 7.054477E+00 | grad norm: 1.085 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1116/  128728 | consumed samples:        17856 | consumed tokens:     36569088 | elapsed time per iteration (s): 15.23 | learning rate: 5.851E-06 | global batch size:    16 | lm loss: 7.217619E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1117/  128728 | consumed samples:        17872 | consumed tokens:     36601856 | elapsed time per iteration (s): 15.23 | learning rate: 5.856E-06 | global batch size:    16 | lm loss: 7.185878E+00 | grad norm: 0.975 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1118/  128728 | consumed samples:        17888 | consumed tokens:     36634624 | elapsed time per iteration (s): 15.21 | learning rate: 5.862E-06 | global batch size:    16 | lm loss: 7.304596E+00 | grad norm: 0.967 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1119/  128728 | consumed samples:        17904 | consumed tokens:     36667392 | elapsed time per iteration (s): 15.22 | learning rate: 5.867E-06 | global batch size:    16 | lm loss: 7.287797E+00 | grad norm: 0.989 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1120/  128728 | consumed samples:        17920 | consumed tokens:     36700160 | elapsed time per iteration (s): 15.26 | learning rate: 5.872E-06 | global batch size:    16 | lm loss: 7.236816E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1121/  128728 | consumed samples:        17936 | consumed tokens:     36732928 | elapsed time per iteration (s): 15.22 | learning rate: 5.877E-06 | global batch size:    16 | lm loss: 7.148897E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1122/  128728 | consumed samples:        17952 | consumed tokens:     36765696 | elapsed time per iteration (s): 15.23 | learning rate: 5.883E-06 | global batch size:    16 | lm loss: 7.309883E+00 | grad norm: 0.923 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1123/  128728 | consumed samples:        17968 | consumed tokens:     36798464 | elapsed time per iteration (s): 15.18 | learning rate: 5.888E-06 | global batch size:    16 | lm loss: 7.121294E+00 | grad norm: 1.055 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1124/  128728 | consumed samples:        17984 | consumed tokens:     36831232 | elapsed time per iteration (s): 15.23 | learning rate: 5.893E-06 | global batch size:    16 | lm loss: 7.235108E+00 | grad norm: 0.990 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1125/  128728 | consumed samples:        18000 | consumed tokens:     36864000 | elapsed time per iteration (s): 15.22 | learning rate: 5.898E-06 | global batch size:    16 | lm loss: 7.221193E+00 | grad norm: 1.288 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1126/  128728 | consumed samples:        18016 | consumed tokens:     36896768 | elapsed time per iteration (s): 15.24 | learning rate: 5.903E-06 | global batch size:    16 | lm loss: 7.522739E+00 | grad norm: 1.545 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1127/  128728 | consumed samples:        18032 | consumed tokens:     36929536 | elapsed time per iteration (s): 15.25 | learning rate: 5.909E-06 | global batch size:    16 | lm loss: 7.258095E+00 | grad norm: 1.239 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1128/  128728 | consumed samples:        18048 | consumed tokens:     36962304 | elapsed time per iteration (s): 15.24 | learning rate: 5.914E-06 | global batch size:    16 | lm loss: 7.177681E+00 | grad norm: 1.109 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1129/  128728 | consumed samples:        18064 | consumed tokens:     36995072 | elapsed time per iteration (s): 15.23 | learning rate: 5.919E-06 | global batch size:    16 | lm loss: 7.164636E+00 | grad norm: 1.038 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1130/  128728 | consumed samples:        18080 | consumed tokens:     37027840 | elapsed time per iteration (s): 15.23 | learning rate: 5.924E-06 | global batch size:    16 | lm loss: 6.921859E+00 | grad norm: 1.051 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1131/  128728 | consumed samples:        18096 | consumed tokens:     37060608 | elapsed time per iteration (s): 15.23 | learning rate: 5.930E-06 | global batch size:    16 | lm loss: 6.996799E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1132/  128728 | consumed samples:        18112 | consumed tokens:     37093376 | elapsed time per iteration (s): 15.26 | learning rate: 5.935E-06 | global batch size:    16 | lm loss: 7.323952E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1133/  128728 | consumed samples:        18128 | consumed tokens:     37126144 | elapsed time per iteration (s): 15.22 | learning rate: 5.940E-06 | global batch size:    16 | lm loss: 7.006363E+00 | grad norm: 1.066 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1134/  128728 | consumed samples:        18144 | consumed tokens:     37158912 | elapsed time per iteration (s): 15.26 | learning rate: 5.945E-06 | global batch size:    16 | lm loss: 7.190140E+00 | grad norm: 1.403 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1135/  128728 | consumed samples:        18160 | consumed tokens:     37191680 | elapsed time per iteration (s): 15.27 | learning rate: 5.951E-06 | global batch size:    16 | lm loss: 7.225429E+00 | grad norm: 1.054 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1136/  128728 | consumed samples:        18176 | consumed tokens:     37224448 | elapsed time per iteration (s): 15.17 | learning rate: 5.956E-06 | global batch size:    16 | lm loss: 7.188299E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1137/  128728 | consumed samples:        18192 | consumed tokens:     37257216 | elapsed time per iteration (s): 15.20 | learning rate: 5.961E-06 | global batch size:    16 | lm loss: 7.277708E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1138/  128728 | consumed samples:        18208 | consumed tokens:     37289984 | elapsed time per iteration (s): 15.23 | learning rate: 5.966E-06 | global batch size:    16 | lm loss: 7.208605E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1139/  128728 | consumed samples:        18224 | consumed tokens:     37322752 | elapsed time per iteration (s): 15.24 | learning rate: 5.972E-06 | global batch size:    16 | lm loss: 7.097051E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1140/  128728 | consumed samples:        18240 | consumed tokens:     37355520 | elapsed time per iteration (s): 15.22 | learning rate: 5.977E-06 | global batch size:    16 | lm loss: 7.225067E+00 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1141/  128728 | consumed samples:        18256 | consumed tokens:     37388288 | elapsed time per iteration (s): 15.21 | learning rate: 5.982E-06 | global batch size:    16 | lm loss: 7.149609E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1142/  128728 | consumed samples:        18272 | consumed tokens:     37421056 | elapsed time per iteration (s): 15.22 | learning rate: 5.987E-06 | global batch size:    16 | lm loss: 7.092099E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1143/  128728 | consumed samples:        18288 | consumed tokens:     37453824 | elapsed time per iteration (s): 15.27 | learning rate: 5.993E-06 | global batch size:    16 | lm loss: 7.053136E+00 | grad norm: 1.227 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1144/  128728 | consumed samples:        18304 | consumed tokens:     37486592 | elapsed time per iteration (s): 15.25 | learning rate: 5.998E-06 | global batch size:    16 | lm loss: 7.427276E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1145/  128728 | consumed samples:        18320 | consumed tokens:     37519360 | elapsed time per iteration (s): 15.17 | learning rate: 6.003E-06 | global batch size:    16 | lm loss: 7.303183E+00 | grad norm: 1.194 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     1146/  128728 | consumed samples:        18336 | consumed tokens:     37552128 | elapsed time per iteration (s): 15.22 | learning rate: 6.008E-06 | global batch size:    16 | lm loss: 7.172232E+00 | grad norm: 1.002 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1147/  128728 | consumed samples:        18352 | consumed tokens:     37584896 | elapsed time per iteration (s): 15.25 | learning rate: 6.014E-06 | global batch size:    16 | lm loss: 7.312620E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1148/  128728 | consumed samples:        18368 | consumed tokens:     37617664 | elapsed time per iteration (s): 15.23 | learning rate: 6.019E-06 | global batch size:    16 | lm loss: 7.070820E+00 | grad norm: 1.201 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1149/  128728 | consumed samples:        18384 | consumed tokens:     37650432 | elapsed time per iteration (s): 15.16 | learning rate: 6.024E-06 | global batch size:    16 | lm loss: 7.238826E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1150/  128728 | consumed samples:        18400 | consumed tokens:     37683200 | elapsed time per iteration (s): 15.22 | learning rate: 6.029E-06 | global batch size:    16 | lm loss: 7.159998E+00 | grad norm: 1.042 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1151/  128728 | consumed samples:        18416 | consumed tokens:     37715968 | elapsed time per iteration (s): 15.23 | learning rate: 6.035E-06 | global batch size:    16 | lm loss: 7.089765E+00 | grad norm: 1.200 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1152/  128728 | consumed samples:        18432 | consumed tokens:     37748736 | elapsed time per iteration (s): 15.23 | learning rate: 6.040E-06 | global batch size:    16 | lm loss: 7.016187E+00 | grad norm: 1.086 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1153/  128728 | consumed samples:        18448 | consumed tokens:     37781504 | elapsed time per iteration (s): 15.19 | learning rate: 6.045E-06 | global batch size:    16 | lm loss: 7.231027E+00 | grad norm: 1.328 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1154/  128728 | consumed samples:        18464 | consumed tokens:     37814272 | elapsed time per iteration (s): 15.23 | learning rate: 6.050E-06 | global batch size:    16 | lm loss: 7.197011E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1155/  128728 | consumed samples:        18480 | consumed tokens:     37847040 | elapsed time per iteration (s): 15.26 | learning rate: 6.056E-06 | global batch size:    16 | lm loss: 7.368340E+00 | grad norm: 1.038 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1156/  128728 | consumed samples:        18496 | consumed tokens:     37879808 | elapsed time per iteration (s): 15.20 | learning rate: 6.061E-06 | global batch size:    16 | lm loss: 7.069404E+00 | grad norm: 0.971 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1157/  128728 | consumed samples:        18512 | consumed tokens:     37912576 | elapsed time per iteration (s): 15.22 | learning rate: 6.066E-06 | global batch size:    16 | lm loss: 7.192194E+00 | grad norm: 1.085 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1158/  128728 | consumed samples:        18528 | consumed tokens:     37945344 | elapsed time per iteration (s): 15.22 | learning rate: 6.071E-06 | global batch size:    16 | lm loss: 7.340763E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1159/  128728 | consumed samples:        18544 | consumed tokens:     37978112 | elapsed time per iteration (s): 15.25 | learning rate: 6.077E-06 | global batch size:    16 | lm loss: 6.942504E+00 | grad norm: 1.469 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1160/  128728 | consumed samples:        18560 | consumed tokens:     38010880 | elapsed time per iteration (s): 15.26 | learning rate: 6.082E-06 | global batch size:    16 | lm loss: 7.018706E+00 | grad norm: 1.079 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1161/  128728 | consumed samples:        18576 | consumed tokens:     38043648 | elapsed time per iteration (s): 15.21 | learning rate: 6.087E-06 | global batch size:    16 | lm loss: 7.082819E+00 | grad norm: 1.087 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1162/  128728 | consumed samples:        18592 | consumed tokens:     38076416 | elapsed time per iteration (s): 15.24 | learning rate: 6.092E-06 | global batch size:    16 | lm loss: 7.236361E+00 | grad norm: 1.023 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1163/  128728 | consumed samples:        18608 | consumed tokens:     38109184 | elapsed time per iteration (s): 15.20 | learning rate: 6.097E-06 | global batch size:    16 | lm loss: 7.258739E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1164/  128728 | consumed samples:        18624 | consumed tokens:     38141952 | elapsed time per iteration (s): 15.25 | learning rate: 6.103E-06 | global batch size:    16 | lm loss: 6.894892E+00 | grad norm: 0.914 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1165/  128728 | consumed samples:        18640 | consumed tokens:     38174720 | elapsed time per iteration (s): 15.26 | learning rate: 6.108E-06 | global batch size:    16 | lm loss: 7.280957E+00 | grad norm: 1.115 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1166/  128728 | consumed samples:        18656 | consumed tokens:     38207488 | elapsed time per iteration (s): 15.21 | learning rate: 6.113E-06 | global batch size:    16 | lm loss: 7.098267E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1167/  128728 | consumed samples:        18672 | consumed tokens:     38240256 | elapsed time per iteration (s): 15.30 | learning rate: 6.118E-06 | global batch size:    16 | lm loss: 7.147165E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     1168/  128728 | consumed samples:        18688 | consumed tokens:     38273024 | elapsed time per iteration (s): 15.23 | learning rate: 6.124E-06 | global batch size:    16 | lm loss: 7.112779E+00 | grad norm: 1.163 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1169/  128728 | consumed samples:        18704 | consumed tokens:     38305792 | elapsed time per iteration (s): 15.26 | learning rate: 6.129E-06 | global batch size:    16 | lm loss: 7.251498E+00 | grad norm: 1.173 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1170/  128728 | consumed samples:        18720 | consumed tokens:     38338560 | elapsed time per iteration (s): 15.27 | learning rate: 6.134E-06 | global batch size:    16 | lm loss: 7.245819E+00 | grad norm: 1.799 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1171/  128728 | consumed samples:        18736 | consumed tokens:     38371328 | elapsed time per iteration (s): 15.21 | learning rate: 6.139E-06 | global batch size:    16 | lm loss: 7.118947E+00 | grad norm: 1.379 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1172/  128728 | consumed samples:        18752 | consumed tokens:     38404096 | elapsed time per iteration (s): 15.22 | learning rate: 6.145E-06 | global batch size:    16 | lm loss: 7.312955E+00 | grad norm: 1.041 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1173/  128728 | consumed samples:        18768 | consumed tokens:     38436864 | elapsed time per iteration (s): 15.21 | learning rate: 6.150E-06 | global batch size:    16 | lm loss: 7.203588E+00 | grad norm: 1.038 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1174/  128728 | consumed samples:        18784 | consumed tokens:     38469632 | elapsed time per iteration (s): 15.16 | learning rate: 6.155E-06 | global batch size:    16 | lm loss: 7.083356E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1175/  128728 | consumed samples:        18800 | consumed tokens:     38502400 | elapsed time per iteration (s): 15.23 | learning rate: 6.160E-06 | global batch size:    16 | lm loss: 7.164299E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1176/  128728 | consumed samples:        18816 | consumed tokens:     38535168 | elapsed time per iteration (s): 15.25 | learning rate: 6.166E-06 | global batch size:    16 | lm loss: 7.204933E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1177/  128728 | consumed samples:        18832 | consumed tokens:     38567936 | elapsed time per iteration (s): 15.22 | learning rate: 6.171E-06 | global batch size:    16 | lm loss: 7.019668E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1178/  128728 | consumed samples:        18848 | consumed tokens:     38600704 | elapsed time per iteration (s): 15.23 | learning rate: 6.176E-06 | global batch size:    16 | lm loss: 7.238056E+00 | grad norm: 1.089 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1179/  128728 | consumed samples:        18864 | consumed tokens:     38633472 | elapsed time per iteration (s): 15.25 | learning rate: 6.181E-06 | global batch size:    16 | lm loss: 7.101101E+00 | grad norm: 1.046 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1180/  128728 | consumed samples:        18880 | consumed tokens:     38666240 | elapsed time per iteration (s): 15.21 | learning rate: 6.187E-06 | global batch size:    16 | lm loss: 7.030687E+00 | grad norm: 0.987 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1181/  128728 | consumed samples:        18896 | consumed tokens:     38699008 | elapsed time per iteration (s): 15.25 | learning rate: 6.192E-06 | global batch size:    16 | lm loss: 7.330659E+00 | grad norm: 1.406 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1182/  128728 | consumed samples:        18912 | consumed tokens:     38731776 | elapsed time per iteration (s): 15.24 | learning rate: 6.197E-06 | global batch size:    16 | lm loss: 7.227168E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1183/  128728 | consumed samples:        18928 | consumed tokens:     38764544 | elapsed time per iteration (s): 15.25 | learning rate: 6.202E-06 | global batch size:    16 | lm loss: 7.105655E+00 | grad norm: 1.249 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1184/  128728 | consumed samples:        18944 | consumed tokens:     38797312 | elapsed time per iteration (s): 15.19 | learning rate: 6.208E-06 | global batch size:    16 | lm loss: 7.421823E+00 | grad norm: 1.126 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1185/  128728 | consumed samples:        18960 | consumed tokens:     38830080 | elapsed time per iteration (s): 15.11 | learning rate: 6.213E-06 | global batch size:    16 | lm loss: 7.161137E+00 | grad norm: 0.996 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.059 | TFLOPs: 8.11 |
[default7]: iteration     1186/  128728 | consumed samples:        18976 | consumed tokens:     38862848 | elapsed time per iteration (s): 15.24 | learning rate: 6.218E-06 | global batch size:    16 | lm loss: 7.420480E+00 | grad norm: 1.236 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1187/  128728 | consumed samples:        18992 | consumed tokens:     38895616 | elapsed time per iteration (s): 15.26 | learning rate: 6.223E-06 | global batch size:    16 | lm loss: 7.459645E+00 | grad norm: 1.665 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1188/  128728 | consumed samples:        19008 | consumed tokens:     38928384 | elapsed time per iteration (s): 15.24 | learning rate: 6.229E-06 | global batch size:    16 | lm loss: 7.134075E+00 | grad norm: 1.416 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1189/  128728 | consumed samples:        19024 | consumed tokens:     38961152 | elapsed time per iteration (s): 15.23 | learning rate: 6.234E-06 | global batch size:    16 | lm loss: 7.168115E+00 | grad norm: 1.050 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1190/  128728 | consumed samples:        19040 | consumed tokens:     38993920 | elapsed time per iteration (s): 15.20 | learning rate: 6.239E-06 | global batch size:    16 | lm loss: 7.134392E+00 | grad norm: 1.056 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1191/  128728 | consumed samples:        19056 | consumed tokens:     39026688 | elapsed time per iteration (s): 15.18 | learning rate: 6.244E-06 | global batch size:    16 | lm loss: 7.327762E+00 | grad norm: 1.271 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1192/  128728 | consumed samples:        19072 | consumed tokens:     39059456 | elapsed time per iteration (s): 15.27 | learning rate: 6.250E-06 | global batch size:    16 | lm loss: 7.085316E+00 | grad norm: 1.333 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1193/  128728 | consumed samples:        19088 | consumed tokens:     39092224 | elapsed time per iteration (s): 15.24 | learning rate: 6.255E-06 | global batch size:    16 | lm loss: 7.026468E+00 | grad norm: 1.155 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1194/  128728 | consumed samples:        19104 | consumed tokens:     39124992 | elapsed time per iteration (s): 15.26 | learning rate: 6.260E-06 | global batch size:    16 | lm loss: 7.376468E+00 | grad norm: 1.375 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1195/  128728 | consumed samples:        19120 | consumed tokens:     39157760 | elapsed time per iteration (s): 15.25 | learning rate: 6.265E-06 | global batch size:    16 | lm loss: 7.219844E+00 | grad norm: 1.118 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1196/  128728 | consumed samples:        19136 | consumed tokens:     39190528 | elapsed time per iteration (s): 15.25 | learning rate: 6.271E-06 | global batch size:    16 | lm loss: 7.149906E+00 | grad norm: 1.236 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1197/  128728 | consumed samples:        19152 | consumed tokens:     39223296 | elapsed time per iteration (s): 15.22 | learning rate: 6.276E-06 | global batch size:    16 | lm loss: 6.934923E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1198/  128728 | consumed samples:        19168 | consumed tokens:     39256064 | elapsed time per iteration (s): 15.22 | learning rate: 6.281E-06 | global batch size:    16 | lm loss: 6.979043E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1199/  128728 | consumed samples:        19184 | consumed tokens:     39288832 | elapsed time per iteration (s): 15.23 | learning rate: 6.286E-06 | global batch size:    16 | lm loss: 7.078469E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1200/  128728 | consumed samples:        19200 | consumed tokens:     39321600 | elapsed time per iteration (s): 15.25 | learning rate: 6.291E-06 | global batch size:    16 | lm loss: 7.111989E+00 | grad norm: 0.881 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1201/  128728 | consumed samples:        19216 | consumed tokens:     39354368 | elapsed time per iteration (s): 15.23 | learning rate: 6.297E-06 | global batch size:    16 | lm loss: 7.255686E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1202/  128728 | consumed samples:        19232 | consumed tokens:     39387136 | elapsed time per iteration (s): 15.23 | learning rate: 6.302E-06 | global batch size:    16 | lm loss: 7.404012E+00 | grad norm: 1.110 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1203/  128728 | consumed samples:        19248 | consumed tokens:     39419904 | elapsed time per iteration (s): 15.26 | learning rate: 6.307E-06 | global batch size:    16 | lm loss: 7.017631E+00 | grad norm: 1.472 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1204/  128728 | consumed samples:        19264 | consumed tokens:     39452672 | elapsed time per iteration (s): 15.22 | learning rate: 6.312E-06 | global batch size:    16 | lm loss: 7.073680E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1205/  128728 | consumed samples:        19280 | consumed tokens:     39485440 | elapsed time per iteration (s): 15.24 | learning rate: 6.318E-06 | global batch size:    16 | lm loss: 7.345861E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1206/  128728 | consumed samples:        19296 | consumed tokens:     39518208 | elapsed time per iteration (s): 15.21 | learning rate: 6.323E-06 | global batch size:    16 | lm loss: 7.009941E+00 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1207/  128728 | consumed samples:        19312 | consumed tokens:     39550976 | elapsed time per iteration (s): 15.23 | learning rate: 6.328E-06 | global batch size:    16 | lm loss: 7.123629E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1208/  128728 | consumed samples:        19328 | consumed tokens:     39583744 | elapsed time per iteration (s): 15.17 | learning rate: 6.333E-06 | global batch size:    16 | lm loss: 7.077274E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1209/  128728 | consumed samples:        19344 | consumed tokens:     39616512 | elapsed time per iteration (s): 15.20 | learning rate: 6.339E-06 | global batch size:    16 | lm loss: 7.096000E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1210/  128728 | consumed samples:        19360 | consumed tokens:     39649280 | elapsed time per iteration (s): 15.23 | learning rate: 6.344E-06 | global batch size:    16 | lm loss: 7.476648E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1211/  128728 | consumed samples:        19376 | consumed tokens:     39682048 | elapsed time per iteration (s): 15.22 | learning rate: 6.349E-06 | global batch size:    16 | lm loss: 6.972303E+00 | grad norm: 1.382 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1212/  128728 | consumed samples:        19392 | consumed tokens:     39714816 | elapsed time per iteration (s): 15.24 | learning rate: 6.354E-06 | global batch size:    16 | lm loss: 7.088462E+00 | grad norm: 1.216 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1213/  128728 | consumed samples:        19408 | consumed tokens:     39747584 | elapsed time per iteration (s): 15.25 | learning rate: 6.360E-06 | global batch size:    16 | lm loss: 7.357036E+00 | grad norm: 1.961 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1214/  128728 | consumed samples:        19424 | consumed tokens:     39780352 | elapsed time per iteration (s): 15.24 | learning rate: 6.365E-06 | global batch size:    16 | lm loss: 7.337027E+00 | grad norm: 1.164 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1215/  128728 | consumed samples:        19440 | consumed tokens:     39813120 | elapsed time per iteration (s): 15.24 | learning rate: 6.370E-06 | global batch size:    16 | lm loss: 6.935066E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1216/  128728 | consumed samples:        19456 | consumed tokens:     39845888 | elapsed time per iteration (s): 15.22 | learning rate: 6.375E-06 | global batch size:    16 | lm loss: 7.197056E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1217/  128728 | consumed samples:        19472 | consumed tokens:     39878656 | elapsed time per iteration (s): 15.24 | learning rate: 6.381E-06 | global batch size:    16 | lm loss: 7.179683E+00 | grad norm: 1.091 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1218/  128728 | consumed samples:        19488 | consumed tokens:     39911424 | elapsed time per iteration (s): 15.22 | learning rate: 6.386E-06 | global batch size:    16 | lm loss: 7.041315E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1219/  128728 | consumed samples:        19504 | consumed tokens:     39944192 | elapsed time per iteration (s): 15.19 | learning rate: 6.391E-06 | global batch size:    16 | lm loss: 7.058975E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     1220/  128728 | consumed samples:        19520 | consumed tokens:     39976960 | elapsed time per iteration (s): 15.25 | learning rate: 6.396E-06 | global batch size:    16 | lm loss: 7.103866E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1221/  128728 | consumed samples:        19536 | consumed tokens:     40009728 | elapsed time per iteration (s): 15.22 | learning rate: 6.402E-06 | global batch size:    16 | lm loss: 7.216382E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1222/  128728 | consumed samples:        19552 | consumed tokens:     40042496 | elapsed time per iteration (s): 15.24 | learning rate: 6.407E-06 | global batch size:    16 | lm loss: 6.964835E+00 | grad norm: 0.956 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1223/  128728 | consumed samples:        19568 | consumed tokens:     40075264 | elapsed time per iteration (s): 15.19 | learning rate: 6.412E-06 | global batch size:    16 | lm loss: 6.933653E+00 | grad norm: 0.923 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1224/  128728 | consumed samples:        19584 | consumed tokens:     40108032 | elapsed time per iteration (s): 15.26 | learning rate: 6.417E-06 | global batch size:    16 | lm loss: 7.316370E+00 | grad norm: 1.229 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1225/  128728 | consumed samples:        19600 | consumed tokens:     40140800 | elapsed time per iteration (s): 15.22 | learning rate: 6.423E-06 | global batch size:    16 | lm loss: 7.115275E+00 | grad norm: 1.164 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1226/  128728 | consumed samples:        19616 | consumed tokens:     40173568 | elapsed time per iteration (s): 15.25 | learning rate: 6.428E-06 | global batch size:    16 | lm loss: 7.078380E+00 | grad norm: 1.123 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1227/  128728 | consumed samples:        19632 | consumed tokens:     40206336 | elapsed time per iteration (s): 15.32 | learning rate: 6.433E-06 | global batch size:    16 | lm loss: 7.140039E+00 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.045 | TFLOPs: 8.00 |
[default7]: iteration     1228/  128728 | consumed samples:        19648 | consumed tokens:     40239104 | elapsed time per iteration (s): 15.24 | learning rate: 6.438E-06 | global batch size:    16 | lm loss: 6.979059E+00 | grad norm: 0.991 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1229/  128728 | consumed samples:        19664 | consumed tokens:     40271872 | elapsed time per iteration (s): 15.24 | learning rate: 6.444E-06 | global batch size:    16 | lm loss: 7.118724E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1230/  128728 | consumed samples:        19680 | consumed tokens:     40304640 | elapsed time per iteration (s): 15.24 | learning rate: 6.449E-06 | global batch size:    16 | lm loss: 7.120239E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1231/  128728 | consumed samples:        19696 | consumed tokens:     40337408 | elapsed time per iteration (s): 15.18 | learning rate: 6.454E-06 | global batch size:    16 | lm loss: 7.180079E+00 | grad norm: 1.100 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1232/  128728 | consumed samples:        19712 | consumed tokens:     40370176 | elapsed time per iteration (s): 15.25 | learning rate: 6.459E-06 | global batch size:    16 | lm loss: 7.335692E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1233/  128728 | consumed samples:        19728 | consumed tokens:     40402944 | elapsed time per iteration (s): 15.23 | learning rate: 6.464E-06 | global batch size:    16 | lm loss: 7.010607E+00 | grad norm: 0.975 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1234/  128728 | consumed samples:        19744 | consumed tokens:     40435712 | elapsed time per iteration (s): 15.20 | learning rate: 6.470E-06 | global batch size:    16 | lm loss: 6.938548E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1235/  128728 | consumed samples:        19760 | consumed tokens:     40468480 | elapsed time per iteration (s): 15.27 | learning rate: 6.475E-06 | global batch size:    16 | lm loss: 7.146415E+00 | grad norm: 1.280 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1236/  128728 | consumed samples:        19776 | consumed tokens:     40501248 | elapsed time per iteration (s): 15.26 | learning rate: 6.480E-06 | global batch size:    16 | lm loss: 7.039947E+00 | grad norm: 1.246 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1237/  128728 | consumed samples:        19792 | consumed tokens:     40534016 | elapsed time per iteration (s): 15.27 | learning rate: 6.485E-06 | global batch size:    16 | lm loss: 7.084141E+00 | grad norm: 1.252 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1238/  128728 | consumed samples:        19808 | consumed tokens:     40566784 | elapsed time per iteration (s): 15.27 | learning rate: 6.491E-06 | global batch size:    16 | lm loss: 7.073313E+00 | grad norm: 1.034 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1239/  128728 | consumed samples:        19824 | consumed tokens:     40599552 | elapsed time per iteration (s): 15.24 | learning rate: 6.496E-06 | global batch size:    16 | lm loss: 6.969284E+00 | grad norm: 1.236 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1240/  128728 | consumed samples:        19840 | consumed tokens:     40632320 | elapsed time per iteration (s): 15.20 | learning rate: 6.501E-06 | global batch size:    16 | lm loss: 7.203765E+00 | grad norm: 1.238 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1241/  128728 | consumed samples:        19856 | consumed tokens:     40665088 | elapsed time per iteration (s): 15.24 | learning rate: 6.506E-06 | global batch size:    16 | lm loss: 7.026887E+00 | grad norm: 1.078 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1242/  128728 | consumed samples:        19872 | consumed tokens:     40697856 | elapsed time per iteration (s): 15.25 | learning rate: 6.512E-06 | global batch size:    16 | lm loss: 7.141012E+00 | grad norm: 1.024 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1243/  128728 | consumed samples:        19888 | consumed tokens:     40730624 | elapsed time per iteration (s): 15.27 | learning rate: 6.517E-06 | global batch size:    16 | lm loss: 6.841239E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1244/  128728 | consumed samples:        19904 | consumed tokens:     40763392 | elapsed time per iteration (s): 15.64 | learning rate: 6.522E-06 | global batch size:    16 | lm loss: 6.917506E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.023 | TFLOPs: 7.83 |
[default7]: iteration     1245/  128728 | consumed samples:        19920 | consumed tokens:     40796160 | elapsed time per iteration (s): 19.06 | learning rate: 6.527E-06 | global batch size:    16 | lm loss: 7.028550E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.840 | TFLOPs: 6.43 |
[default7]: iteration     1246/  128728 | consumed samples:        19936 | consumed tokens:     40828928 | elapsed time per iteration (s): 17.59 | learning rate: 6.533E-06 | global batch size:    16 | lm loss: 7.041822E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.910 | TFLOPs: 6.96 |
[default7]: iteration     1247/  128728 | consumed samples:        19952 | consumed tokens:     40861696 | elapsed time per iteration (s): 18.59 | learning rate: 6.538E-06 | global batch size:    16 | lm loss: 6.829185E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.861 | TFLOPs: 6.59 |
[default7]: iteration     1248/  128728 | consumed samples:        19968 | consumed tokens:     40894464 | elapsed time per iteration (s): 23.66 | learning rate: 6.543E-06 | global batch size:    16 | lm loss: 7.007943E+00 | grad norm: 1.088 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.676 | TFLOPs: 5.18 |
[default7]: iteration     1249/  128728 | consumed samples:        19984 | consumed tokens:     40927232 | elapsed time per iteration (s): 16.26 | learning rate: 6.548E-06 | global batch size:    16 | lm loss: 7.074346E+00 | grad norm: 1.148 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.984 | TFLOPs: 7.53 |
[default7]: iteration     1250/  128728 | consumed samples:        20000 | consumed tokens:     40960000 | elapsed time per iteration (s): 15.27 | learning rate: 6.554E-06 | global batch size:    16 | lm loss: 7.107431E+00 | grad norm: 1.066 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1251/  128728 | consumed samples:        20016 | consumed tokens:     40992768 | elapsed time per iteration (s): 15.25 | learning rate: 6.559E-06 | global batch size:    16 | lm loss: 6.935212E+00 | grad norm: 0.984 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1252/  128728 | consumed samples:        20032 | consumed tokens:     41025536 | elapsed time per iteration (s): 15.20 | learning rate: 6.564E-06 | global batch size:    16 | lm loss: 7.023438E+00 | grad norm: 0.908 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1253/  128728 | consumed samples:        20048 | consumed tokens:     41058304 | elapsed time per iteration (s): 15.21 | learning rate: 6.569E-06 | global batch size:    16 | lm loss: 7.031582E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1254/  128728 | consumed samples:        20064 | consumed tokens:     41091072 | elapsed time per iteration (s): 15.27 | learning rate: 6.575E-06 | global batch size:    16 | lm loss: 7.013303E+00 | grad norm: 1.230 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1255/  128728 | consumed samples:        20080 | consumed tokens:     41123840 | elapsed time per iteration (s): 15.22 | learning rate: 6.580E-06 | global batch size:    16 | lm loss: 7.063211E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1256/  128728 | consumed samples:        20096 | consumed tokens:     41156608 | elapsed time per iteration (s): 15.24 | learning rate: 6.585E-06 | global batch size:    16 | lm loss: 6.951248E+00 | grad norm: 0.999 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1257/  128728 | consumed samples:        20112 | consumed tokens:     41189376 | elapsed time per iteration (s): 15.22 | learning rate: 6.590E-06 | global batch size:    16 | lm loss: 7.142652E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1258/  128728 | consumed samples:        20128 | consumed tokens:     41222144 | elapsed time per iteration (s): 15.21 | learning rate: 6.596E-06 | global batch size:    16 | lm loss: 7.207096E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1259/  128728 | consumed samples:        20144 | consumed tokens:     41254912 | elapsed time per iteration (s): 15.25 | learning rate: 6.601E-06 | global batch size:    16 | lm loss: 7.017394E+00 | grad norm: 0.893 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1260/  128728 | consumed samples:        20160 | consumed tokens:     41287680 | elapsed time per iteration (s): 15.25 | learning rate: 6.606E-06 | global batch size:    16 | lm loss: 7.106571E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1261/  128728 | consumed samples:        20176 | consumed tokens:     41320448 | elapsed time per iteration (s): 15.26 | learning rate: 6.611E-06 | global batch size:    16 | lm loss: 7.189216E+00 | grad norm: 1.126 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1262/  128728 | consumed samples:        20192 | consumed tokens:     41353216 | elapsed time per iteration (s): 15.21 | learning rate: 6.617E-06 | global batch size:    16 | lm loss: 7.057990E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1263/  128728 | consumed samples:        20208 | consumed tokens:     41385984 | elapsed time per iteration (s): 15.23 | learning rate: 6.622E-06 | global batch size:    16 | lm loss: 7.032105E+00 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1264/  128728 | consumed samples:        20224 | consumed tokens:     41418752 | elapsed time per iteration (s): 15.24 | learning rate: 6.627E-06 | global batch size:    16 | lm loss: 7.253157E+00 | grad norm: 1.383 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1265/  128728 | consumed samples:        20240 | consumed tokens:     41451520 | elapsed time per iteration (s): 15.25 | learning rate: 6.632E-06 | global batch size:    16 | lm loss: 6.969168E+00 | grad norm: 1.025 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1266/  128728 | consumed samples:        20256 | consumed tokens:     41484288 | elapsed time per iteration (s): 15.23 | learning rate: 6.638E-06 | global batch size:    16 | lm loss: 7.029213E+00 | grad norm: 0.997 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1267/  128728 | consumed samples:        20272 | consumed tokens:     41517056 | elapsed time per iteration (s): 15.27 | learning rate: 6.643E-06 | global batch size:    16 | lm loss: 7.001297E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1268/  128728 | consumed samples:        20288 | consumed tokens:     41549824 | elapsed time per iteration (s): 15.25 | learning rate: 6.648E-06 | global batch size:    16 | lm loss: 7.099933E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1269/  128728 | consumed samples:        20304 | consumed tokens:     41582592 | elapsed time per iteration (s): 15.28 | learning rate: 6.653E-06 | global batch size:    16 | lm loss: 6.980592E+00 | grad norm: 1.030 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration     1270/  128728 | consumed samples:        20320 | consumed tokens:     41615360 | elapsed time per iteration (s): 15.22 | learning rate: 6.658E-06 | global batch size:    16 | lm loss: 7.266665E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1271/  128728 | consumed samples:        20336 | consumed tokens:     41648128 | elapsed time per iteration (s): 15.24 | learning rate: 6.664E-06 | global batch size:    16 | lm loss: 6.836196E+00 | grad norm: 0.993 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1272/  128728 | consumed samples:        20352 | consumed tokens:     41680896 | elapsed time per iteration (s): 15.23 | learning rate: 6.669E-06 | global batch size:    16 | lm loss: 7.449065E+00 | grad norm: 1.189 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1273/  128728 | consumed samples:        20368 | consumed tokens:     41713664 | elapsed time per iteration (s): 15.24 | learning rate: 6.674E-06 | global batch size:    16 | lm loss: 7.271956E+00 | grad norm: 1.244 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1274/  128728 | consumed samples:        20384 | consumed tokens:     41746432 | elapsed time per iteration (s): 15.24 | learning rate: 6.679E-06 | global batch size:    16 | lm loss: 7.223175E+00 | grad norm: 1.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1275/  128728 | consumed samples:        20400 | consumed tokens:     41779200 | elapsed time per iteration (s): 15.25 | learning rate: 6.685E-06 | global batch size:    16 | lm loss: 7.255591E+00 | grad norm: 1.267 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1276/  128728 | consumed samples:        20416 | consumed tokens:     41811968 | elapsed time per iteration (s): 15.24 | learning rate: 6.690E-06 | global batch size:    16 | lm loss: 7.017190E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1277/  128728 | consumed samples:        20432 | consumed tokens:     41844736 | elapsed time per iteration (s): 15.23 | learning rate: 6.695E-06 | global batch size:    16 | lm loss: 7.104808E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1278/  128728 | consumed samples:        20448 | consumed tokens:     41877504 | elapsed time per iteration (s): 15.23 | learning rate: 6.700E-06 | global batch size:    16 | lm loss: 7.052327E+00 | grad norm: 1.028 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1279/  128728 | consumed samples:        20464 | consumed tokens:     41910272 | elapsed time per iteration (s): 15.24 | learning rate: 6.706E-06 | global batch size:    16 | lm loss: 7.316154E+00 | grad norm: 1.347 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1280/  128728 | consumed samples:        20480 | consumed tokens:     41943040 | elapsed time per iteration (s): 15.25 | learning rate: 6.711E-06 | global batch size:    16 | lm loss: 7.109064E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1281/  128728 | consumed samples:        20496 | consumed tokens:     41975808 | elapsed time per iteration (s): 15.18 | learning rate: 6.716E-06 | global batch size:    16 | lm loss: 7.014742E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1282/  128728 | consumed samples:        20512 | consumed tokens:     42008576 | elapsed time per iteration (s): 15.24 | learning rate: 6.721E-06 | global batch size:    16 | lm loss: 7.076769E+00 | grad norm: 1.098 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1283/  128728 | consumed samples:        20528 | consumed tokens:     42041344 | elapsed time per iteration (s): 15.26 | learning rate: 6.727E-06 | global batch size:    16 | lm loss: 7.277905E+00 | grad norm: 1.161 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1284/  128728 | consumed samples:        20544 | consumed tokens:     42074112 | elapsed time per iteration (s): 15.24 | learning rate: 6.732E-06 | global batch size:    16 | lm loss: 7.167206E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1285/  128728 | consumed samples:        20560 | consumed tokens:     42106880 | elapsed time per iteration (s): 15.26 | learning rate: 6.737E-06 | global batch size:    16 | lm loss: 6.924407E+00 | grad norm: 3.384 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1286/  128728 | consumed samples:        20576 | consumed tokens:     42139648 | elapsed time per iteration (s): 15.25 | learning rate: 6.742E-06 | global batch size:    16 | lm loss: 7.007799E+00 | grad norm: 1.172 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1287/  128728 | consumed samples:        20592 | consumed tokens:     42172416 | elapsed time per iteration (s): 15.25 | learning rate: 6.748E-06 | global batch size:    16 | lm loss: 6.977216E+00 | grad norm: 1.107 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1288/  128728 | consumed samples:        20608 | consumed tokens:     42205184 | elapsed time per iteration (s): 15.20 | learning rate: 6.753E-06 | global batch size:    16 | lm loss: 6.887696E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1289/  128728 | consumed samples:        20624 | consumed tokens:     42237952 | elapsed time per iteration (s): 15.24 | learning rate: 6.758E-06 | global batch size:    16 | lm loss: 7.158238E+00 | grad norm: 1.292 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1290/  128728 | consumed samples:        20640 | consumed tokens:     42270720 | elapsed time per iteration (s): 15.26 | learning rate: 6.763E-06 | global batch size:    16 | lm loss: 7.162902E+00 | grad norm: 1.583 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1291/  128728 | consumed samples:        20656 | consumed tokens:     42303488 | elapsed time per iteration (s): 15.21 | learning rate: 6.769E-06 | global batch size:    16 | lm loss: 7.018879E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1292/  128728 | consumed samples:        20672 | consumed tokens:     42336256 | elapsed time per iteration (s): 15.23 | learning rate: 6.774E-06 | global batch size:    16 | lm loss: 6.894781E+00 | grad norm: 1.090 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1293/  128728 | consumed samples:        20688 | consumed tokens:     42369024 | elapsed time per iteration (s): 15.25 | learning rate: 6.779E-06 | global batch size:    16 | lm loss: 6.989740E+00 | grad norm: 1.050 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1294/  128728 | consumed samples:        20704 | consumed tokens:     42401792 | elapsed time per iteration (s): 15.22 | learning rate: 6.784E-06 | global batch size:    16 | lm loss: 7.075770E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1295/  128728 | consumed samples:        20720 | consumed tokens:     42434560 | elapsed time per iteration (s): 15.22 | learning rate: 6.790E-06 | global batch size:    16 | lm loss: 7.155486E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1296/  128728 | consumed samples:        20736 | consumed tokens:     42467328 | elapsed time per iteration (s): 15.22 | learning rate: 6.795E-06 | global batch size:    16 | lm loss: 7.141552E+00 | grad norm: 0.922 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1297/  128728 | consumed samples:        20752 | consumed tokens:     42500096 | elapsed time per iteration (s): 15.24 | learning rate: 6.800E-06 | global batch size:    16 | lm loss: 7.286495E+00 | grad norm: 1.145 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1298/  128728 | consumed samples:        20768 | consumed tokens:     42532864 | elapsed time per iteration (s): 15.23 | learning rate: 6.805E-06 | global batch size:    16 | lm loss: 7.142656E+00 | grad norm: 1.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1299/  128728 | consumed samples:        20784 | consumed tokens:     42565632 | elapsed time per iteration (s): 15.23 | learning rate: 6.811E-06 | global batch size:    16 | lm loss: 6.876920E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1300/  128728 | consumed samples:        20800 | consumed tokens:     42598400 | elapsed time per iteration (s): 15.22 | learning rate: 6.816E-06 | global batch size:    16 | lm loss: 6.969202E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1301/  128728 | consumed samples:        20816 | consumed tokens:     42631168 | elapsed time per iteration (s): 15.25 | learning rate: 6.821E-06 | global batch size:    16 | lm loss: 7.109032E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1302/  128728 | consumed samples:        20832 | consumed tokens:     42663936 | elapsed time per iteration (s): 15.22 | learning rate: 6.826E-06 | global batch size:    16 | lm loss: 6.858071E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1303/  128728 | consumed samples:        20848 | consumed tokens:     42696704 | elapsed time per iteration (s): 15.20 | learning rate: 6.831E-06 | global batch size:    16 | lm loss: 6.878172E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1304/  128728 | consumed samples:        20864 | consumed tokens:     42729472 | elapsed time per iteration (s): 15.25 | learning rate: 6.837E-06 | global batch size:    16 | lm loss: 6.795415E+00 | grad norm: 0.981 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1305/  128728 | consumed samples:        20880 | consumed tokens:     42762240 | elapsed time per iteration (s): 15.25 | learning rate: 6.842E-06 | global batch size:    16 | lm loss: 7.055003E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1306/  128728 | consumed samples:        20896 | consumed tokens:     42795008 | elapsed time per iteration (s): 15.23 | learning rate: 6.847E-06 | global batch size:    16 | lm loss: 7.091806E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1307/  128728 | consumed samples:        20912 | consumed tokens:     42827776 | elapsed time per iteration (s): 15.22 | learning rate: 6.852E-06 | global batch size:    16 | lm loss: 7.148190E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1308/  128728 | consumed samples:        20928 | consumed tokens:     42860544 | elapsed time per iteration (s): 15.22 | learning rate: 6.858E-06 | global batch size:    16 | lm loss: 7.025421E+00 | grad norm: 0.972 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1309/  128728 | consumed samples:        20944 | consumed tokens:     42893312 | elapsed time per iteration (s): 15.23 | learning rate: 6.863E-06 | global batch size:    16 | lm loss: 6.918103E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1310/  128728 | consumed samples:        20960 | consumed tokens:     42926080 | elapsed time per iteration (s): 15.25 | learning rate: 6.868E-06 | global batch size:    16 | lm loss: 7.181193E+00 | grad norm: 1.143 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1311/  128728 | consumed samples:        20976 | consumed tokens:     42958848 | elapsed time per iteration (s): 15.23 | learning rate: 6.873E-06 | global batch size:    16 | lm loss: 6.954082E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1312/  128728 | consumed samples:        20992 | consumed tokens:     42991616 | elapsed time per iteration (s): 15.22 | learning rate: 6.879E-06 | global batch size:    16 | lm loss: 7.259136E+00 | grad norm: 1.120 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1313/  128728 | consumed samples:        21008 | consumed tokens:     43024384 | elapsed time per iteration (s): 15.25 | learning rate: 6.884E-06 | global batch size:    16 | lm loss: 7.125967E+00 | grad norm: 1.046 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     1314/  128728 | consumed samples:        21024 | consumed tokens:     43057152 | elapsed time per iteration (s): 15.22 | learning rate: 6.889E-06 | global batch size:    16 | lm loss: 6.829364E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1315/  128728 | consumed samples:        21040 | consumed tokens:     43089920 | elapsed time per iteration (s): 15.26 | learning rate: 6.894E-06 | global batch size:    16 | lm loss: 6.958238E+00 | grad norm: 1.957 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1316/  128728 | consumed samples:        21056 | consumed tokens:     43122688 | elapsed time per iteration (s): 15.24 | learning rate: 6.900E-06 | global batch size:    16 | lm loss: 7.172208E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1317/  128728 | consumed samples:        21072 | consumed tokens:     43155456 | elapsed time per iteration (s): 15.22 | learning rate: 6.905E-06 | global batch size:    16 | lm loss: 7.113717E+00 | grad norm: 0.922 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1318/  128728 | consumed samples:        21088 | consumed tokens:     43188224 | elapsed time per iteration (s): 15.23 | learning rate: 6.910E-06 | global batch size:    16 | lm loss: 6.939925E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1319/  128728 | consumed samples:        21104 | consumed tokens:     43220992 | elapsed time per iteration (s): 15.21 | learning rate: 6.915E-06 | global batch size:    16 | lm loss: 7.170483E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1320/  128728 | consumed samples:        21120 | consumed tokens:     43253760 | elapsed time per iteration (s): 15.21 | learning rate: 6.921E-06 | global batch size:    16 | lm loss: 6.819268E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1321/  128728 | consumed samples:        21136 | consumed tokens:     43286528 | elapsed time per iteration (s): 15.23 | learning rate: 6.926E-06 | global batch size:    16 | lm loss: 6.952059E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1322/  128728 | consumed samples:        21152 | consumed tokens:     43319296 | elapsed time per iteration (s): 15.20 | learning rate: 6.931E-06 | global batch size:    16 | lm loss: 7.053830E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1323/  128728 | consumed samples:        21168 | consumed tokens:     43352064 | elapsed time per iteration (s): 15.20 | learning rate: 6.936E-06 | global batch size:    16 | lm loss: 6.946447E+00 | grad norm: 0.967 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1324/  128728 | consumed samples:        21184 | consumed tokens:     43384832 | elapsed time per iteration (s): 15.23 | learning rate: 6.942E-06 | global batch size:    16 | lm loss: 7.168796E+00 | grad norm: 1.158 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1325/  128728 | consumed samples:        21200 | consumed tokens:     43417600 | elapsed time per iteration (s): 15.21 | learning rate: 6.947E-06 | global batch size:    16 | lm loss: 6.921078E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1326/  128728 | consumed samples:        21216 | consumed tokens:     43450368 | elapsed time per iteration (s): 15.24 | learning rate: 6.952E-06 | global batch size:    16 | lm loss: 7.017085E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1327/  128728 | consumed samples:        21232 | consumed tokens:     43483136 | elapsed time per iteration (s): 15.20 | learning rate: 6.957E-06 | global batch size:    16 | lm loss: 7.202669E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1328/  128728 | consumed samples:        21248 | consumed tokens:     43515904 | elapsed time per iteration (s): 15.29 | learning rate: 6.963E-06 | global batch size:    16 | lm loss: 7.048963E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     1329/  128728 | consumed samples:        21264 | consumed tokens:     43548672 | elapsed time per iteration (s): 15.23 | learning rate: 6.968E-06 | global batch size:    16 | lm loss: 6.967897E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1330/  128728 | consumed samples:        21280 | consumed tokens:     43581440 | elapsed time per iteration (s): 15.26 | learning rate: 6.973E-06 | global batch size:    16 | lm loss: 7.000623E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1331/  128728 | consumed samples:        21296 | consumed tokens:     43614208 | elapsed time per iteration (s): 15.25 | learning rate: 6.978E-06 | global batch size:    16 | lm loss: 7.296478E+00 | grad norm: 1.068 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1332/  128728 | consumed samples:        21312 | consumed tokens:     43646976 | elapsed time per iteration (s): 15.23 | learning rate: 6.984E-06 | global batch size:    16 | lm loss: 7.027101E+00 | grad norm: 1.250 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1333/  128728 | consumed samples:        21328 | consumed tokens:     43679744 | elapsed time per iteration (s): 15.21 | learning rate: 6.989E-06 | global batch size:    16 | lm loss: 7.019495E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1334/  128728 | consumed samples:        21344 | consumed tokens:     43712512 | elapsed time per iteration (s): 15.23 | learning rate: 6.994E-06 | global batch size:    16 | lm loss: 6.855921E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1335/  128728 | consumed samples:        21360 | consumed tokens:     43745280 | elapsed time per iteration (s): 15.24 | learning rate: 6.999E-06 | global batch size:    16 | lm loss: 7.009941E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1336/  128728 | consumed samples:        21376 | consumed tokens:     43778048 | elapsed time per iteration (s): 15.26 | learning rate: 7.005E-06 | global batch size:    16 | lm loss: 6.913696E+00 | grad norm: 1.115 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1337/  128728 | consumed samples:        21392 | consumed tokens:     43810816 | elapsed time per iteration (s): 15.23 | learning rate: 7.010E-06 | global batch size:    16 | lm loss: 7.002808E+00 | grad norm: 1.135 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1338/  128728 | consumed samples:        21408 | consumed tokens:     43843584 | elapsed time per iteration (s): 15.26 | learning rate: 7.015E-06 | global batch size:    16 | lm loss: 7.006137E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1339/  128728 | consumed samples:        21424 | consumed tokens:     43876352 | elapsed time per iteration (s): 15.26 | learning rate: 7.020E-06 | global batch size:    16 | lm loss: 6.981978E+00 | grad norm: 1.087 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1340/  128728 | consumed samples:        21440 | consumed tokens:     43909120 | elapsed time per iteration (s): 15.25 | learning rate: 7.025E-06 | global batch size:    16 | lm loss: 7.002084E+00 | grad norm: 8.999 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1341/  128728 | consumed samples:        21456 | consumed tokens:     43941888 | elapsed time per iteration (s): 15.24 | learning rate: 7.031E-06 | global batch size:    16 | lm loss: 7.157794E+00 | grad norm: 1.076 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1342/  128728 | consumed samples:        21472 | consumed tokens:     43974656 | elapsed time per iteration (s): 15.23 | learning rate: 7.036E-06 | global batch size:    16 | lm loss: 6.872018E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1343/  128728 | consumed samples:        21488 | consumed tokens:     44007424 | elapsed time per iteration (s): 15.21 | learning rate: 7.041E-06 | global batch size:    16 | lm loss: 6.791720E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1344/  128728 | consumed samples:        21504 | consumed tokens:     44040192 | elapsed time per iteration (s): 15.25 | learning rate: 7.046E-06 | global batch size:    16 | lm loss: 6.878177E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1345/  128728 | consumed samples:        21520 | consumed tokens:     44072960 | elapsed time per iteration (s): 15.24 | learning rate: 7.052E-06 | global batch size:    16 | lm loss: 6.884387E+00 | grad norm: 0.992 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1346/  128728 | consumed samples:        21536 | consumed tokens:     44105728 | elapsed time per iteration (s): 15.24 | learning rate: 7.057E-06 | global batch size:    16 | lm loss: 6.997211E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1347/  128728 | consumed samples:        21552 | consumed tokens:     44138496 | elapsed time per iteration (s): 15.26 | learning rate: 7.062E-06 | global batch size:    16 | lm loss: 7.032449E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1348/  128728 | consumed samples:        21568 | consumed tokens:     44171264 | elapsed time per iteration (s): 15.23 | learning rate: 7.067E-06 | global batch size:    16 | lm loss: 7.008165E+00 | grad norm: 0.936 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1349/  128728 | consumed samples:        21584 | consumed tokens:     44204032 | elapsed time per iteration (s): 15.21 | learning rate: 7.073E-06 | global batch size:    16 | lm loss: 7.024583E+00 | grad norm: 0.965 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1350/  128728 | consumed samples:        21600 | consumed tokens:     44236800 | elapsed time per iteration (s): 15.27 | learning rate: 7.078E-06 | global batch size:    16 | lm loss: 6.845006E+00 | grad norm: 2.138 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1351/  128728 | consumed samples:        21616 | consumed tokens:     44269568 | elapsed time per iteration (s): 15.24 | learning rate: 7.083E-06 | global batch size:    16 | lm loss: 6.779938E+00 | grad norm: 1.443 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1352/  128728 | consumed samples:        21632 | consumed tokens:     44302336 | elapsed time per iteration (s): 15.26 | learning rate: 7.088E-06 | global batch size:    16 | lm loss: 6.868844E+00 | grad norm: 1.217 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1353/  128728 | consumed samples:        21648 | consumed tokens:     44335104 | elapsed time per iteration (s): 15.26 | learning rate: 7.094E-06 | global batch size:    16 | lm loss: 7.071971E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1354/  128728 | consumed samples:        21664 | consumed tokens:     44367872 | elapsed time per iteration (s): 15.24 | learning rate: 7.099E-06 | global batch size:    16 | lm loss: 6.839797E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1355/  128728 | consumed samples:        21680 | consumed tokens:     44400640 | elapsed time per iteration (s): 15.20 | learning rate: 7.104E-06 | global batch size:    16 | lm loss: 6.854629E+00 | grad norm: 1.044 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1356/  128728 | consumed samples:        21696 | consumed tokens:     44433408 | elapsed time per iteration (s): 15.28 | learning rate: 7.109E-06 | global batch size:    16 | lm loss: 6.967502E+00 | grad norm: 1.278 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1357/  128728 | consumed samples:        21712 | consumed tokens:     44466176 | elapsed time per iteration (s): 15.24 | learning rate: 7.115E-06 | global batch size:    16 | lm loss: 7.005933E+00 | grad norm: 1.271 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1358/  128728 | consumed samples:        21728 | consumed tokens:     44498944 | elapsed time per iteration (s): 15.26 | learning rate: 7.120E-06 | global batch size:    16 | lm loss: 7.089840E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1359/  128728 | consumed samples:        21744 | consumed tokens:     44531712 | elapsed time per iteration (s): 15.24 | learning rate: 7.125E-06 | global batch size:    16 | lm loss: 6.807289E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1360/  128728 | consumed samples:        21760 | consumed tokens:     44564480 | elapsed time per iteration (s): 15.27 | learning rate: 7.130E-06 | global batch size:    16 | lm loss: 6.980482E+00 | grad norm: 1.172 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1361/  128728 | consumed samples:        21776 | consumed tokens:     44597248 | elapsed time per iteration (s): 15.21 | learning rate: 7.136E-06 | global batch size:    16 | lm loss: 6.865876E+00 | grad norm: 0.911 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1362/  128728 | consumed samples:        21792 | consumed tokens:     44630016 | elapsed time per iteration (s): 15.25 | learning rate: 7.141E-06 | global batch size:    16 | lm loss: 6.621922E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1363/  128728 | consumed samples:        21808 | consumed tokens:     44662784 | elapsed time per iteration (s): 15.26 | learning rate: 7.146E-06 | global batch size:    16 | lm loss: 6.988260E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1364/  128728 | consumed samples:        21824 | consumed tokens:     44695552 | elapsed time per iteration (s): 15.25 | learning rate: 7.151E-06 | global batch size:    16 | lm loss: 7.108578E+00 | grad norm: 1.353 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1365/  128728 | consumed samples:        21840 | consumed tokens:     44728320 | elapsed time per iteration (s): 15.25 | learning rate: 7.157E-06 | global batch size:    16 | lm loss: 6.960870E+00 | grad norm: 1.106 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1366/  128728 | consumed samples:        21856 | consumed tokens:     44761088 | elapsed time per iteration (s): 15.25 | learning rate: 7.162E-06 | global batch size:    16 | lm loss: 7.074971E+00 | grad norm: 1.118 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1367/  128728 | consumed samples:        21872 | consumed tokens:     44793856 | elapsed time per iteration (s): 15.25 | learning rate: 7.167E-06 | global batch size:    16 | lm loss: 6.846851E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1368/  128728 | consumed samples:        21888 | consumed tokens:     44826624 | elapsed time per iteration (s): 15.21 | learning rate: 7.172E-06 | global batch size:    16 | lm loss: 7.031826E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1369/  128728 | consumed samples:        21904 | consumed tokens:     44859392 | elapsed time per iteration (s): 15.26 | learning rate: 7.178E-06 | global batch size:    16 | lm loss: 6.957930E+00 | grad norm: 1.013 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1370/  128728 | consumed samples:        21920 | consumed tokens:     44892160 | elapsed time per iteration (s): 15.23 | learning rate: 7.183E-06 | global batch size:    16 | lm loss: 6.889624E+00 | grad norm: 0.884 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1371/  128728 | consumed samples:        21936 | consumed tokens:     44924928 | elapsed time per iteration (s): 15.22 | learning rate: 7.188E-06 | global batch size:    16 | lm loss: 6.951301E+00 | grad norm: 0.891 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1372/  128728 | consumed samples:        21952 | consumed tokens:     44957696 | elapsed time per iteration (s): 15.25 | learning rate: 7.193E-06 | global batch size:    16 | lm loss: 7.280240E+00 | grad norm: 1.485 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1373/  128728 | consumed samples:        21968 | consumed tokens:     44990464 | elapsed time per iteration (s): 15.15 | learning rate: 7.198E-06 | global batch size:    16 | lm loss: 7.068165E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     1374/  128728 | consumed samples:        21984 | consumed tokens:     45023232 | elapsed time per iteration (s): 15.20 | learning rate: 7.204E-06 | global batch size:    16 | lm loss: 6.842229E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1375/  128728 | consumed samples:        22000 | consumed tokens:     45056000 | elapsed time per iteration (s): 15.19 | learning rate: 7.209E-06 | global batch size:    16 | lm loss: 6.986506E+00 | grad norm: 0.884 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     1376/  128728 | consumed samples:        22016 | consumed tokens:     45088768 | elapsed time per iteration (s): 15.20 | learning rate: 7.214E-06 | global batch size:    16 | lm loss: 6.987074E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1377/  128728 | consumed samples:        22032 | consumed tokens:     45121536 | elapsed time per iteration (s): 15.23 | learning rate: 7.219E-06 | global batch size:    16 | lm loss: 6.934793E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1378/  128728 | consumed samples:        22048 | consumed tokens:     45154304 | elapsed time per iteration (s): 15.25 | learning rate: 7.225E-06 | global batch size:    16 | lm loss: 7.082214E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1379/  128728 | consumed samples:        22064 | consumed tokens:     45187072 | elapsed time per iteration (s): 15.22 | learning rate: 7.230E-06 | global batch size:    16 | lm loss: 6.853665E+00 | grad norm: 0.877 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1380/  128728 | consumed samples:        22080 | consumed tokens:     45219840 | elapsed time per iteration (s): 15.20 | learning rate: 7.235E-06 | global batch size:    16 | lm loss: 7.111278E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1381/  128728 | consumed samples:        22096 | consumed tokens:     45252608 | elapsed time per iteration (s): 15.24 | learning rate: 7.240E-06 | global batch size:    16 | lm loss: 6.896193E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1382/  128728 | consumed samples:        22112 | consumed tokens:     45285376 | elapsed time per iteration (s): 15.21 | learning rate: 7.246E-06 | global batch size:    16 | lm loss: 6.947161E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1383/  128728 | consumed samples:        22128 | consumed tokens:     45318144 | elapsed time per iteration (s): 15.23 | learning rate: 7.251E-06 | global batch size:    16 | lm loss: 7.008558E+00 | grad norm: 0.914 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1384/  128728 | consumed samples:        22144 | consumed tokens:     45350912 | elapsed time per iteration (s): 15.22 | learning rate: 7.256E-06 | global batch size:    16 | lm loss: 6.837069E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1385/  128728 | consumed samples:        22160 | consumed tokens:     45383680 | elapsed time per iteration (s): 15.23 | learning rate: 7.261E-06 | global batch size:    16 | lm loss: 6.870586E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1386/  128728 | consumed samples:        22176 | consumed tokens:     45416448 | elapsed time per iteration (s): 15.25 | learning rate: 7.267E-06 | global batch size:    16 | lm loss: 6.940043E+00 | grad norm: 1.014 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1387/  128728 | consumed samples:        22192 | consumed tokens:     45449216 | elapsed time per iteration (s): 15.24 | learning rate: 7.272E-06 | global batch size:    16 | lm loss: 6.792444E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1388/  128728 | consumed samples:        22208 | consumed tokens:     45481984 | elapsed time per iteration (s): 15.22 | learning rate: 7.277E-06 | global batch size:    16 | lm loss: 6.868528E+00 | grad norm: 0.913 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1389/  128728 | consumed samples:        22224 | consumed tokens:     45514752 | elapsed time per iteration (s): 15.23 | learning rate: 7.282E-06 | global batch size:    16 | lm loss: 6.799677E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1390/  128728 | consumed samples:        22240 | consumed tokens:     45547520 | elapsed time per iteration (s): 15.24 | learning rate: 7.288E-06 | global batch size:    16 | lm loss: 7.063715E+00 | grad norm: 1.114 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1391/  128728 | consumed samples:        22256 | consumed tokens:     45580288 | elapsed time per iteration (s): 15.24 | learning rate: 7.293E-06 | global batch size:    16 | lm loss: 7.192670E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1392/  128728 | consumed samples:        22272 | consumed tokens:     45613056 | elapsed time per iteration (s): 15.24 | learning rate: 7.298E-06 | global batch size:    16 | lm loss: 6.923073E+00 | grad norm: 1.036 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1393/  128728 | consumed samples:        22288 | consumed tokens:     45645824 | elapsed time per iteration (s): 15.20 | learning rate: 7.303E-06 | global batch size:    16 | lm loss: 7.126211E+00 | grad norm: 0.952 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1394/  128728 | consumed samples:        22304 | consumed tokens:     45678592 | elapsed time per iteration (s): 15.22 | learning rate: 7.309E-06 | global batch size:    16 | lm loss: 6.779180E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1395/  128728 | consumed samples:        22320 | consumed tokens:     45711360 | elapsed time per iteration (s): 15.20 | learning rate: 7.314E-06 | global batch size:    16 | lm loss: 6.857265E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1396/  128728 | consumed samples:        22336 | consumed tokens:     45744128 | elapsed time per iteration (s): 15.22 | learning rate: 7.319E-06 | global batch size:    16 | lm loss: 7.133854E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1397/  128728 | consumed samples:        22352 | consumed tokens:     45776896 | elapsed time per iteration (s): 15.18 | learning rate: 7.324E-06 | global batch size:    16 | lm loss: 6.703786E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1398/  128728 | consumed samples:        22368 | consumed tokens:     45809664 | elapsed time per iteration (s): 15.24 | learning rate: 7.330E-06 | global batch size:    16 | lm loss: 6.965015E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1399/  128728 | consumed samples:        22384 | consumed tokens:     45842432 | elapsed time per iteration (s): 15.26 | learning rate: 7.335E-06 | global batch size:    16 | lm loss: 7.162525E+00 | grad norm: 3.114 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1400/  128728 | consumed samples:        22400 | consumed tokens:     45875200 | elapsed time per iteration (s): 15.23 | learning rate: 7.340E-06 | global batch size:    16 | lm loss: 7.012984E+00 | grad norm: 1.054 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1401/  128728 | consumed samples:        22416 | consumed tokens:     45907968 | elapsed time per iteration (s): 15.24 | learning rate: 7.345E-06 | global batch size:    16 | lm loss: 6.882190E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1402/  128728 | consumed samples:        22432 | consumed tokens:     45940736 | elapsed time per iteration (s): 15.25 | learning rate: 7.351E-06 | global batch size:    16 | lm loss: 6.990128E+00 | grad norm: 1.141 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1403/  128728 | consumed samples:        22448 | consumed tokens:     45973504 | elapsed time per iteration (s): 15.25 | learning rate: 7.356E-06 | global batch size:    16 | lm loss: 6.974439E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1404/  128728 | consumed samples:        22464 | consumed tokens:     46006272 | elapsed time per iteration (s): 15.31 | learning rate: 7.361E-06 | global batch size:    16 | lm loss: 6.965978E+00 | grad norm: 1.030 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.045 | TFLOPs: 8.00 |
[default7]: iteration     1405/  128728 | consumed samples:        22480 | consumed tokens:     46039040 | elapsed time per iteration (s): 15.24 | learning rate: 7.366E-06 | global batch size:    16 | lm loss: 6.979227E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1406/  128728 | consumed samples:        22496 | consumed tokens:     46071808 | elapsed time per iteration (s): 15.23 | learning rate: 7.372E-06 | global batch size:    16 | lm loss: 6.995125E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1407/  128728 | consumed samples:        22512 | consumed tokens:     46104576 | elapsed time per iteration (s): 15.29 | learning rate: 7.377E-06 | global batch size:    16 | lm loss: 7.185478E+00 | grad norm: 1.070 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     1408/  128728 | consumed samples:        22528 | consumed tokens:     46137344 | elapsed time per iteration (s): 15.24 | learning rate: 7.382E-06 | global batch size:    16 | lm loss: 6.939486E+00 | grad norm: 1.078 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1409/  128728 | consumed samples:        22544 | consumed tokens:     46170112 | elapsed time per iteration (s): 15.23 | learning rate: 7.387E-06 | global batch size:    16 | lm loss: 6.841136E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1410/  128728 | consumed samples:        22560 | consumed tokens:     46202880 | elapsed time per iteration (s): 15.24 | learning rate: 7.392E-06 | global batch size:    16 | lm loss: 6.966883E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1411/  128728 | consumed samples:        22576 | consumed tokens:     46235648 | elapsed time per iteration (s): 15.24 | learning rate: 7.398E-06 | global batch size:    16 | lm loss: 6.978424E+00 | grad norm: 1.322 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1412/  128728 | consumed samples:        22592 | consumed tokens:     46268416 | elapsed time per iteration (s): 15.27 | learning rate: 7.403E-06 | global batch size:    16 | lm loss: 6.881705E+00 | grad norm: 1.066 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1413/  128728 | consumed samples:        22608 | consumed tokens:     46301184 | elapsed time per iteration (s): 15.22 | learning rate: 7.408E-06 | global batch size:    16 | lm loss: 6.892154E+00 | grad norm: 0.994 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1414/  128728 | consumed samples:        22624 | consumed tokens:     46333952 | elapsed time per iteration (s): 15.23 | learning rate: 7.413E-06 | global batch size:    16 | lm loss: 6.848379E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1415/  128728 | consumed samples:        22640 | consumed tokens:     46366720 | elapsed time per iteration (s): 15.22 | learning rate: 7.419E-06 | global batch size:    16 | lm loss: 6.779110E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1416/  128728 | consumed samples:        22656 | consumed tokens:     46399488 | elapsed time per iteration (s): 15.21 | learning rate: 7.424E-06 | global batch size:    16 | lm loss: 7.056311E+00 | grad norm: 1.157 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1417/  128728 | consumed samples:        22672 | consumed tokens:     46432256 | elapsed time per iteration (s): 15.23 | learning rate: 7.429E-06 | global batch size:    16 | lm loss: 6.982561E+00 | grad norm: 1.346 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1418/  128728 | consumed samples:        22688 | consumed tokens:     46465024 | elapsed time per iteration (s): 15.22 | learning rate: 7.434E-06 | global batch size:    16 | lm loss: 6.817053E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1419/  128728 | consumed samples:        22704 | consumed tokens:     46497792 | elapsed time per iteration (s): 15.21 | learning rate: 7.440E-06 | global batch size:    16 | lm loss: 6.851241E+00 | grad norm: 0.980 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1420/  128728 | consumed samples:        22720 | consumed tokens:     46530560 | elapsed time per iteration (s): 15.21 | learning rate: 7.445E-06 | global batch size:    16 | lm loss: 7.001087E+00 | grad norm: 1.378 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1421/  128728 | consumed samples:        22736 | consumed tokens:     46563328 | elapsed time per iteration (s): 15.25 | learning rate: 7.450E-06 | global batch size:    16 | lm loss: 6.835620E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1422/  128728 | consumed samples:        22752 | consumed tokens:     46596096 | elapsed time per iteration (s): 15.26 | learning rate: 7.455E-06 | global batch size:    16 | lm loss: 7.090675E+00 | grad norm: 1.311 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1423/  128728 | consumed samples:        22768 | consumed tokens:     46628864 | elapsed time per iteration (s): 15.26 | learning rate: 7.461E-06 | global batch size:    16 | lm loss: 6.860411E+00 | grad norm: 1.151 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1424/  128728 | consumed samples:        22784 | consumed tokens:     46661632 | elapsed time per iteration (s): 15.29 | learning rate: 7.466E-06 | global batch size:    16 | lm loss: 6.869511E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration     1425/  128728 | consumed samples:        22800 | consumed tokens:     46694400 | elapsed time per iteration (s): 15.24 | learning rate: 7.471E-06 | global batch size:    16 | lm loss: 6.994910E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1426/  128728 | consumed samples:        22816 | consumed tokens:     46727168 | elapsed time per iteration (s): 15.27 | learning rate: 7.476E-06 | global batch size:    16 | lm loss: 7.125865E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1427/  128728 | consumed samples:        22832 | consumed tokens:     46759936 | elapsed time per iteration (s): 15.22 | learning rate: 7.482E-06 | global batch size:    16 | lm loss: 7.039857E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1428/  128728 | consumed samples:        22848 | consumed tokens:     46792704 | elapsed time per iteration (s): 15.24 | learning rate: 7.487E-06 | global batch size:    16 | lm loss: 6.739178E+00 | grad norm: 1.490 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1429/  128728 | consumed samples:        22864 | consumed tokens:     46825472 | elapsed time per iteration (s): 15.23 | learning rate: 7.492E-06 | global batch size:    16 | lm loss: 6.688814E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1430/  128728 | consumed samples:        22880 | consumed tokens:     46858240 | elapsed time per iteration (s): 15.24 | learning rate: 7.497E-06 | global batch size:    16 | lm loss: 6.988650E+00 | grad norm: 0.972 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1431/  128728 | consumed samples:        22896 | consumed tokens:     46891008 | elapsed time per iteration (s): 15.20 | learning rate: 7.503E-06 | global batch size:    16 | lm loss: 7.054319E+00 | grad norm: 0.973 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1432/  128728 | consumed samples:        22912 | consumed tokens:     46923776 | elapsed time per iteration (s): 15.24 | learning rate: 7.508E-06 | global batch size:    16 | lm loss: 6.973002E+00 | grad norm: 1.011 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1433/  128728 | consumed samples:        22928 | consumed tokens:     46956544 | elapsed time per iteration (s): 15.31 | learning rate: 7.513E-06 | global batch size:    16 | lm loss: 6.700475E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.045 | TFLOPs: 8.00 |
[default7]: iteration     1434/  128728 | consumed samples:        22944 | consumed tokens:     46989312 | elapsed time per iteration (s): 15.29 | learning rate: 7.518E-06 | global batch size:    16 | lm loss: 7.003654E+00 | grad norm: 1.084 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     1435/  128728 | consumed samples:        22960 | consumed tokens:     47022080 | elapsed time per iteration (s): 15.22 | learning rate: 7.524E-06 | global batch size:    16 | lm loss: 6.904319E+00 | grad norm: 0.940 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1436/  128728 | consumed samples:        22976 | consumed tokens:     47054848 | elapsed time per iteration (s): 15.19 | learning rate: 7.529E-06 | global batch size:    16 | lm loss: 6.922503E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1437/  128728 | consumed samples:        22992 | consumed tokens:     47087616 | elapsed time per iteration (s): 15.21 | learning rate: 7.534E-06 | global batch size:    16 | lm loss: 6.798236E+00 | grad norm: 0.865 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1438/  128728 | consumed samples:        23008 | consumed tokens:     47120384 | elapsed time per iteration (s): 15.23 | learning rate: 7.539E-06 | global batch size:    16 | lm loss: 6.820006E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1439/  128728 | consumed samples:        23024 | consumed tokens:     47153152 | elapsed time per iteration (s): 15.27 | learning rate: 7.545E-06 | global batch size:    16 | lm loss: 6.920378E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1440/  128728 | consumed samples:        23040 | consumed tokens:     47185920 | elapsed time per iteration (s): 15.27 | learning rate: 7.550E-06 | global batch size:    16 | lm loss: 6.835717E+00 | grad norm: 1.298 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1441/  128728 | consumed samples:        23056 | consumed tokens:     47218688 | elapsed time per iteration (s): 15.24 | learning rate: 7.555E-06 | global batch size:    16 | lm loss: 6.969578E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1442/  128728 | consumed samples:        23072 | consumed tokens:     47251456 | elapsed time per iteration (s): 15.22 | learning rate: 7.560E-06 | global batch size:    16 | lm loss: 6.877041E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1443/  128728 | consumed samples:        23088 | consumed tokens:     47284224 | elapsed time per iteration (s): 15.22 | learning rate: 7.565E-06 | global batch size:    16 | lm loss: 6.828847E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1444/  128728 | consumed samples:        23104 | consumed tokens:     47316992 | elapsed time per iteration (s): 15.23 | learning rate: 7.571E-06 | global batch size:    16 | lm loss: 7.017298E+00 | grad norm: 1.202 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1445/  128728 | consumed samples:        23120 | consumed tokens:     47349760 | elapsed time per iteration (s): 15.25 | learning rate: 7.576E-06 | global batch size:    16 | lm loss: 6.892804E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1446/  128728 | consumed samples:        23136 | consumed tokens:     47382528 | elapsed time per iteration (s): 15.18 | learning rate: 7.581E-06 | global batch size:    16 | lm loss: 6.857821E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1447/  128728 | consumed samples:        23152 | consumed tokens:     47415296 | elapsed time per iteration (s): 15.24 | learning rate: 7.586E-06 | global batch size:    16 | lm loss: 6.927748E+00 | grad norm: 1.025 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1448/  128728 | consumed samples:        23168 | consumed tokens:     47448064 | elapsed time per iteration (s): 15.25 | learning rate: 7.592E-06 | global batch size:    16 | lm loss: 6.929221E+00 | grad norm: 1.089 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1449/  128728 | consumed samples:        23184 | consumed tokens:     47480832 | elapsed time per iteration (s): 15.26 | learning rate: 7.597E-06 | global batch size:    16 | lm loss: 6.774077E+00 | grad norm: 1.585 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1450/  128728 | consumed samples:        23200 | consumed tokens:     47513600 | elapsed time per iteration (s): 15.27 | learning rate: 7.602E-06 | global batch size:    16 | lm loss: 6.842887E+00 | grad norm: 1.032 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1451/  128728 | consumed samples:        23216 | consumed tokens:     47546368 | elapsed time per iteration (s): 15.23 | learning rate: 7.607E-06 | global batch size:    16 | lm loss: 6.983165E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1452/  128728 | consumed samples:        23232 | consumed tokens:     47579136 | elapsed time per iteration (s): 15.21 | learning rate: 7.613E-06 | global batch size:    16 | lm loss: 6.892272E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1453/  128728 | consumed samples:        23248 | consumed tokens:     47611904 | elapsed time per iteration (s): 15.23 | learning rate: 7.618E-06 | global batch size:    16 | lm loss: 6.959459E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1454/  128728 | consumed samples:        23264 | consumed tokens:     47644672 | elapsed time per iteration (s): 15.23 | learning rate: 7.623E-06 | global batch size:    16 | lm loss: 6.613215E+00 | grad norm: 1.072 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1455/  128728 | consumed samples:        23280 | consumed tokens:     47677440 | elapsed time per iteration (s): 15.24 | learning rate: 7.628E-06 | global batch size:    16 | lm loss: 6.947182E+00 | grad norm: 0.989 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1456/  128728 | consumed samples:        23296 | consumed tokens:     47710208 | elapsed time per iteration (s): 15.24 | learning rate: 7.634E-06 | global batch size:    16 | lm loss: 6.893425E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1457/  128728 | consumed samples:        23312 | consumed tokens:     47742976 | elapsed time per iteration (s): 15.26 | learning rate: 7.639E-06 | global batch size:    16 | lm loss: 6.631948E+00 | grad norm: 0.894 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1458/  128728 | consumed samples:        23328 | consumed tokens:     47775744 | elapsed time per iteration (s): 15.22 | learning rate: 7.644E-06 | global batch size:    16 | lm loss: 7.102271E+00 | grad norm: 0.988 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1459/  128728 | consumed samples:        23344 | consumed tokens:     47808512 | elapsed time per iteration (s): 15.21 | learning rate: 7.649E-06 | global batch size:    16 | lm loss: 6.629117E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1460/  128728 | consumed samples:        23360 | consumed tokens:     47841280 | elapsed time per iteration (s): 15.25 | learning rate: 7.655E-06 | global batch size:    16 | lm loss: 6.952769E+00 | grad norm: 1.593 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1461/  128728 | consumed samples:        23376 | consumed tokens:     47874048 | elapsed time per iteration (s): 15.26 | learning rate: 7.660E-06 | global batch size:    16 | lm loss: 6.996358E+00 | grad norm: 0.966 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1462/  128728 | consumed samples:        23392 | consumed tokens:     47906816 | elapsed time per iteration (s): 15.24 | learning rate: 7.665E-06 | global batch size:    16 | lm loss: 6.833821E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1463/  128728 | consumed samples:        23408 | consumed tokens:     47939584 | elapsed time per iteration (s): 15.23 | learning rate: 7.670E-06 | global batch size:    16 | lm loss: 6.710407E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1464/  128728 | consumed samples:        23424 | consumed tokens:     47972352 | elapsed time per iteration (s): 15.16 | learning rate: 7.676E-06 | global batch size:    16 | lm loss: 6.818951E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     1465/  128728 | consumed samples:        23440 | consumed tokens:     48005120 | elapsed time per iteration (s): 15.21 | learning rate: 7.681E-06 | global batch size:    16 | lm loss: 6.974868E+00 | grad norm: 0.856 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1466/  128728 | consumed samples:        23456 | consumed tokens:     48037888 | elapsed time per iteration (s): 15.17 | learning rate: 7.686E-06 | global batch size:    16 | lm loss: 6.911908E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1467/  128728 | consumed samples:        23472 | consumed tokens:     48070656 | elapsed time per iteration (s): 15.21 | learning rate: 7.691E-06 | global batch size:    16 | lm loss: 6.894742E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1468/  128728 | consumed samples:        23488 | consumed tokens:     48103424 | elapsed time per iteration (s): 15.23 | learning rate: 7.697E-06 | global batch size:    16 | lm loss: 6.738654E+00 | grad norm: 1.618 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1469/  128728 | consumed samples:        23504 | consumed tokens:     48136192 | elapsed time per iteration (s): 15.20 | learning rate: 7.702E-06 | global batch size:    16 | lm loss: 6.781757E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1470/  128728 | consumed samples:        23520 | consumed tokens:     48168960 | elapsed time per iteration (s): 15.22 | learning rate: 7.707E-06 | global batch size:    16 | lm loss: 6.828523E+00 | grad norm: 1.455 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1471/  128728 | consumed samples:        23536 | consumed tokens:     48201728 | elapsed time per iteration (s): 15.25 | learning rate: 7.712E-06 | global batch size:    16 | lm loss: 6.891495E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1472/  128728 | consumed samples:        23552 | consumed tokens:     48234496 | elapsed time per iteration (s): 15.21 | learning rate: 7.718E-06 | global batch size:    16 | lm loss: 6.899791E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1473/  128728 | consumed samples:        23568 | consumed tokens:     48267264 | elapsed time per iteration (s): 15.21 | learning rate: 7.723E-06 | global batch size:    16 | lm loss: 6.920649E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1474/  128728 | consumed samples:        23584 | consumed tokens:     48300032 | elapsed time per iteration (s): 15.19 | learning rate: 7.728E-06 | global batch size:    16 | lm loss: 6.843232E+00 | grad norm: 0.914 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1475/  128728 | consumed samples:        23600 | consumed tokens:     48332800 | elapsed time per iteration (s): 15.21 | learning rate: 7.733E-06 | global batch size:    16 | lm loss: 6.969937E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1476/  128728 | consumed samples:        23616 | consumed tokens:     48365568 | elapsed time per iteration (s): 15.20 | learning rate: 7.739E-06 | global batch size:    16 | lm loss: 6.757054E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1477/  128728 | consumed samples:        23632 | consumed tokens:     48398336 | elapsed time per iteration (s): 15.21 | learning rate: 7.744E-06 | global batch size:    16 | lm loss: 6.830835E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1478/  128728 | consumed samples:        23648 | consumed tokens:     48431104 | elapsed time per iteration (s): 15.24 | learning rate: 7.749E-06 | global batch size:    16 | lm loss: 6.957498E+00 | grad norm: 0.985 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1479/  128728 | consumed samples:        23664 | consumed tokens:     48463872 | elapsed time per iteration (s): 15.24 | learning rate: 7.754E-06 | global batch size:    16 | lm loss: 6.825459E+00 | grad norm: 1.199 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1480/  128728 | consumed samples:        23680 | consumed tokens:     48496640 | elapsed time per iteration (s): 15.23 | learning rate: 7.759E-06 | global batch size:    16 | lm loss: 6.764349E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1481/  128728 | consumed samples:        23696 | consumed tokens:     48529408 | elapsed time per iteration (s): 15.27 | learning rate: 7.765E-06 | global batch size:    16 | lm loss: 6.935419E+00 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1482/  128728 | consumed samples:        23712 | consumed tokens:     48562176 | elapsed time per iteration (s): 15.24 | learning rate: 7.770E-06 | global batch size:    16 | lm loss: 6.933623E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1483/  128728 | consumed samples:        23728 | consumed tokens:     48594944 | elapsed time per iteration (s): 15.18 | learning rate: 7.775E-06 | global batch size:    16 | lm loss: 6.809566E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1484/  128728 | consumed samples:        23744 | consumed tokens:     48627712 | elapsed time per iteration (s): 15.28 | learning rate: 7.780E-06 | global batch size:    16 | lm loss: 6.744482E+00 | grad norm: 1.168 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1485/  128728 | consumed samples:        23760 | consumed tokens:     48660480 | elapsed time per iteration (s): 15.25 | learning rate: 7.786E-06 | global batch size:    16 | lm loss: 6.929039E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     1486/  128728 | consumed samples:        23776 | consumed tokens:     48693248 | elapsed time per iteration (s): 15.25 | learning rate: 7.791E-06 | global batch size:    16 | lm loss: 6.843914E+00 | grad norm: 1.031 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1487/  128728 | consumed samples:        23792 | consumed tokens:     48726016 | elapsed time per iteration (s): 15.24 | learning rate: 7.796E-06 | global batch size:    16 | lm loss: 7.174544E+00 | grad norm: 1.182 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1488/  128728 | consumed samples:        23808 | consumed tokens:     48758784 | elapsed time per iteration (s): 15.24 | learning rate: 7.801E-06 | global batch size:    16 | lm loss: 6.827503E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1489/  128728 | consumed samples:        23824 | consumed tokens:     48791552 | elapsed time per iteration (s): 15.25 | learning rate: 7.807E-06 | global batch size:    16 | lm loss: 6.747015E+00 | grad norm: 1.034 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1490/  128728 | consumed samples:        23840 | consumed tokens:     48824320 | elapsed time per iteration (s): 15.20 | learning rate: 7.812E-06 | global batch size:    16 | lm loss: 6.738760E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1491/  128728 | consumed samples:        23856 | consumed tokens:     48857088 | elapsed time per iteration (s): 15.25 | learning rate: 7.817E-06 | global batch size:    16 | lm loss: 6.907768E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1492/  128728 | consumed samples:        23872 | consumed tokens:     48889856 | elapsed time per iteration (s): 15.25 | learning rate: 7.822E-06 | global batch size:    16 | lm loss: 6.860197E+00 | grad norm: 1.022 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1493/  128728 | consumed samples:        23888 | consumed tokens:     48922624 | elapsed time per iteration (s): 15.23 | learning rate: 7.828E-06 | global batch size:    16 | lm loss: 6.858501E+00 | grad norm: 1.000 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1494/  128728 | consumed samples:        23904 | consumed tokens:     48955392 | elapsed time per iteration (s): 15.21 | learning rate: 7.833E-06 | global batch size:    16 | lm loss: 6.810994E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1495/  128728 | consumed samples:        23920 | consumed tokens:     48988160 | elapsed time per iteration (s): 15.24 | learning rate: 7.838E-06 | global batch size:    16 | lm loss: 6.897250E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1496/  128728 | consumed samples:        23936 | consumed tokens:     49020928 | elapsed time per iteration (s): 15.20 | learning rate: 7.843E-06 | global batch size:    16 | lm loss: 7.080896E+00 | grad norm: 1.038 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1497/  128728 | consumed samples:        23952 | consumed tokens:     49053696 | elapsed time per iteration (s): 15.26 | learning rate: 7.849E-06 | global batch size:    16 | lm loss: 6.848498E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1498/  128728 | consumed samples:        23968 | consumed tokens:     49086464 | elapsed time per iteration (s): 15.23 | learning rate: 7.854E-06 | global batch size:    16 | lm loss: 6.933249E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1499/  128728 | consumed samples:        23984 | consumed tokens:     49119232 | elapsed time per iteration (s): 15.23 | learning rate: 7.859E-06 | global batch size:    16 | lm loss: 7.075923E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1500/  128728 | consumed samples:        24000 | consumed tokens:     49152000 | elapsed time per iteration (s): 15.22 | learning rate: 7.864E-06 | global batch size:    16 | lm loss: 6.872234E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default0]:saving checkpoint at iteration    1500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default1]:[2022-03-03 12:16:52,729] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/mp_rank_01_model_states.pt
[default0]:[2022-03-03 12:16:52,902] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/mp_rank_00_model_states.pt
[default5]:[2022-03-03 12:17:14,031] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 12:17:14,381] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 12:17:14,467] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt
[default6]:[2022-03-03 12:17:14,472] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt
[default6]:[2022-03-03 12:17:14,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 12:17:14,479] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 12:17:14,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt
[default3]:[2022-03-03 12:17:14,543] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt
[default2]:[2022-03-03 12:17:14,579] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default1]:[2022-03-03 12:17:14,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 12:17:14,730] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 12:17:14,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt
[default2]:[2022-03-03 12:17:14,725] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 12:17:14,757] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 12:17:14,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default2]:[2022-03-03 12:17:14,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 12:17:14,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 12:17:14,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 12:17:14,982] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt
[default2]:[2022-03-03 12:17:14,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 12:17:14,982] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 12:17:14,928] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 12:17:14,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt
[default7]:[2022-03-03 12:17:15,013] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default7]:[2022-03-03 12:17:15,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 12:17:15,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt
[default2]:[2022-03-03 12:17:15,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt
[default7]:[2022-03-03 12:17:15,039] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt
[default5]:[2022-03-03 12:17:15,138] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt
[default1]:[2022-03-03 12:17:15,173] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt
[default6]:[2022-03-03 12:17:15,156] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default0]:[2022-03-03 12:17:15,293] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt
[default3]:[2022-03-03 12:17:15,262] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default0]:[2022-03-03 12:17:15,326] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt
[default0]:[2022-03-03 12:17:15,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt
[default5]:[2022-03-03 12:17:15,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 12:17:15,488] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default1]:[2022-03-03 12:17:15,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt
[default6]:[2022-03-03 12:17:15,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default0]:[2022-03-03 12:17:15,622] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default2]:[2022-03-03 12:17:15,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default1]:[2022-03-03 12:17:15,550] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt
[default5]:[2022-03-03 12:17:15,612] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 12:17:15,659] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default1]:[2022-03-03 12:17:15,681] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default4]:[2022-03-03 12:17:15,922] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 12:17:15,895] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt
[default3]:[2022-03-03 12:17:15,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default0]:[2022-03-03 12:17:16,034] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default2]:[2022-03-03 12:17:16,387] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 12:17:16,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default0]:[2022-03-03 12:17:16,961] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 12:17:16,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 12:17:16,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default1]:[2022-03-03 12:17:17,062] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default0]:[2022-03-03 12:17:17,096] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt
[default2]:[2022-03-03 12:17:17,182] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default1]:[2022-03-03 12:17:17,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt
[default3]:[2022-03-03 12:17:17,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt
[default3]:[2022-03-03 12:17:17,329] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt
[default4]:[2022-03-03 12:17:17,317] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default5]:[2022-03-03 12:17:17,581] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 12:17:17,562] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt
[default3]:[2022-03-03 12:17:17,776] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default4]:[2022-03-03 12:17:17,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt
[default0]:[2022-03-03 12:17:17,918] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt
[default2]:[2022-03-03 12:17:17,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default0]:[2022-03-03 12:17:18,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt
[default4]:[2022-03-03 12:17:18,006] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt
[default5]:[2022-03-03 12:17:18,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt
[default0]:[2022-03-03 12:17:17,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 12:17:18,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default2]:[2022-03-03 12:17:18,160] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt
[default1]:[2022-03-03 12:17:18,143] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 12:17:18,184] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default7]:[2022-03-03 12:17:18,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 12:17:18,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 12:17:18,249] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 12:17:18,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt
[default7]:[2022-03-03 12:17:18,399] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default7]:[2022-03-03 12:17:18,519] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt
[default6]:[2022-03-03 12:17:18,522] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt
[default1]:[2022-03-03 12:17:18,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt
[default1]:[2022-03-03 12:17:18,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt
[default3]:[2022-03-03 12:17:18,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 12:17:18,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt
[default3]:[2022-03-03 12:17:18,725] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt
[default0]:[2022-03-03 12:17:18,791] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt
[default7]:[2022-03-03 12:17:18,809] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 12:17:18,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 12:17:18,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt
[default1]:[2022-03-03 12:17:18,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default2]:[2022-03-03 12:17:19,093] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 12:17:19,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
[default1]:[2022-03-03 12:17:19,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt
[default4]:[2022-03-03 12:17:19,076] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
[default3]:[2022-03-03 12:17:19,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt
[default0]:[2022-03-03 12:17:19,228] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt
[default5]:[2022-03-03 12:17:19,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt
[default2]:[2022-03-03 12:17:19,157] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt
[default0]:[2022-03-03 12:17:19,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt
[default4]:[2022-03-03 12:17:19,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt
[default7]:[2022-03-03 12:17:19,270] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt
[default4]:[2022-03-03 12:17:19,315] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt
[default1]:[2022-03-03 12:17:19,338] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt
[default7]:[2022-03-03 12:17:19,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default5]:[2022-03-03 12:17:19,386] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 12:17:19,430] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 12:17:19,342] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt
[default5]:[2022-03-03 12:17:19,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 12:17:19,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 12:17:19,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt
[default2]:[2022-03-03 12:17:19,504] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt
[default6]:[2022-03-03 12:17:19,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 12:17:19,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 12:17:19,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt
[default1]:[2022-03-03 12:17:19,729] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 12:17:19,851] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt
[default2]:[2022-03-03 12:17:19,804] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default1]:[2022-03-03 12:17:19,831] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt
[default7]:[2022-03-03 12:17:19,878] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 12:17:19,914] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default6]:[2022-03-03 12:17:19,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 12:17:19,965] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 12:17:19,934] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt
[default6]:[2022-03-03 12:17:19,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt
[default0]:[2022-03-03 12:17:20,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 12:17:20,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt
[default1]:[2022-03-03 12:17:20,011] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt
[default3]:[2022-03-03 12:17:20,008] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default3]:[2022-03-03 12:17:20,081] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt
[default4]:[2022-03-03 12:17:19,987] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
[default6]:[2022-03-03 12:17:20,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt
[default5]:[2022-03-03 12:17:20,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 12:17:20,090] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt
[default3]:[2022-03-03 12:17:20,162] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default5]:[2022-03-03 12:17:20,176] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 12:17:20,145] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt
[default2]:[2022-03-03 12:17:20,194] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default5]:[2022-03-03 12:17:20,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt
[default1]:[2022-03-03 12:17:20,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default4]:[2022-03-03 12:17:20,281] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default7]:[2022-03-03 12:17:20,260] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt
[default1]:[2022-03-03 12:17:20,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt
[default7]:[2022-03-03 12:17:20,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt
[default0]:[2022-03-03 12:17:20,270] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt
[default0]:[2022-03-03 12:17:20,425] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 12:17:20,412] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt
[default3]:[2022-03-03 12:17:20,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default6]:[2022-03-03 12:17:20,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 12:17:20,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt
[default4]:[2022-03-03 12:17:20,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt
[default5]:[2022-03-03 12:17:20,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
[default5]:[2022-03-03 12:17:20,492] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt
[default6]:[2022-03-03 12:17:20,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt
[default5]:[2022-03-03 12:17:20,546] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt
[default2]:[2022-03-03 12:17:20,481] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt
[default4]:[2022-03-03 12:17:20,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt
[default1]:[2022-03-03 12:17:20,520] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt
[default5]:[2022-03-03 12:17:20,633] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt
[default4]:[2022-03-03 12:17:20,666] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt
[default4]:[2022-03-03 12:17:20,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt
[default2]:[2022-03-03 12:17:20,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt
[default1]:[2022-03-03 12:17:20,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt
[default6]:[2022-03-03 12:17:20,720] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt
[default2]:[2022-03-03 12:17:20,697] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 12:17:20,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt
[default2]:[2022-03-03 12:17:20,818] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 12:17:20,847] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 12:17:20,803] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt
[default5]:[2022-03-03 12:17:20,587] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default7]:[2022-03-03 12:17:20,885] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 12:17:20,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 12:17:20,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 12:17:20,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default1]:[2022-03-03 12:17:20,830] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 12:17:20,920] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 12:17:20,903] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 12:17:20,892] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt
[default6]:[2022-03-03 12:17:20,948] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default2]:[2022-03-03 12:17:21,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 12:17:21,034] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 12:17:21,068] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 12:17:21,049] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt
[default3]:[2022-03-03 12:17:21,057] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt
[default4]:[2022-03-03 12:17:20,843] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default1]:[2022-03-03 12:17:21,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt
[default0]:[2022-03-03 12:17:21,164] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt
[default2]:[2022-03-03 12:17:21,142] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt
[default3]:[2022-03-03 12:17:21,221] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt
[default7]:[2022-03-03 12:17:21,233] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 12:17:21,292] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default2]:[2022-03-03 12:17:21,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default2]:[2022-03-03 12:17:21,323] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default2]:[2022-03-03 12:17:21,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default5]:[2022-03-03 12:17:21,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default3]:[2022-03-03 12:17:21,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt
[default0]:[2022-03-03 12:17:21,505] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
[default3]:[2022-03-03 12:17:21,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default1]:[2022-03-03 12:17:21,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default7]:[2022-03-03 12:17:21,463] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt
[default4]:[2022-03-03 12:17:21,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default2]:[2022-03-03 12:17:21,545] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default6]:[2022-03-03 12:17:21,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt
[default3]:[2022-03-03 12:17:21,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 12:17:21,670] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default5]:[2022-03-03 12:17:21,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 12:17:21,727] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt
[default7]:[2022-03-03 12:17:21,765] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt
[default3]:[2022-03-03 12:17:21,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 12:17:21,779] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default3]:[2022-03-03 12:17:21,750] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt
[default1]:[2022-03-03 12:17:21,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt
[default0]:[2022-03-03 12:17:21,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default6]:[2022-03-03 12:17:21,922] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 12:17:21,947] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 12:17:22,029] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt
[default6]:[2022-03-03 12:17:22,004] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt
[default0]:[2022-03-03 12:17:22,095] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt
[default3]:[2022-03-03 12:17:22,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt
[default7]:[2022-03-03 12:17:22,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt
[default5]:[2022-03-03 12:17:22,266] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 12:17:22,270] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 12:17:22,292] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt
[default2]:[2022-03-03 12:17:22,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 12:17:22,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 12:17:22,325] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 12:17:22,392] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default5]:[2022-03-03 12:17:22,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 12:17:22,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt
[default5]:[2022-03-03 12:17:22,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 12:17:22,389] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 12:17:22,427] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt
[default5]:[2022-03-03 12:17:22,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 12:17:22,474] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default4]:[2022-03-03 12:17:22,458] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt
[default5]:[2022-03-03 12:17:22,507] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt
[default2]:[2022-03-03 12:17:22,568] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 12:17:22,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default1]:[2022-03-03 12:17:22,592] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 12:17:22,615] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt
[default0]:[2022-03-03 12:17:22,724] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default3]:[2022-03-03 12:17:22,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt
[default3]:[2022-03-03 12:17:22,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt
[default1]:[2022-03-03 12:17:22,858] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 12:17:22,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt
[default2]:[2022-03-03 12:17:23,045] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt
[default6]:[2022-03-03 12:17:22,967] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 12:17:22,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 12:17:23,113] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt
[default1]:[2022-03-03 12:17:23,139] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt
[default0]:[2022-03-03 12:17:23,185] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt
[default7]:[2022-03-03 12:17:23,201] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 12:17:23,184] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default2]:[2022-03-03 12:17:23,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt
[default0]:[2022-03-03 12:17:23,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt
[default3]:[2022-03-03 12:17:23,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default4]:[2022-03-03 12:17:23,529] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default2]:[2022-03-03 12:17:23,517] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt
[default0]:[2022-03-03 12:17:23,475] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt
[default6]:[2022-03-03 12:17:23,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt
[default1]:[2022-03-03 12:17:23,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 12:17:23,565] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default7]:[2022-03-03 12:17:23,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default4]:[2022-03-03 12:17:23,644] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 12:17:23,781] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt
[default3]:[2022-03-03 12:17:23,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt
[default7]:[2022-03-03 12:17:23,822] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 12:17:23,923] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 12:17:23,868] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default6]:[2022-03-03 12:17:23,884] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 12:17:23,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default3]:[2022-03-03 12:17:23,935] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default6]:[2022-03-03 12:17:23,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default1]:[2022-03-03 12:17:23,923] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default7]:[2022-03-03 12:17:23,944] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 12:17:23,957] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt
[default7]:[2022-03-03 12:17:24,015] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default1]:[2022-03-03 12:17:24,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default3]:[2022-03-03 12:17:24,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 12:17:24,253] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt
[default2]:[2022-03-03 12:17:24,234] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 12:17:24,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt
[default4]:[2022-03-03 12:17:24,302] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default5]:[2022-03-03 12:17:24,354] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 12:17:24,473] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 12:17:24,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default2]:[2022-03-03 12:17:24,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt
[default1]:[2022-03-03 12:17:24,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 12:17:24,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default6]:[2022-03-03 12:17:24,676] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default1]:[2022-03-03 12:17:24,666] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default5]:[2022-03-03 12:17:24,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt
[default7]:[2022-03-03 12:17:24,717] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default7]:[2022-03-03 12:17:24,712] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt
[default6]:[2022-03-03 12:17:24,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 12:17:24,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 12:17:24,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 12:17:24,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default6]:[2022-03-03 12:17:25,057] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default2]:[2022-03-03 12:17:25,102] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt
[default6]:[2022-03-03 12:17:25,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 12:17:25,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt
[default3]:[2022-03-03 12:17:25,150] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 12:17:25,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 12:17:25,272] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt
[default7]:[2022-03-03 12:17:25,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 12:17:25,925] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default0]:[2022-03-03 12:17:26,025] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 12:17:26,024] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default0]:[2022-03-03 12:17:26,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 12:17:26,391] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default5]:[2022-03-03 12:17:26,465] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 12:17:26,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default2]:[2022-03-03 12:17:26,515] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default4]:[2022-03-03 12:17:26,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default6]:[2022-03-03 12:17:26,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 12:17:26,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default5]:[2022-03-03 12:17:26,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default6]:[2022-03-03 12:17:26,556] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default3]:[2022-03-03 12:17:26,681] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 12:17:26,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt
[default6]:[2022-03-03 12:17:26,807] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default3]:[2022-03-03 12:17:26,808] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt
[default7]:[2022-03-03 12:17:26,833] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 12:17:26,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt
[default1]:[2022-03-03 12:17:26,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default3]:[2022-03-03 12:17:26,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default0]:[2022-03-03 12:17:26,979] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default4]:[2022-03-03 12:17:27,054] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default2]:[2022-03-03 12:17:27,220] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default4]:[2022-03-03 12:17:27,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 12:17:27,391] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt
[default7]:[2022-03-03 12:17:27,481] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default2]:[2022-03-03 12:17:27,576] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default6]:[2022-03-03 12:17:27,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default7]:[2022-03-03 12:17:27,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 12:17:27,557] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default5]:[2022-03-03 12:17:27,720] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt
[default0]:[2022-03-03 12:17:27,761] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt
[default1]:[2022-03-03 12:17:27,758] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 12:17:27,792] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt
[default5]:[2022-03-03 12:17:27,962] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default3]:[2022-03-03 12:17:28,009] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default1]:[2022-03-03 12:17:28,042] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 12:17:28,001] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default5]:[2022-03-03 12:17:28,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default5]:[2022-03-03 12:17:28,099] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default0]:[2022-03-03 12:17:28,113] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default6]:[2022-03-03 12:17:28,190] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 12:17:28,123] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default3]:[2022-03-03 12:17:28,162] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt
[default4]:[2022-03-03 12:17:28,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 12:17:28,222] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default1]:[2022-03-03 12:17:28,241] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt
[default3]:[2022-03-03 12:17:28,331] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt
[default6]:[2022-03-03 12:17:28,473] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 12:17:28,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 12:17:28,544] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 12:17:28,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default5]:[2022-03-03 12:17:29,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default4]:[2022-03-03 12:17:29,146] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 12:17:29,222] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 12:17:30,226] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default0]:[2022-03-03 12:17:30,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt
[default7]:[2022-03-03 12:17:30,489] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 12:17:30,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default0]:[2022-03-03 12:17:30,925] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default4]:[2022-03-03 12:17:30,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 12:17:31,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default3]:[2022-03-03 12:17:31,029] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 12:17:30,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default7]:[2022-03-03 12:17:31,421] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default5]:[2022-03-03 12:17:31,498] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default6]:[2022-03-03 12:17:31,481] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 12:17:31,498] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default0]:[2022-03-03 12:17:31,633] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default6]:[2022-03-03 12:17:31,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 12:17:32,091] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default1]:[2022-03-03 12:17:32,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 12:17:33,895] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 12:17:33,868] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default6]:[2022-03-03 12:17:35,216] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default7]:[2022-03-03 12:17:35,249] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default7]:time (ms) | save-checkpoint: 50223.83
[default0]:  successfully saved checkpoint at iteration    1500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default7]: iteration     1501/  128728 | consumed samples:        24016 | consumed tokens:     49184768 | elapsed time per iteration (s): 65.45 | learning rate: 7.870E-06 | global batch size:    16 | lm loss: 6.884842E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.244 | TFLOPs: 1.87 |
[default7]: iteration     1502/  128728 | consumed samples:        24032 | consumed tokens:     49217536 | elapsed time per iteration (s): 15.25 | learning rate: 7.875E-06 | global batch size:    16 | lm loss: 6.994694E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1503/  128728 | consumed samples:        24048 | consumed tokens:     49250304 | elapsed time per iteration (s): 15.24 | learning rate: 7.880E-06 | global batch size:    16 | lm loss: 6.964286E+00 | grad norm: 0.935 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1504/  128728 | consumed samples:        24064 | consumed tokens:     49283072 | elapsed time per iteration (s): 15.26 | learning rate: 7.885E-06 | global batch size:    16 | lm loss: 6.845483E+00 | grad norm: 1.042 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1505/  128728 | consumed samples:        24080 | consumed tokens:     49315840 | elapsed time per iteration (s): 15.22 | learning rate: 7.891E-06 | global batch size:    16 | lm loss: 6.827108E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1506/  128728 | consumed samples:        24096 | consumed tokens:     49348608 | elapsed time per iteration (s): 15.23 | learning rate: 7.896E-06 | global batch size:    16 | lm loss: 6.897807E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1507/  128728 | consumed samples:        24112 | consumed tokens:     49381376 | elapsed time per iteration (s): 15.19 | learning rate: 7.901E-06 | global batch size:    16 | lm loss: 6.798639E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1508/  128728 | consumed samples:        24128 | consumed tokens:     49414144 | elapsed time per iteration (s): 15.19 | learning rate: 7.906E-06 | global batch size:    16 | lm loss: 6.913240E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1509/  128728 | consumed samples:        24144 | consumed tokens:     49446912 | elapsed time per iteration (s): 15.21 | learning rate: 7.912E-06 | global batch size:    16 | lm loss: 6.769604E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1510/  128728 | consumed samples:        24160 | consumed tokens:     49479680 | elapsed time per iteration (s): 15.27 | learning rate: 7.917E-06 | global batch size:    16 | lm loss: 6.998416E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1511/  128728 | consumed samples:        24176 | consumed tokens:     49512448 | elapsed time per iteration (s): 15.21 | learning rate: 7.922E-06 | global batch size:    16 | lm loss: 6.917444E+00 | grad norm: 1.343 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1512/  128728 | consumed samples:        24192 | consumed tokens:     49545216 | elapsed time per iteration (s): 15.21 | learning rate: 7.927E-06 | global batch size:    16 | lm loss: 6.704676E+00 | grad norm: 0.944 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1513/  128728 | consumed samples:        24208 | consumed tokens:     49577984 | elapsed time per iteration (s): 15.21 | learning rate: 7.932E-06 | global batch size:    16 | lm loss: 6.625801E+00 | grad norm: 1.004 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1514/  128728 | consumed samples:        24224 | consumed tokens:     49610752 | elapsed time per iteration (s): 15.22 | learning rate: 7.938E-06 | global batch size:    16 | lm loss: 6.983078E+00 | grad norm: 1.145 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1515/  128728 | consumed samples:        24240 | consumed tokens:     49643520 | elapsed time per iteration (s): 15.16 | learning rate: 7.943E-06 | global batch size:    16 | lm loss: 6.767624E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1516/  128728 | consumed samples:        24256 | consumed tokens:     49676288 | elapsed time per iteration (s): 15.24 | learning rate: 7.948E-06 | global batch size:    16 | lm loss: 6.977883E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1517/  128728 | consumed samples:        24272 | consumed tokens:     49709056 | elapsed time per iteration (s): 15.21 | learning rate: 7.953E-06 | global batch size:    16 | lm loss: 6.882602E+00 | grad norm: 1.063 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1518/  128728 | consumed samples:        24288 | consumed tokens:     49741824 | elapsed time per iteration (s): 15.27 | learning rate: 7.959E-06 | global batch size:    16 | lm loss: 7.073375E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1519/  128728 | consumed samples:        24304 | consumed tokens:     49774592 | elapsed time per iteration (s): 15.25 | learning rate: 7.964E-06 | global batch size:    16 | lm loss: 6.735807E+00 | grad norm: 1.008 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1520/  128728 | consumed samples:        24320 | consumed tokens:     49807360 | elapsed time per iteration (s): 15.24 | learning rate: 7.969E-06 | global batch size:    16 | lm loss: 6.980803E+00 | grad norm: 1.043 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1521/  128728 | consumed samples:        24336 | consumed tokens:     49840128 | elapsed time per iteration (s): 15.26 | learning rate: 7.974E-06 | global batch size:    16 | lm loss: 6.703127E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1522/  128728 | consumed samples:        24352 | consumed tokens:     49872896 | elapsed time per iteration (s): 15.23 | learning rate: 7.980E-06 | global batch size:    16 | lm loss: 6.806160E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1523/  128728 | consumed samples:        24368 | consumed tokens:     49905664 | elapsed time per iteration (s): 15.17 | learning rate: 7.985E-06 | global batch size:    16 | lm loss: 6.960247E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1524/  128728 | consumed samples:        24384 | consumed tokens:     49938432 | elapsed time per iteration (s): 15.19 | learning rate: 7.990E-06 | global batch size:    16 | lm loss: 6.941716E+00 | grad norm: 0.877 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1525/  128728 | consumed samples:        24400 | consumed tokens:     49971200 | elapsed time per iteration (s): 15.23 | learning rate: 7.995E-06 | global batch size:    16 | lm loss: 6.999331E+00 | grad norm: 0.911 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1526/  128728 | consumed samples:        24416 | consumed tokens:     50003968 | elapsed time per iteration (s): 15.20 | learning rate: 8.001E-06 | global batch size:    16 | lm loss: 6.783548E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1527/  128728 | consumed samples:        24432 | consumed tokens:     50036736 | elapsed time per iteration (s): 15.24 | learning rate: 8.006E-06 | global batch size:    16 | lm loss: 6.833918E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1528/  128728 | consumed samples:        24448 | consumed tokens:     50069504 | elapsed time per iteration (s): 15.25 | learning rate: 8.011E-06 | global batch size:    16 | lm loss: 6.808972E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1529/  128728 | consumed samples:        24464 | consumed tokens:     50102272 | elapsed time per iteration (s): 15.19 | learning rate: 8.016E-06 | global batch size:    16 | lm loss: 6.914598E+00 | grad norm: 0.959 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1530/  128728 | consumed samples:        24480 | consumed tokens:     50135040 | elapsed time per iteration (s): 15.25 | learning rate: 8.022E-06 | global batch size:    16 | lm loss: 6.613030E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     1531/  128728 | consumed samples:        24496 | consumed tokens:     50167808 | elapsed time per iteration (s): 15.26 | learning rate: 8.027E-06 | global batch size:    16 | lm loss: 6.960011E+00 | grad norm: 0.959 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1532/  128728 | consumed samples:        24512 | consumed tokens:     50200576 | elapsed time per iteration (s): 15.22 | learning rate: 8.032E-06 | global batch size:    16 | lm loss: 6.928339E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1533/  128728 | consumed samples:        24528 | consumed tokens:     50233344 | elapsed time per iteration (s): 15.21 | learning rate: 8.037E-06 | global batch size:    16 | lm loss: 6.872521E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1534/  128728 | consumed samples:        24544 | consumed tokens:     50266112 | elapsed time per iteration (s): 15.22 | learning rate: 8.043E-06 | global batch size:    16 | lm loss: 7.037945E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1535/  128728 | consumed samples:        24560 | consumed tokens:     50298880 | elapsed time per iteration (s): 15.22 | learning rate: 8.048E-06 | global batch size:    16 | lm loss: 6.908956E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1536/  128728 | consumed samples:        24576 | consumed tokens:     50331648 | elapsed time per iteration (s): 15.22 | learning rate: 8.053E-06 | global batch size:    16 | lm loss: 6.830132E+00 | grad norm: 0.874 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1537/  128728 | consumed samples:        24592 | consumed tokens:     50364416 | elapsed time per iteration (s): 15.22 | learning rate: 8.058E-06 | global batch size:    16 | lm loss: 7.005225E+00 | grad norm: 0.901 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1538/  128728 | consumed samples:        24608 | consumed tokens:     50397184 | elapsed time per iteration (s): 15.24 | learning rate: 8.064E-06 | global batch size:    16 | lm loss: 6.873813E+00 | grad norm: 1.114 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1539/  128728 | consumed samples:        24624 | consumed tokens:     50429952 | elapsed time per iteration (s): 15.24 | learning rate: 8.069E-06 | global batch size:    16 | lm loss: 7.034050E+00 | grad norm: 1.201 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1540/  128728 | consumed samples:        24640 | consumed tokens:     50462720 | elapsed time per iteration (s): 15.20 | learning rate: 8.074E-06 | global batch size:    16 | lm loss: 6.716762E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1541/  128728 | consumed samples:        24656 | consumed tokens:     50495488 | elapsed time per iteration (s): 15.23 | learning rate: 8.079E-06 | global batch size:    16 | lm loss: 6.718003E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1542/  128728 | consumed samples:        24672 | consumed tokens:     50528256 | elapsed time per iteration (s): 15.20 | learning rate: 8.085E-06 | global batch size:    16 | lm loss: 6.716470E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1543/  128728 | consumed samples:        24688 | consumed tokens:     50561024 | elapsed time per iteration (s): 15.25 | learning rate: 8.090E-06 | global batch size:    16 | lm loss: 6.932936E+00 | grad norm: 1.057 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     1544/  128728 | consumed samples:        24704 | consumed tokens:     50593792 | elapsed time per iteration (s): 15.25 | learning rate: 8.095E-06 | global batch size:    16 | lm loss: 6.767054E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1545/  128728 | consumed samples:        24720 | consumed tokens:     50626560 | elapsed time per iteration (s): 15.24 | learning rate: 8.100E-06 | global batch size:    16 | lm loss: 6.696455E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1546/  128728 | consumed samples:        24736 | consumed tokens:     50659328 | elapsed time per iteration (s): 15.23 | learning rate: 8.106E-06 | global batch size:    16 | lm loss: 6.987879E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1547/  128728 | consumed samples:        24752 | consumed tokens:     50692096 | elapsed time per iteration (s): 15.30 | learning rate: 8.111E-06 | global batch size:    16 | lm loss: 6.568095E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.045 | TFLOPs: 8.00 |
[default7]: iteration     1548/  128728 | consumed samples:        24768 | consumed tokens:     50724864 | elapsed time per iteration (s): 15.21 | learning rate: 8.116E-06 | global batch size:    16 | lm loss: 6.986506E+00 | grad norm: 0.860 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1549/  128728 | consumed samples:        24784 | consumed tokens:     50757632 | elapsed time per iteration (s): 15.24 | learning rate: 8.121E-06 | global batch size:    16 | lm loss: 7.040531E+00 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1550/  128728 | consumed samples:        24800 | consumed tokens:     50790400 | elapsed time per iteration (s): 15.20 | learning rate: 8.126E-06 | global batch size:    16 | lm loss: 6.698561E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1551/  128728 | consumed samples:        24816 | consumed tokens:     50823168 | elapsed time per iteration (s): 15.23 | learning rate: 8.132E-06 | global batch size:    16 | lm loss: 6.824203E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1552/  128728 | consumed samples:        24832 | consumed tokens:     50855936 | elapsed time per iteration (s): 15.28 | learning rate: 8.137E-06 | global batch size:    16 | lm loss: 6.724894E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1553/  128728 | consumed samples:        24848 | consumed tokens:     50888704 | elapsed time per iteration (s): 15.22 | learning rate: 8.142E-06 | global batch size:    16 | lm loss: 6.692251E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1554/  128728 | consumed samples:        24864 | consumed tokens:     50921472 | elapsed time per iteration (s): 15.23 | learning rate: 8.147E-06 | global batch size:    16 | lm loss: 6.816679E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1555/  128728 | consumed samples:        24880 | consumed tokens:     50954240 | elapsed time per iteration (s): 15.18 | learning rate: 8.153E-06 | global batch size:    16 | lm loss: 6.784638E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1556/  128728 | consumed samples:        24896 | consumed tokens:     50987008 | elapsed time per iteration (s): 15.25 | learning rate: 8.158E-06 | global batch size:    16 | lm loss: 7.072264E+00 | grad norm: 1.032 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1557/  128728 | consumed samples:        24912 | consumed tokens:     51019776 | elapsed time per iteration (s): 15.25 | learning rate: 8.163E-06 | global batch size:    16 | lm loss: 7.026040E+00 | grad norm: 1.164 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1558/  128728 | consumed samples:        24928 | consumed tokens:     51052544 | elapsed time per iteration (s): 15.25 | learning rate: 8.168E-06 | global batch size:    16 | lm loss: 6.760884E+00 | grad norm: 1.059 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1559/  128728 | consumed samples:        24944 | consumed tokens:     51085312 | elapsed time per iteration (s): 15.24 | learning rate: 8.174E-06 | global batch size:    16 | lm loss: 6.945187E+00 | grad norm: 3.044 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1560/  128728 | consumed samples:        24960 | consumed tokens:     51118080 | elapsed time per iteration (s): 15.24 | learning rate: 8.179E-06 | global batch size:    16 | lm loss: 6.917427E+00 | grad norm: 1.624 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1561/  128728 | consumed samples:        24976 | consumed tokens:     51150848 | elapsed time per iteration (s): 15.23 | learning rate: 8.184E-06 | global batch size:    16 | lm loss: 6.880846E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1562/  128728 | consumed samples:        24992 | consumed tokens:     51183616 | elapsed time per iteration (s): 15.25 | learning rate: 8.189E-06 | global batch size:    16 | lm loss: 6.682335E+00 | grad norm: 1.081 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1563/  128728 | consumed samples:        25008 | consumed tokens:     51216384 | elapsed time per iteration (s): 15.20 | learning rate: 8.195E-06 | global batch size:    16 | lm loss: 6.699176E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1564/  128728 | consumed samples:        25024 | consumed tokens:     51249152 | elapsed time per iteration (s): 15.25 | learning rate: 8.200E-06 | global batch size:    16 | lm loss: 7.053262E+00 | grad norm: 1.216 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1565/  128728 | consumed samples:        25040 | consumed tokens:     51281920 | elapsed time per iteration (s): 15.21 | learning rate: 8.205E-06 | global batch size:    16 | lm loss: 6.849342E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1566/  128728 | consumed samples:        25056 | consumed tokens:     51314688 | elapsed time per iteration (s): 15.23 | learning rate: 8.210E-06 | global batch size:    16 | lm loss: 6.907884E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1567/  128728 | consumed samples:        25072 | consumed tokens:     51347456 | elapsed time per iteration (s): 15.24 | learning rate: 8.216E-06 | global batch size:    16 | lm loss: 6.791646E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1568/  128728 | consumed samples:        25088 | consumed tokens:     51380224 | elapsed time per iteration (s): 15.23 | learning rate: 8.221E-06 | global batch size:    16 | lm loss: 6.826554E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1569/  128728 | consumed samples:        25104 | consumed tokens:     51412992 | elapsed time per iteration (s): 15.17 | learning rate: 8.226E-06 | global batch size:    16 | lm loss: 6.818520E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1570/  128728 | consumed samples:        25120 | consumed tokens:     51445760 | elapsed time per iteration (s): 15.30 | learning rate: 8.231E-06 | global batch size:    16 | lm loss: 6.837758E+00 | grad norm: 0.975 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     1571/  128728 | consumed samples:        25136 | consumed tokens:     51478528 | elapsed time per iteration (s): 15.24 | learning rate: 8.237E-06 | global batch size:    16 | lm loss: 6.881770E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1572/  128728 | consumed samples:        25152 | consumed tokens:     51511296 | elapsed time per iteration (s): 15.22 | learning rate: 8.242E-06 | global batch size:    16 | lm loss: 6.722608E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1573/  128728 | consumed samples:        25168 | consumed tokens:     51544064 | elapsed time per iteration (s): 15.24 | learning rate: 8.247E-06 | global batch size:    16 | lm loss: 6.414332E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1574/  128728 | consumed samples:        25184 | consumed tokens:     51576832 | elapsed time per iteration (s): 15.19 | learning rate: 8.252E-06 | global batch size:    16 | lm loss: 6.807733E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1575/  128728 | consumed samples:        25200 | consumed tokens:     51609600 | elapsed time per iteration (s): 15.24 | learning rate: 8.258E-06 | global batch size:    16 | lm loss: 6.740201E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1576/  128728 | consumed samples:        25216 | consumed tokens:     51642368 | elapsed time per iteration (s): 15.26 | learning rate: 8.263E-06 | global batch size:    16 | lm loss: 7.032575E+00 | grad norm: 1.159 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1577/  128728 | consumed samples:        25232 | consumed tokens:     51675136 | elapsed time per iteration (s): 15.27 | learning rate: 8.268E-06 | global batch size:    16 | lm loss: 6.839057E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1578/  128728 | consumed samples:        25248 | consumed tokens:     51707904 | elapsed time per iteration (s): 15.24 | learning rate: 8.273E-06 | global batch size:    16 | lm loss: 7.019974E+00 | grad norm: 3.045 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1579/  128728 | consumed samples:        25264 | consumed tokens:     51740672 | elapsed time per iteration (s): 15.26 | learning rate: 8.279E-06 | global batch size:    16 | lm loss: 6.777742E+00 | grad norm: 1.144 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1580/  128728 | consumed samples:        25280 | consumed tokens:     51773440 | elapsed time per iteration (s): 15.22 | learning rate: 8.284E-06 | global batch size:    16 | lm loss: 6.943933E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1581/  128728 | consumed samples:        25296 | consumed tokens:     51806208 | elapsed time per iteration (s): 15.25 | learning rate: 8.289E-06 | global batch size:    16 | lm loss: 6.761483E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1582/  128728 | consumed samples:        25312 | consumed tokens:     51838976 | elapsed time per iteration (s): 15.19 | learning rate: 8.294E-06 | global batch size:    16 | lm loss: 6.671811E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1583/  128728 | consumed samples:        25328 | consumed tokens:     51871744 | elapsed time per iteration (s): 15.22 | learning rate: 8.300E-06 | global batch size:    16 | lm loss: 6.679467E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1584/  128728 | consumed samples:        25344 | consumed tokens:     51904512 | elapsed time per iteration (s): 15.23 | learning rate: 8.305E-06 | global batch size:    16 | lm loss: 6.834284E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1585/  128728 | consumed samples:        25360 | consumed tokens:     51937280 | elapsed time per iteration (s): 15.22 | learning rate: 8.310E-06 | global batch size:    16 | lm loss: 6.797321E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1586/  128728 | consumed samples:        25376 | consumed tokens:     51970048 | elapsed time per iteration (s): 15.17 | learning rate: 8.315E-06 | global batch size:    16 | lm loss: 6.856018E+00 | grad norm: 0.918 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1587/  128728 | consumed samples:        25392 | consumed tokens:     52002816 | elapsed time per iteration (s): 15.17 | learning rate: 8.320E-06 | global batch size:    16 | lm loss: 6.861340E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     1588/  128728 | consumed samples:        25408 | consumed tokens:     52035584 | elapsed time per iteration (s): 15.22 | learning rate: 8.326E-06 | global batch size:    16 | lm loss: 6.824626E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1589/  128728 | consumed samples:        25424 | consumed tokens:     52068352 | elapsed time per iteration (s): 15.15 | learning rate: 8.331E-06 | global batch size:    16 | lm loss: 6.807738E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     1590/  128728 | consumed samples:        25440 | consumed tokens:     52101120 | elapsed time per iteration (s): 15.21 | learning rate: 8.336E-06 | global batch size:    16 | lm loss: 6.876394E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1591/  128728 | consumed samples:        25456 | consumed tokens:     52133888 | elapsed time per iteration (s): 15.23 | learning rate: 8.341E-06 | global batch size:    16 | lm loss: 6.738904E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1592/  128728 | consumed samples:        25472 | consumed tokens:     52166656 | elapsed time per iteration (s): 15.21 | learning rate: 8.347E-06 | global batch size:    16 | lm loss: 6.556417E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1593/  128728 | consumed samples:        25488 | consumed tokens:     52199424 | elapsed time per iteration (s): 15.17 | learning rate: 8.352E-06 | global batch size:    16 | lm loss: 6.714180E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1594/  128728 | consumed samples:        25504 | consumed tokens:     52232192 | elapsed time per iteration (s): 15.18 | learning rate: 8.357E-06 | global batch size:    16 | lm loss: 6.833164E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1595/  128728 | consumed samples:        25520 | consumed tokens:     52264960 | elapsed time per iteration (s): 15.27 | learning rate: 8.362E-06 | global batch size:    16 | lm loss: 6.748915E+00 | grad norm: 1.107 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1596/  128728 | consumed samples:        25536 | consumed tokens:     52297728 | elapsed time per iteration (s): 15.21 | learning rate: 8.368E-06 | global batch size:    16 | lm loss: 6.567333E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1597/  128728 | consumed samples:        25552 | consumed tokens:     52330496 | elapsed time per iteration (s): 15.18 | learning rate: 8.373E-06 | global batch size:    16 | lm loss: 6.716132E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1598/  128728 | consumed samples:        25568 | consumed tokens:     52363264 | elapsed time per iteration (s): 15.18 | learning rate: 8.378E-06 | global batch size:    16 | lm loss: 7.036856E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1599/  128728 | consumed samples:        25584 | consumed tokens:     52396032 | elapsed time per iteration (s): 15.26 | learning rate: 8.383E-06 | global batch size:    16 | lm loss: 6.838940E+00 | grad norm: 1.007 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1600/  128728 | consumed samples:        25600 | consumed tokens:     52428800 | elapsed time per iteration (s): 15.22 | learning rate: 8.389E-06 | global batch size:    16 | lm loss: 6.934296E+00 | grad norm: 0.969 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1601/  128728 | consumed samples:        25616 | consumed tokens:     52461568 | elapsed time per iteration (s): 15.20 | learning rate: 8.394E-06 | global batch size:    16 | lm loss: 7.009863E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1602/  128728 | consumed samples:        25632 | consumed tokens:     52494336 | elapsed time per iteration (s): 15.22 | learning rate: 8.399E-06 | global batch size:    16 | lm loss: 6.868528E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1603/  128728 | consumed samples:        25648 | consumed tokens:     52527104 | elapsed time per iteration (s): 15.26 | learning rate: 8.404E-06 | global batch size:    16 | lm loss: 6.692941E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1604/  128728 | consumed samples:        25664 | consumed tokens:     52559872 | elapsed time per iteration (s): 15.25 | learning rate: 8.410E-06 | global batch size:    16 | lm loss: 6.775326E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1605/  128728 | consumed samples:        25680 | consumed tokens:     52592640 | elapsed time per iteration (s): 15.23 | learning rate: 8.415E-06 | global batch size:    16 | lm loss: 6.836594E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1606/  128728 | consumed samples:        25696 | consumed tokens:     52625408 | elapsed time per iteration (s): 15.19 | learning rate: 8.420E-06 | global batch size:    16 | lm loss: 6.700777E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1607/  128728 | consumed samples:        25712 | consumed tokens:     52658176 | elapsed time per iteration (s): 15.22 | learning rate: 8.425E-06 | global batch size:    16 | lm loss: 6.842509E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1608/  128728 | consumed samples:        25728 | consumed tokens:     52690944 | elapsed time per iteration (s): 15.22 | learning rate: 8.431E-06 | global batch size:    16 | lm loss: 6.609758E+00 | grad norm: 0.938 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1609/  128728 | consumed samples:        25744 | consumed tokens:     52723712 | elapsed time per iteration (s): 15.24 | learning rate: 8.436E-06 | global batch size:    16 | lm loss: 6.705388E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1610/  128728 | consumed samples:        25760 | consumed tokens:     52756480 | elapsed time per iteration (s): 15.22 | learning rate: 8.441E-06 | global batch size:    16 | lm loss: 7.225027E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1611/  128728 | consumed samples:        25776 | consumed tokens:     52789248 | elapsed time per iteration (s): 15.23 | learning rate: 8.446E-06 | global batch size:    16 | lm loss: 6.473947E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1612/  128728 | consumed samples:        25792 | consumed tokens:     52822016 | elapsed time per iteration (s): 15.22 | learning rate: 8.452E-06 | global batch size:    16 | lm loss: 6.753922E+00 | grad norm: 0.966 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1613/  128728 | consumed samples:        25808 | consumed tokens:     52854784 | elapsed time per iteration (s): 15.22 | learning rate: 8.457E-06 | global batch size:    16 | lm loss: 6.546061E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1614/  128728 | consumed samples:        25824 | consumed tokens:     52887552 | elapsed time per iteration (s): 15.21 | learning rate: 8.462E-06 | global batch size:    16 | lm loss: 6.621816E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1615/  128728 | consumed samples:        25840 | consumed tokens:     52920320 | elapsed time per iteration (s): 15.24 | learning rate: 8.467E-06 | global batch size:    16 | lm loss: 6.808933E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1616/  128728 | consumed samples:        25856 | consumed tokens:     52953088 | elapsed time per iteration (s): 15.24 | learning rate: 8.473E-06 | global batch size:    16 | lm loss: 6.900961E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1617/  128728 | consumed samples:        25872 | consumed tokens:     52985856 | elapsed time per iteration (s): 15.26 | learning rate: 8.478E-06 | global batch size:    16 | lm loss: 6.817991E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1618/  128728 | consumed samples:        25888 | consumed tokens:     53018624 | elapsed time per iteration (s): 15.24 | learning rate: 8.483E-06 | global batch size:    16 | lm loss: 6.795853E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1619/  128728 | consumed samples:        25904 | consumed tokens:     53051392 | elapsed time per iteration (s): 15.30 | learning rate: 8.488E-06 | global batch size:    16 | lm loss: 6.808154E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     1620/  128728 | consumed samples:        25920 | consumed tokens:     53084160 | elapsed time per iteration (s): 15.22 | learning rate: 8.493E-06 | global batch size:    16 | lm loss: 6.756622E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1621/  128728 | consumed samples:        25936 | consumed tokens:     53116928 | elapsed time per iteration (s): 15.24 | learning rate: 8.499E-06 | global batch size:    16 | lm loss: 6.731472E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1622/  128728 | consumed samples:        25952 | consumed tokens:     53149696 | elapsed time per iteration (s): 15.25 | learning rate: 8.504E-06 | global batch size:    16 | lm loss: 6.762363E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1623/  128728 | consumed samples:        25968 | consumed tokens:     53182464 | elapsed time per iteration (s): 15.22 | learning rate: 8.509E-06 | global batch size:    16 | lm loss: 6.635185E+00 | grad norm: 1.285 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1624/  128728 | consumed samples:        25984 | consumed tokens:     53215232 | elapsed time per iteration (s): 15.24 | learning rate: 8.514E-06 | global batch size:    16 | lm loss: 6.719479E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1625/  128728 | consumed samples:        26000 | consumed tokens:     53248000 | elapsed time per iteration (s): 15.20 | learning rate: 8.520E-06 | global batch size:    16 | lm loss: 6.719177E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1626/  128728 | consumed samples:        26016 | consumed tokens:     53280768 | elapsed time per iteration (s): 15.23 | learning rate: 8.525E-06 | global batch size:    16 | lm loss: 6.717042E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1627/  128728 | consumed samples:        26032 | consumed tokens:     53313536 | elapsed time per iteration (s): 15.20 | learning rate: 8.530E-06 | global batch size:    16 | lm loss: 6.696861E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1628/  128728 | consumed samples:        26048 | consumed tokens:     53346304 | elapsed time per iteration (s): 15.23 | learning rate: 8.535E-06 | global batch size:    16 | lm loss: 6.720668E+00 | grad norm: 0.949 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1629/  128728 | consumed samples:        26064 | consumed tokens:     53379072 | elapsed time per iteration (s): 15.21 | learning rate: 8.541E-06 | global batch size:    16 | lm loss: 6.852234E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1630/  128728 | consumed samples:        26080 | consumed tokens:     53411840 | elapsed time per iteration (s): 15.23 | learning rate: 8.546E-06 | global batch size:    16 | lm loss: 6.824490E+00 | grad norm: 1.112 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1631/  128728 | consumed samples:        26096 | consumed tokens:     53444608 | elapsed time per iteration (s): 15.23 | learning rate: 8.551E-06 | global batch size:    16 | lm loss: 6.849283E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1632/  128728 | consumed samples:        26112 | consumed tokens:     53477376 | elapsed time per iteration (s): 15.22 | learning rate: 8.556E-06 | global batch size:    16 | lm loss: 6.797266E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1633/  128728 | consumed samples:        26128 | consumed tokens:     53510144 | elapsed time per iteration (s): 15.24 | learning rate: 8.562E-06 | global batch size:    16 | lm loss: 6.806245E+00 | grad norm: 0.881 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1634/  128728 | consumed samples:        26144 | consumed tokens:     53542912 | elapsed time per iteration (s): 15.23 | learning rate: 8.567E-06 | global batch size:    16 | lm loss: 6.727156E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1635/  128728 | consumed samples:        26160 | consumed tokens:     53575680 | elapsed time per iteration (s): 15.22 | learning rate: 8.572E-06 | global batch size:    16 | lm loss: 6.681533E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1636/  128728 | consumed samples:        26176 | consumed tokens:     53608448 | elapsed time per iteration (s): 15.24 | learning rate: 8.577E-06 | global batch size:    16 | lm loss: 6.698471E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1637/  128728 | consumed samples:        26192 | consumed tokens:     53641216 | elapsed time per iteration (s): 15.22 | learning rate: 8.583E-06 | global batch size:    16 | lm loss: 6.622070E+00 | grad norm: 0.901 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1638/  128728 | consumed samples:        26208 | consumed tokens:     53673984 | elapsed time per iteration (s): 15.21 | learning rate: 8.588E-06 | global batch size:    16 | lm loss: 6.974732E+00 | grad norm: 0.849 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1639/  128728 | consumed samples:        26224 | consumed tokens:     53706752 | elapsed time per iteration (s): 15.23 | learning rate: 8.593E-06 | global batch size:    16 | lm loss: 6.687452E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1640/  128728 | consumed samples:        26240 | consumed tokens:     53739520 | elapsed time per iteration (s): 15.24 | learning rate: 8.598E-06 | global batch size:    16 | lm loss: 6.786465E+00 | grad norm: 1.361 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1641/  128728 | consumed samples:        26256 | consumed tokens:     53772288 | elapsed time per iteration (s): 15.21 | learning rate: 8.604E-06 | global batch size:    16 | lm loss: 6.489959E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1642/  128728 | consumed samples:        26272 | consumed tokens:     53805056 | elapsed time per iteration (s): 15.22 | learning rate: 8.609E-06 | global batch size:    16 | lm loss: 6.655648E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1643/  128728 | consumed samples:        26288 | consumed tokens:     53837824 | elapsed time per iteration (s): 15.23 | learning rate: 8.614E-06 | global batch size:    16 | lm loss: 6.969821E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1644/  128728 | consumed samples:        26304 | consumed tokens:     53870592 | elapsed time per iteration (s): 15.20 | learning rate: 8.619E-06 | global batch size:    16 | lm loss: 6.872509E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1645/  128728 | consumed samples:        26320 | consumed tokens:     53903360 | elapsed time per iteration (s): 15.19 | learning rate: 8.625E-06 | global batch size:    16 | lm loss: 6.742445E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1646/  128728 | consumed samples:        26336 | consumed tokens:     53936128 | elapsed time per iteration (s): 15.22 | learning rate: 8.630E-06 | global batch size:    16 | lm loss: 6.764326E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1647/  128728 | consumed samples:        26352 | consumed tokens:     53968896 | elapsed time per iteration (s): 15.22 | learning rate: 8.635E-06 | global batch size:    16 | lm loss: 6.716838E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1648/  128728 | consumed samples:        26368 | consumed tokens:     54001664 | elapsed time per iteration (s): 15.24 | learning rate: 8.640E-06 | global batch size:    16 | lm loss: 6.780625E+00 | grad norm: 1.095 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1649/  128728 | consumed samples:        26384 | consumed tokens:     54034432 | elapsed time per iteration (s): 15.22 | learning rate: 8.646E-06 | global batch size:    16 | lm loss: 6.868196E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1650/  128728 | consumed samples:        26400 | consumed tokens:     54067200 | elapsed time per iteration (s): 15.23 | learning rate: 8.651E-06 | global batch size:    16 | lm loss: 6.551585E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1651/  128728 | consumed samples:        26416 | consumed tokens:     54099968 | elapsed time per iteration (s): 15.21 | learning rate: 8.656E-06 | global batch size:    16 | lm loss: 6.747394E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1652/  128728 | consumed samples:        26432 | consumed tokens:     54132736 | elapsed time per iteration (s): 15.24 | learning rate: 8.661E-06 | global batch size:    16 | lm loss: 6.589448E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1653/  128728 | consumed samples:        26448 | consumed tokens:     54165504 | elapsed time per iteration (s): 15.23 | learning rate: 8.667E-06 | global batch size:    16 | lm loss: 6.797907E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1654/  128728 | consumed samples:        26464 | consumed tokens:     54198272 | elapsed time per iteration (s): 15.23 | learning rate: 8.672E-06 | global batch size:    16 | lm loss: 6.482568E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1655/  128728 | consumed samples:        26480 | consumed tokens:     54231040 | elapsed time per iteration (s): 15.23 | learning rate: 8.677E-06 | global batch size:    16 | lm loss: 6.558523E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1656/  128728 | consumed samples:        26496 | consumed tokens:     54263808 | elapsed time per iteration (s): 15.22 | learning rate: 8.682E-06 | global batch size:    16 | lm loss: 6.789701E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1657/  128728 | consumed samples:        26512 | consumed tokens:     54296576 | elapsed time per iteration (s): 15.21 | learning rate: 8.687E-06 | global batch size:    16 | lm loss: 6.682187E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1658/  128728 | consumed samples:        26528 | consumed tokens:     54329344 | elapsed time per iteration (s): 15.25 | learning rate: 8.693E-06 | global batch size:    16 | lm loss: 6.815564E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1659/  128728 | consumed samples:        26544 | consumed tokens:     54362112 | elapsed time per iteration (s): 15.22 | learning rate: 8.698E-06 | global batch size:    16 | lm loss: 6.569890E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1660/  128728 | consumed samples:        26560 | consumed tokens:     54394880 | elapsed time per iteration (s): 15.22 | learning rate: 8.703E-06 | global batch size:    16 | lm loss: 6.928395E+00 | grad norm: 1.473 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1661/  128728 | consumed samples:        26576 | consumed tokens:     54427648 | elapsed time per iteration (s): 15.24 | learning rate: 8.708E-06 | global batch size:    16 | lm loss: 6.680755E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1662/  128728 | consumed samples:        26592 | consumed tokens:     54460416 | elapsed time per iteration (s): 15.25 | learning rate: 8.714E-06 | global batch size:    16 | lm loss: 6.746358E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1663/  128728 | consumed samples:        26608 | consumed tokens:     54493184 | elapsed time per iteration (s): 15.20 | learning rate: 8.719E-06 | global batch size:    16 | lm loss: 6.800119E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1664/  128728 | consumed samples:        26624 | consumed tokens:     54525952 | elapsed time per iteration (s): 15.21 | learning rate: 8.724E-06 | global batch size:    16 | lm loss: 6.989696E+00 | grad norm: 0.926 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1665/  128728 | consumed samples:        26640 | consumed tokens:     54558720 | elapsed time per iteration (s): 15.22 | learning rate: 8.729E-06 | global batch size:    16 | lm loss: 6.610164E+00 | grad norm: 0.990 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1666/  128728 | consumed samples:        26656 | consumed tokens:     54591488 | elapsed time per iteration (s): 15.24 | learning rate: 8.735E-06 | global batch size:    16 | lm loss: 6.775718E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1667/  128728 | consumed samples:        26672 | consumed tokens:     54624256 | elapsed time per iteration (s): 15.21 | learning rate: 8.740E-06 | global batch size:    16 | lm loss: 6.772297E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1668/  128728 | consumed samples:        26688 | consumed tokens:     54657024 | elapsed time per iteration (s): 15.21 | learning rate: 8.745E-06 | global batch size:    16 | lm loss: 6.742491E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1669/  128728 | consumed samples:        26704 | consumed tokens:     54689792 | elapsed time per iteration (s): 15.21 | learning rate: 8.750E-06 | global batch size:    16 | lm loss: 6.486816E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1670/  128728 | consumed samples:        26720 | consumed tokens:     54722560 | elapsed time per iteration (s): 15.15 | learning rate: 8.756E-06 | global batch size:    16 | lm loss: 6.712128E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     1671/  128728 | consumed samples:        26736 | consumed tokens:     54755328 | elapsed time per iteration (s): 15.20 | learning rate: 8.761E-06 | global batch size:    16 | lm loss: 6.731301E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1672/  128728 | consumed samples:        26752 | consumed tokens:     54788096 | elapsed time per iteration (s): 15.22 | learning rate: 8.766E-06 | global batch size:    16 | lm loss: 6.672599E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1673/  128728 | consumed samples:        26768 | consumed tokens:     54820864 | elapsed time per iteration (s): 15.25 | learning rate: 8.771E-06 | global batch size:    16 | lm loss: 6.849351E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1674/  128728 | consumed samples:        26784 | consumed tokens:     54853632 | elapsed time per iteration (s): 15.23 | learning rate: 8.777E-06 | global batch size:    16 | lm loss: 6.601808E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1675/  128728 | consumed samples:        26800 | consumed tokens:     54886400 | elapsed time per iteration (s): 15.18 | learning rate: 8.782E-06 | global batch size:    16 | lm loss: 6.788216E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1676/  128728 | consumed samples:        26816 | consumed tokens:     54919168 | elapsed time per iteration (s): 15.23 | learning rate: 8.787E-06 | global batch size:    16 | lm loss: 6.842864E+00 | grad norm: 1.324 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1677/  128728 | consumed samples:        26832 | consumed tokens:     54951936 | elapsed time per iteration (s): 15.24 | learning rate: 8.792E-06 | global batch size:    16 | lm loss: 6.575851E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1678/  128728 | consumed samples:        26848 | consumed tokens:     54984704 | elapsed time per iteration (s): 15.25 | learning rate: 8.798E-06 | global batch size:    16 | lm loss: 6.952417E+00 | grad norm: 1.096 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1679/  128728 | consumed samples:        26864 | consumed tokens:     55017472 | elapsed time per iteration (s): 15.24 | learning rate: 8.803E-06 | global batch size:    16 | lm loss: 6.949244E+00 | grad norm: 0.932 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1680/  128728 | consumed samples:        26880 | consumed tokens:     55050240 | elapsed time per iteration (s): 15.18 | learning rate: 8.808E-06 | global batch size:    16 | lm loss: 6.862979E+00 | grad norm: 1.214 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1681/  128728 | consumed samples:        26896 | consumed tokens:     55083008 | elapsed time per iteration (s): 15.22 | learning rate: 8.813E-06 | global batch size:    16 | lm loss: 6.461435E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1682/  128728 | consumed samples:        26912 | consumed tokens:     55115776 | elapsed time per iteration (s): 15.21 | learning rate: 8.819E-06 | global batch size:    16 | lm loss: 6.786581E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1683/  128728 | consumed samples:        26928 | consumed tokens:     55148544 | elapsed time per iteration (s): 15.27 | learning rate: 8.824E-06 | global batch size:    16 | lm loss: 6.698905E+00 | grad norm: 1.028 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1684/  128728 | consumed samples:        26944 | consumed tokens:     55181312 | elapsed time per iteration (s): 15.25 | learning rate: 8.829E-06 | global batch size:    16 | lm loss: 6.586331E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1685/  128728 | consumed samples:        26960 | consumed tokens:     55214080 | elapsed time per iteration (s): 15.14 | learning rate: 8.834E-06 | global batch size:    16 | lm loss: 6.709742E+00 | grad norm: 0.892 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     1686/  128728 | consumed samples:        26976 | consumed tokens:     55246848 | elapsed time per iteration (s): 15.17 | learning rate: 8.840E-06 | global batch size:    16 | lm loss: 6.868404E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1687/  128728 | consumed samples:        26992 | consumed tokens:     55279616 | elapsed time per iteration (s): 15.28 | learning rate: 8.845E-06 | global batch size:    16 | lm loss: 6.532902E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1688/  128728 | consumed samples:        27008 | consumed tokens:     55312384 | elapsed time per iteration (s): 15.24 | learning rate: 8.850E-06 | global batch size:    16 | lm loss: 6.821950E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1689/  128728 | consumed samples:        27024 | consumed tokens:     55345152 | elapsed time per iteration (s): 15.24 | learning rate: 8.855E-06 | global batch size:    16 | lm loss: 6.559717E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1690/  128728 | consumed samples:        27040 | consumed tokens:     55377920 | elapsed time per iteration (s): 15.23 | learning rate: 8.860E-06 | global batch size:    16 | lm loss: 6.734614E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1691/  128728 | consumed samples:        27056 | consumed tokens:     55410688 | elapsed time per iteration (s): 15.23 | learning rate: 8.866E-06 | global batch size:    16 | lm loss: 6.887871E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1692/  128728 | consumed samples:        27072 | consumed tokens:     55443456 | elapsed time per iteration (s): 15.23 | learning rate: 8.871E-06 | global batch size:    16 | lm loss: 6.599645E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1693/  128728 | consumed samples:        27088 | consumed tokens:     55476224 | elapsed time per iteration (s): 15.21 | learning rate: 8.876E-06 | global batch size:    16 | lm loss: 6.764441E+00 | grad norm: 2.068 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1694/  128728 | consumed samples:        27104 | consumed tokens:     55508992 | elapsed time per iteration (s): 15.25 | learning rate: 8.881E-06 | global batch size:    16 | lm loss: 6.749056E+00 | grad norm: 0.931 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1695/  128728 | consumed samples:        27120 | consumed tokens:     55541760 | elapsed time per iteration (s): 15.23 | learning rate: 8.887E-06 | global batch size:    16 | lm loss: 6.714129E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1696/  128728 | consumed samples:        27136 | consumed tokens:     55574528 | elapsed time per iteration (s): 15.24 | learning rate: 8.892E-06 | global batch size:    16 | lm loss: 6.672210E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1697/  128728 | consumed samples:        27152 | consumed tokens:     55607296 | elapsed time per iteration (s): 15.26 | learning rate: 8.897E-06 | global batch size:    16 | lm loss: 6.633732E+00 | grad norm: 1.072 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1698/  128728 | consumed samples:        27168 | consumed tokens:     55640064 | elapsed time per iteration (s): 15.27 | learning rate: 8.902E-06 | global batch size:    16 | lm loss: 6.510969E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1699/  128728 | consumed samples:        27184 | consumed tokens:     55672832 | elapsed time per iteration (s): 15.26 | learning rate: 8.908E-06 | global batch size:    16 | lm loss: 6.668943E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1700/  128728 | consumed samples:        27200 | consumed tokens:     55705600 | elapsed time per iteration (s): 15.21 | learning rate: 8.913E-06 | global batch size:    16 | lm loss: 6.773491E+00 | grad norm: 0.973 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1701/  128728 | consumed samples:        27216 | consumed tokens:     55738368 | elapsed time per iteration (s): 15.27 | learning rate: 8.918E-06 | global batch size:    16 | lm loss: 6.664038E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1702/  128728 | consumed samples:        27232 | consumed tokens:     55771136 | elapsed time per iteration (s): 15.27 | learning rate: 8.923E-06 | global batch size:    16 | lm loss: 6.511447E+00 | grad norm: 1.224 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1703/  128728 | consumed samples:        27248 | consumed tokens:     55803904 | elapsed time per iteration (s): 15.18 | learning rate: 8.929E-06 | global batch size:    16 | lm loss: 6.659809E+00 | grad norm: 1.415 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1704/  128728 | consumed samples:        27264 | consumed tokens:     55836672 | elapsed time per iteration (s): 15.24 | learning rate: 8.934E-06 | global batch size:    16 | lm loss: 6.674672E+00 | grad norm: 0.987 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1705/  128728 | consumed samples:        27280 | consumed tokens:     55869440 | elapsed time per iteration (s): 15.25 | learning rate: 8.939E-06 | global batch size:    16 | lm loss: 6.860772E+00 | grad norm: 0.972 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1706/  128728 | consumed samples:        27296 | consumed tokens:     55902208 | elapsed time per iteration (s): 15.24 | learning rate: 8.944E-06 | global batch size:    16 | lm loss: 6.839284E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1707/  128728 | consumed samples:        27312 | consumed tokens:     55934976 | elapsed time per iteration (s): 15.24 | learning rate: 8.950E-06 | global batch size:    16 | lm loss: 6.650226E+00 | grad norm: 0.978 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1708/  128728 | consumed samples:        27328 | consumed tokens:     55967744 | elapsed time per iteration (s): 15.25 | learning rate: 8.955E-06 | global batch size:    16 | lm loss: 6.606098E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1709/  128728 | consumed samples:        27344 | consumed tokens:     56000512 | elapsed time per iteration (s): 15.23 | learning rate: 8.960E-06 | global batch size:    16 | lm loss: 6.536633E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1710/  128728 | consumed samples:        27360 | consumed tokens:     56033280 | elapsed time per iteration (s): 15.23 | learning rate: 8.965E-06 | global batch size:    16 | lm loss: 6.541372E+00 | grad norm: 1.250 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1711/  128728 | consumed samples:        27376 | consumed tokens:     56066048 | elapsed time per iteration (s): 15.22 | learning rate: 8.971E-06 | global batch size:    16 | lm loss: 6.686945E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1712/  128728 | consumed samples:        27392 | consumed tokens:     56098816 | elapsed time per iteration (s): 15.27 | learning rate: 8.976E-06 | global batch size:    16 | lm loss: 6.757609E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1713/  128728 | consumed samples:        27408 | consumed tokens:     56131584 | elapsed time per iteration (s): 15.25 | learning rate: 8.981E-06 | global batch size:    16 | lm loss: 6.817521E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1714/  128728 | consumed samples:        27424 | consumed tokens:     56164352 | elapsed time per iteration (s): 15.24 | learning rate: 8.986E-06 | global batch size:    16 | lm loss: 6.714323E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1715/  128728 | consumed samples:        27440 | consumed tokens:     56197120 | elapsed time per iteration (s): 15.25 | learning rate: 8.992E-06 | global batch size:    16 | lm loss: 6.906718E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     1716/  128728 | consumed samples:        27456 | consumed tokens:     56229888 | elapsed time per iteration (s): 15.26 | learning rate: 8.997E-06 | global batch size:    16 | lm loss: 6.650734E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1717/  128728 | consumed samples:        27472 | consumed tokens:     56262656 | elapsed time per iteration (s): 15.23 | learning rate: 9.002E-06 | global batch size:    16 | lm loss: 6.751576E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1718/  128728 | consumed samples:        27488 | consumed tokens:     56295424 | elapsed time per iteration (s): 15.24 | learning rate: 9.007E-06 | global batch size:    16 | lm loss: 6.557451E+00 | grad norm: 0.911 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1719/  128728 | consumed samples:        27504 | consumed tokens:     56328192 | elapsed time per iteration (s): 15.23 | learning rate: 9.013E-06 | global batch size:    16 | lm loss: 6.755507E+00 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1720/  128728 | consumed samples:        27520 | consumed tokens:     56360960 | elapsed time per iteration (s): 15.24 | learning rate: 9.018E-06 | global batch size:    16 | lm loss: 6.720637E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1721/  128728 | consumed samples:        27536 | consumed tokens:     56393728 | elapsed time per iteration (s): 15.21 | learning rate: 9.023E-06 | global batch size:    16 | lm loss: 6.478833E+00 | grad norm: 1.104 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1722/  128728 | consumed samples:        27552 | consumed tokens:     56426496 | elapsed time per iteration (s): 15.23 | learning rate: 9.028E-06 | global batch size:    16 | lm loss: 6.794544E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1723/  128728 | consumed samples:        27568 | consumed tokens:     56459264 | elapsed time per iteration (s): 15.25 | learning rate: 9.034E-06 | global batch size:    16 | lm loss: 6.539186E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1724/  128728 | consumed samples:        27584 | consumed tokens:     56492032 | elapsed time per iteration (s): 15.21 | learning rate: 9.039E-06 | global batch size:    16 | lm loss: 6.716591E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1725/  128728 | consumed samples:        27600 | consumed tokens:     56524800 | elapsed time per iteration (s): 15.26 | learning rate: 9.044E-06 | global batch size:    16 | lm loss: 6.714130E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1726/  128728 | consumed samples:        27616 | consumed tokens:     56557568 | elapsed time per iteration (s): 15.24 | learning rate: 9.049E-06 | global batch size:    16 | lm loss: 6.706204E+00 | grad norm: 1.262 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1727/  128728 | consumed samples:        27632 | consumed tokens:     56590336 | elapsed time per iteration (s): 15.23 | learning rate: 9.054E-06 | global batch size:    16 | lm loss: 6.605562E+00 | grad norm: 1.054 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1728/  128728 | consumed samples:        27648 | consumed tokens:     56623104 | elapsed time per iteration (s): 15.24 | learning rate: 9.060E-06 | global batch size:    16 | lm loss: 6.806660E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1729/  128728 | consumed samples:        27664 | consumed tokens:     56655872 | elapsed time per iteration (s): 15.23 | learning rate: 9.065E-06 | global batch size:    16 | lm loss: 6.988270E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1730/  128728 | consumed samples:        27680 | consumed tokens:     56688640 | elapsed time per iteration (s): 15.23 | learning rate: 9.070E-06 | global batch size:    16 | lm loss: 6.551892E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1731/  128728 | consumed samples:        27696 | consumed tokens:     56721408 | elapsed time per iteration (s): 15.22 | learning rate: 9.075E-06 | global batch size:    16 | lm loss: 6.359119E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1732/  128728 | consumed samples:        27712 | consumed tokens:     56754176 | elapsed time per iteration (s): 15.23 | learning rate: 9.081E-06 | global batch size:    16 | lm loss: 6.745327E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1733/  128728 | consumed samples:        27728 | consumed tokens:     56786944 | elapsed time per iteration (s): 15.18 | learning rate: 9.086E-06 | global batch size:    16 | lm loss: 6.495726E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1734/  128728 | consumed samples:        27744 | consumed tokens:     56819712 | elapsed time per iteration (s): 15.23 | learning rate: 9.091E-06 | global batch size:    16 | lm loss: 6.595272E+00 | grad norm: 0.867 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1735/  128728 | consumed samples:        27760 | consumed tokens:     56852480 | elapsed time per iteration (s): 15.24 | learning rate: 9.096E-06 | global batch size:    16 | lm loss: 6.750875E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1736/  128728 | consumed samples:        27776 | consumed tokens:     56885248 | elapsed time per iteration (s): 15.25 | learning rate: 9.102E-06 | global batch size:    16 | lm loss: 6.515401E+00 | grad norm: 1.182 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1737/  128728 | consumed samples:        27792 | consumed tokens:     56918016 | elapsed time per iteration (s): 15.24 | learning rate: 9.107E-06 | global batch size:    16 | lm loss: 6.513342E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1738/  128728 | consumed samples:        27808 | consumed tokens:     56950784 | elapsed time per iteration (s): 15.23 | learning rate: 9.112E-06 | global batch size:    16 | lm loss: 6.627918E+00 | grad norm: 0.936 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1739/  128728 | consumed samples:        27824 | consumed tokens:     56983552 | elapsed time per iteration (s): 15.24 | learning rate: 9.117E-06 | global batch size:    16 | lm loss: 6.685300E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1740/  128728 | consumed samples:        27840 | consumed tokens:     57016320 | elapsed time per iteration (s): 15.26 | learning rate: 9.123E-06 | global batch size:    16 | lm loss: 6.637107E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1741/  128728 | consumed samples:        27856 | consumed tokens:     57049088 | elapsed time per iteration (s): 15.21 | learning rate: 9.128E-06 | global batch size:    16 | lm loss: 6.694432E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1742/  128728 | consumed samples:        27872 | consumed tokens:     57081856 | elapsed time per iteration (s): 15.29 | learning rate: 9.133E-06 | global batch size:    16 | lm loss: 6.972545E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     1743/  128728 | consumed samples:        27888 | consumed tokens:     57114624 | elapsed time per iteration (s): 15.23 | learning rate: 9.138E-06 | global batch size:    16 | lm loss: 6.513799E+00 | grad norm: 0.962 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1744/  128728 | consumed samples:        27904 | consumed tokens:     57147392 | elapsed time per iteration (s): 15.21 | learning rate: 9.144E-06 | global batch size:    16 | lm loss: 6.752306E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1745/  128728 | consumed samples:        27920 | consumed tokens:     57180160 | elapsed time per iteration (s): 15.20 | learning rate: 9.149E-06 | global batch size:    16 | lm loss: 6.714429E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1746/  128728 | consumed samples:        27936 | consumed tokens:     57212928 | elapsed time per iteration (s): 15.24 | learning rate: 9.154E-06 | global batch size:    16 | lm loss: 6.613607E+00 | grad norm: 1.206 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1747/  128728 | consumed samples:        27952 | consumed tokens:     57245696 | elapsed time per iteration (s): 15.20 | learning rate: 9.159E-06 | global batch size:    16 | lm loss: 6.643983E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1748/  128728 | consumed samples:        27968 | consumed tokens:     57278464 | elapsed time per iteration (s): 15.23 | learning rate: 9.165E-06 | global batch size:    16 | lm loss: 6.584989E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1749/  128728 | consumed samples:        27984 | consumed tokens:     57311232 | elapsed time per iteration (s): 15.19 | learning rate: 9.170E-06 | global batch size:    16 | lm loss: 6.636932E+00 | grad norm: 0.996 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     1750/  128728 | consumed samples:        28000 | consumed tokens:     57344000 | elapsed time per iteration (s): 15.24 | learning rate: 9.175E-06 | global batch size:    16 | lm loss: 6.609263E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1751/  128728 | consumed samples:        28016 | consumed tokens:     57376768 | elapsed time per iteration (s): 15.22 | learning rate: 9.180E-06 | global batch size:    16 | lm loss: 6.592394E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1752/  128728 | consumed samples:        28032 | consumed tokens:     57409536 | elapsed time per iteration (s): 15.22 | learning rate: 9.186E-06 | global batch size:    16 | lm loss: 6.624197E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1753/  128728 | consumed samples:        28048 | consumed tokens:     57442304 | elapsed time per iteration (s): 15.21 | learning rate: 9.191E-06 | global batch size:    16 | lm loss: 6.544185E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1754/  128728 | consumed samples:        28064 | consumed tokens:     57475072 | elapsed time per iteration (s): 15.14 | learning rate: 9.196E-06 | global batch size:    16 | lm loss: 6.537138E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     1755/  128728 | consumed samples:        28080 | consumed tokens:     57507840 | elapsed time per iteration (s): 15.23 | learning rate: 9.201E-06 | global batch size:    16 | lm loss: 6.729046E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1756/  128728 | consumed samples:        28096 | consumed tokens:     57540608 | elapsed time per iteration (s): 15.20 | learning rate: 9.207E-06 | global batch size:    16 | lm loss: 6.539384E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1757/  128728 | consumed samples:        28112 | consumed tokens:     57573376 | elapsed time per iteration (s): 15.20 | learning rate: 9.212E-06 | global batch size:    16 | lm loss: 6.607846E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1758/  128728 | consumed samples:        28128 | consumed tokens:     57606144 | elapsed time per iteration (s): 15.24 | learning rate: 9.217E-06 | global batch size:    16 | lm loss: 6.539383E+00 | grad norm: 1.002 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1759/  128728 | consumed samples:        28144 | consumed tokens:     57638912 | elapsed time per iteration (s): 15.23 | learning rate: 9.222E-06 | global batch size:    16 | lm loss: 6.513782E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1760/  128728 | consumed samples:        28160 | consumed tokens:     57671680 | elapsed time per iteration (s): 15.24 | learning rate: 9.227E-06 | global batch size:    16 | lm loss: 6.585566E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1761/  128728 | consumed samples:        28176 | consumed tokens:     57704448 | elapsed time per iteration (s): 15.17 | learning rate: 9.233E-06 | global batch size:    16 | lm loss: 6.562658E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1762/  128728 | consumed samples:        28192 | consumed tokens:     57737216 | elapsed time per iteration (s): 15.24 | learning rate: 9.238E-06 | global batch size:    16 | lm loss: 6.564455E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1763/  128728 | consumed samples:        28208 | consumed tokens:     57769984 | elapsed time per iteration (s): 15.27 | learning rate: 9.243E-06 | global batch size:    16 | lm loss: 6.471663E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1764/  128728 | consumed samples:        28224 | consumed tokens:     57802752 | elapsed time per iteration (s): 15.16 | learning rate: 9.248E-06 | global batch size:    16 | lm loss: 6.601748E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1765/  128728 | consumed samples:        28240 | consumed tokens:     57835520 | elapsed time per iteration (s): 15.20 | learning rate: 9.254E-06 | global batch size:    16 | lm loss: 6.736907E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1766/  128728 | consumed samples:        28256 | consumed tokens:     57868288 | elapsed time per iteration (s): 15.15 | learning rate: 9.259E-06 | global batch size:    16 | lm loss: 6.662552E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     1767/  128728 | consumed samples:        28272 | consumed tokens:     57901056 | elapsed time per iteration (s): 15.23 | learning rate: 9.264E-06 | global batch size:    16 | lm loss: 6.668775E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1768/  128728 | consumed samples:        28288 | consumed tokens:     57933824 | elapsed time per iteration (s): 15.19 | learning rate: 9.269E-06 | global batch size:    16 | lm loss: 6.802093E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     1769/  128728 | consumed samples:        28304 | consumed tokens:     57966592 | elapsed time per iteration (s): 15.25 | learning rate: 9.275E-06 | global batch size:    16 | lm loss: 6.619685E+00 | grad norm: 1.255 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1770/  128728 | consumed samples:        28320 | consumed tokens:     57999360 | elapsed time per iteration (s): 15.22 | learning rate: 9.280E-06 | global batch size:    16 | lm loss: 6.863540E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1771/  128728 | consumed samples:        28336 | consumed tokens:     58032128 | elapsed time per iteration (s): 15.20 | learning rate: 9.285E-06 | global batch size:    16 | lm loss: 6.705997E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1772/  128728 | consumed samples:        28352 | consumed tokens:     58064896 | elapsed time per iteration (s): 15.21 | learning rate: 9.290E-06 | global batch size:    16 | lm loss: 6.656632E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1773/  128728 | consumed samples:        28368 | consumed tokens:     58097664 | elapsed time per iteration (s): 15.27 | learning rate: 9.296E-06 | global batch size:    16 | lm loss: 6.472975E+00 | grad norm: 1.079 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1774/  128728 | consumed samples:        28384 | consumed tokens:     58130432 | elapsed time per iteration (s): 15.25 | learning rate: 9.301E-06 | global batch size:    16 | lm loss: 6.678162E+00 | grad norm: 1.091 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1775/  128728 | consumed samples:        28400 | consumed tokens:     58163200 | elapsed time per iteration (s): 15.24 | learning rate: 9.306E-06 | global batch size:    16 | lm loss: 6.682146E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1776/  128728 | consumed samples:        28416 | consumed tokens:     58195968 | elapsed time per iteration (s): 15.25 | learning rate: 9.311E-06 | global batch size:    16 | lm loss: 6.408243E+00 | grad norm: 0.978 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1777/  128728 | consumed samples:        28432 | consumed tokens:     58228736 | elapsed time per iteration (s): 15.22 | learning rate: 9.317E-06 | global batch size:    16 | lm loss: 6.565637E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1778/  128728 | consumed samples:        28448 | consumed tokens:     58261504 | elapsed time per iteration (s): 15.24 | learning rate: 9.322E-06 | global batch size:    16 | lm loss: 6.499868E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1779/  128728 | consumed samples:        28464 | consumed tokens:     58294272 | elapsed time per iteration (s): 15.26 | learning rate: 9.327E-06 | global batch size:    16 | lm loss: 6.529296E+00 | grad norm: 1.027 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1780/  128728 | consumed samples:        28480 | consumed tokens:     58327040 | elapsed time per iteration (s): 15.26 | learning rate: 9.332E-06 | global batch size:    16 | lm loss: 6.737674E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1781/  128728 | consumed samples:        28496 | consumed tokens:     58359808 | elapsed time per iteration (s): 15.26 | learning rate: 9.338E-06 | global batch size:    16 | lm loss: 6.714326E+00 | grad norm: 1.247 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1782/  128728 | consumed samples:        28512 | consumed tokens:     58392576 | elapsed time per iteration (s): 15.25 | learning rate: 9.343E-06 | global batch size:    16 | lm loss: 6.741374E+00 | grad norm: 0.991 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1783/  128728 | consumed samples:        28528 | consumed tokens:     58425344 | elapsed time per iteration (s): 15.23 | learning rate: 9.348E-06 | global batch size:    16 | lm loss: 6.716001E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1784/  128728 | consumed samples:        28544 | consumed tokens:     58458112 | elapsed time per iteration (s): 15.25 | learning rate: 9.353E-06 | global batch size:    16 | lm loss: 6.781655E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1785/  128728 | consumed samples:        28560 | consumed tokens:     58490880 | elapsed time per iteration (s): 15.24 | learning rate: 9.359E-06 | global batch size:    16 | lm loss: 6.668215E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1786/  128728 | consumed samples:        28576 | consumed tokens:     58523648 | elapsed time per iteration (s): 15.21 | learning rate: 9.364E-06 | global batch size:    16 | lm loss: 6.672732E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1787/  128728 | consumed samples:        28592 | consumed tokens:     58556416 | elapsed time per iteration (s): 15.17 | learning rate: 9.369E-06 | global batch size:    16 | lm loss: 6.688550E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1788/  128728 | consumed samples:        28608 | consumed tokens:     58589184 | elapsed time per iteration (s): 15.27 | learning rate: 9.374E-06 | global batch size:    16 | lm loss: 6.671909E+00 | grad norm: 0.932 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1789/  128728 | consumed samples:        28624 | consumed tokens:     58621952 | elapsed time per iteration (s): 15.24 | learning rate: 9.380E-06 | global batch size:    16 | lm loss: 6.553540E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1790/  128728 | consumed samples:        28640 | consumed tokens:     58654720 | elapsed time per iteration (s): 15.24 | learning rate: 9.385E-06 | global batch size:    16 | lm loss: 6.730831E+00 | grad norm: 1.138 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1791/  128728 | consumed samples:        28656 | consumed tokens:     58687488 | elapsed time per iteration (s): 15.26 | learning rate: 9.390E-06 | global batch size:    16 | lm loss: 6.436703E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1792/  128728 | consumed samples:        28672 | consumed tokens:     58720256 | elapsed time per iteration (s): 15.22 | learning rate: 9.395E-06 | global batch size:    16 | lm loss: 6.470987E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1793/  128728 | consumed samples:        28688 | consumed tokens:     58753024 | elapsed time per iteration (s): 15.22 | learning rate: 9.401E-06 | global batch size:    16 | lm loss: 6.891861E+00 | grad norm: 1.009 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1794/  128728 | consumed samples:        28704 | consumed tokens:     58785792 | elapsed time per iteration (s): 15.25 | learning rate: 9.406E-06 | global batch size:    16 | lm loss: 6.579654E+00 | grad norm: 0.865 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1795/  128728 | consumed samples:        28720 | consumed tokens:     58818560 | elapsed time per iteration (s): 15.20 | learning rate: 9.411E-06 | global batch size:    16 | lm loss: 6.601382E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1796/  128728 | consumed samples:        28736 | consumed tokens:     58851328 | elapsed time per iteration (s): 15.22 | learning rate: 9.416E-06 | global batch size:    16 | lm loss: 6.712410E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1797/  128728 | consumed samples:        28752 | consumed tokens:     58884096 | elapsed time per iteration (s): 15.28 | learning rate: 9.421E-06 | global batch size:    16 | lm loss: 6.652021E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1798/  128728 | consumed samples:        28768 | consumed tokens:     58916864 | elapsed time per iteration (s): 15.21 | learning rate: 9.427E-06 | global batch size:    16 | lm loss: 6.661202E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1799/  128728 | consumed samples:        28784 | consumed tokens:     58949632 | elapsed time per iteration (s): 15.21 | learning rate: 9.432E-06 | global batch size:    16 | lm loss: 6.523858E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1800/  128728 | consumed samples:        28800 | consumed tokens:     58982400 | elapsed time per iteration (s): 15.24 | learning rate: 9.437E-06 | global batch size:    16 | lm loss: 6.623683E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1801/  128728 | consumed samples:        28816 | consumed tokens:     59015168 | elapsed time per iteration (s): 15.23 | learning rate: 9.442E-06 | global batch size:    16 | lm loss: 6.714439E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1802/  128728 | consumed samples:        28832 | consumed tokens:     59047936 | elapsed time per iteration (s): 15.23 | learning rate: 9.448E-06 | global batch size:    16 | lm loss: 6.545686E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1803/  128728 | consumed samples:        28848 | consumed tokens:     59080704 | elapsed time per iteration (s): 15.24 | learning rate: 9.453E-06 | global batch size:    16 | lm loss: 6.519464E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1804/  128728 | consumed samples:        28864 | consumed tokens:     59113472 | elapsed time per iteration (s): 15.22 | learning rate: 9.458E-06 | global batch size:    16 | lm loss: 6.891013E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1805/  128728 | consumed samples:        28880 | consumed tokens:     59146240 | elapsed time per iteration (s): 15.21 | learning rate: 9.463E-06 | global batch size:    16 | lm loss: 6.718174E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1806/  128728 | consumed samples:        28896 | consumed tokens:     59179008 | elapsed time per iteration (s): 15.24 | learning rate: 9.469E-06 | global batch size:    16 | lm loss: 6.641480E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1807/  128728 | consumed samples:        28912 | consumed tokens:     59211776 | elapsed time per iteration (s): 15.23 | learning rate: 9.474E-06 | global batch size:    16 | lm loss: 6.519784E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1808/  128728 | consumed samples:        28928 | consumed tokens:     59244544 | elapsed time per iteration (s): 15.21 | learning rate: 9.479E-06 | global batch size:    16 | lm loss: 6.584937E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1809/  128728 | consumed samples:        28944 | consumed tokens:     59277312 | elapsed time per iteration (s): 15.22 | learning rate: 9.484E-06 | global batch size:    16 | lm loss: 6.330964E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1810/  128728 | consumed samples:        28960 | consumed tokens:     59310080 | elapsed time per iteration (s): 15.23 | learning rate: 9.490E-06 | global batch size:    16 | lm loss: 7.042406E+00 | grad norm: 0.997 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1811/  128728 | consumed samples:        28976 | consumed tokens:     59342848 | elapsed time per iteration (s): 15.23 | learning rate: 9.495E-06 | global batch size:    16 | lm loss: 6.472970E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1812/  128728 | consumed samples:        28992 | consumed tokens:     59375616 | elapsed time per iteration (s): 15.23 | learning rate: 9.500E-06 | global batch size:    16 | lm loss: 6.761879E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1813/  128728 | consumed samples:        29008 | consumed tokens:     59408384 | elapsed time per iteration (s): 15.24 | learning rate: 9.505E-06 | global batch size:    16 | lm loss: 6.489796E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1814/  128728 | consumed samples:        29024 | consumed tokens:     59441152 | elapsed time per iteration (s): 15.24 | learning rate: 9.511E-06 | global batch size:    16 | lm loss: 6.517369E+00 | grad norm: 1.229 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1815/  128728 | consumed samples:        29040 | consumed tokens:     59473920 | elapsed time per iteration (s): 15.24 | learning rate: 9.516E-06 | global batch size:    16 | lm loss: 6.735540E+00 | grad norm: 1.278 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1816/  128728 | consumed samples:        29056 | consumed tokens:     59506688 | elapsed time per iteration (s): 15.22 | learning rate: 9.521E-06 | global batch size:    16 | lm loss: 6.628697E+00 | grad norm: 1.480 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1817/  128728 | consumed samples:        29072 | consumed tokens:     59539456 | elapsed time per iteration (s): 15.22 | learning rate: 9.526E-06 | global batch size:    16 | lm loss: 6.515108E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1818/  128728 | consumed samples:        29088 | consumed tokens:     59572224 | elapsed time per iteration (s): 15.24 | learning rate: 9.532E-06 | global batch size:    16 | lm loss: 6.639629E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1819/  128728 | consumed samples:        29104 | consumed tokens:     59604992 | elapsed time per iteration (s): 15.22 | learning rate: 9.537E-06 | global batch size:    16 | lm loss: 6.651646E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1820/  128728 | consumed samples:        29120 | consumed tokens:     59637760 | elapsed time per iteration (s): 15.22 | learning rate: 9.542E-06 | global batch size:    16 | lm loss: 6.575983E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1821/  128728 | consumed samples:        29136 | consumed tokens:     59670528 | elapsed time per iteration (s): 15.15 | learning rate: 9.547E-06 | global batch size:    16 | lm loss: 6.677689E+00 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     1822/  128728 | consumed samples:        29152 | consumed tokens:     59703296 | elapsed time per iteration (s): 15.17 | learning rate: 9.553E-06 | global batch size:    16 | lm loss: 6.558556E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1823/  128728 | consumed samples:        29168 | consumed tokens:     59736064 | elapsed time per iteration (s): 15.22 | learning rate: 9.558E-06 | global batch size:    16 | lm loss: 6.579345E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1824/  128728 | consumed samples:        29184 | consumed tokens:     59768832 | elapsed time per iteration (s): 15.18 | learning rate: 9.563E-06 | global batch size:    16 | lm loss: 6.645849E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1825/  128728 | consumed samples:        29200 | consumed tokens:     59801600 | elapsed time per iteration (s): 15.23 | learning rate: 9.568E-06 | global batch size:    16 | lm loss: 6.550450E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1826/  128728 | consumed samples:        29216 | consumed tokens:     59834368 | elapsed time per iteration (s): 15.25 | learning rate: 9.574E-06 | global batch size:    16 | lm loss: 6.690180E+00 | grad norm: 0.949 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     1827/  128728 | consumed samples:        29232 | consumed tokens:     59867136 | elapsed time per iteration (s): 15.26 | learning rate: 9.579E-06 | global batch size:    16 | lm loss: 6.688923E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1828/  128728 | consumed samples:        29248 | consumed tokens:     59899904 | elapsed time per iteration (s): 15.25 | learning rate: 9.584E-06 | global batch size:    16 | lm loss: 6.797194E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1829/  128728 | consumed samples:        29264 | consumed tokens:     59932672 | elapsed time per iteration (s): 15.28 | learning rate: 9.589E-06 | global batch size:    16 | lm loss: 6.436186E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1830/  128728 | consumed samples:        29280 | consumed tokens:     59965440 | elapsed time per iteration (s): 15.23 | learning rate: 9.594E-06 | global batch size:    16 | lm loss: 6.853899E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1831/  128728 | consumed samples:        29296 | consumed tokens:     59998208 | elapsed time per iteration (s): 15.24 | learning rate: 9.600E-06 | global batch size:    16 | lm loss: 6.458448E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1832/  128728 | consumed samples:        29312 | consumed tokens:     60030976 | elapsed time per iteration (s): 15.24 | learning rate: 9.605E-06 | global batch size:    16 | lm loss: 6.671127E+00 | grad norm: 0.938 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1833/  128728 | consumed samples:        29328 | consumed tokens:     60063744 | elapsed time per iteration (s): 15.24 | learning rate: 9.610E-06 | global batch size:    16 | lm loss: 6.545115E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1834/  128728 | consumed samples:        29344 | consumed tokens:     60096512 | elapsed time per iteration (s): 15.26 | learning rate: 9.615E-06 | global batch size:    16 | lm loss: 6.780546E+00 | grad norm: 1.384 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1835/  128728 | consumed samples:        29360 | consumed tokens:     60129280 | elapsed time per iteration (s): 15.21 | learning rate: 9.621E-06 | global batch size:    16 | lm loss: 6.472826E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1836/  128728 | consumed samples:        29376 | consumed tokens:     60162048 | elapsed time per iteration (s): 15.23 | learning rate: 9.626E-06 | global batch size:    16 | lm loss: 6.723257E+00 | grad norm: 0.891 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1837/  128728 | consumed samples:        29392 | consumed tokens:     60194816 | elapsed time per iteration (s): 15.24 | learning rate: 9.631E-06 | global batch size:    16 | lm loss: 6.483226E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1838/  128728 | consumed samples:        29408 | consumed tokens:     60227584 | elapsed time per iteration (s): 15.23 | learning rate: 9.636E-06 | global batch size:    16 | lm loss: 6.481052E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1839/  128728 | consumed samples:        29424 | consumed tokens:     60260352 | elapsed time per iteration (s): 15.21 | learning rate: 9.642E-06 | global batch size:    16 | lm loss: 6.451497E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1840/  128728 | consumed samples:        29440 | consumed tokens:     60293120 | elapsed time per iteration (s): 15.21 | learning rate: 9.647E-06 | global batch size:    16 | lm loss: 6.728784E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1841/  128728 | consumed samples:        29456 | consumed tokens:     60325888 | elapsed time per iteration (s): 15.23 | learning rate: 9.652E-06 | global batch size:    16 | lm loss: 6.508964E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1842/  128728 | consumed samples:        29472 | consumed tokens:     60358656 | elapsed time per iteration (s): 15.22 | learning rate: 9.657E-06 | global batch size:    16 | lm loss: 6.681833E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1843/  128728 | consumed samples:        29488 | consumed tokens:     60391424 | elapsed time per iteration (s): 15.27 | learning rate: 9.663E-06 | global batch size:    16 | lm loss: 6.648950E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1844/  128728 | consumed samples:        29504 | consumed tokens:     60424192 | elapsed time per iteration (s): 15.21 | learning rate: 9.668E-06 | global batch size:    16 | lm loss: 6.666204E+00 | grad norm: 1.899 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1845/  128728 | consumed samples:        29520 | consumed tokens:     60456960 | elapsed time per iteration (s): 15.22 | learning rate: 9.673E-06 | global batch size:    16 | lm loss: 6.498180E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1846/  128728 | consumed samples:        29536 | consumed tokens:     60489728 | elapsed time per iteration (s): 15.22 | learning rate: 9.678E-06 | global batch size:    16 | lm loss: 6.420746E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1847/  128728 | consumed samples:        29552 | consumed tokens:     60522496 | elapsed time per iteration (s): 15.33 | learning rate: 9.684E-06 | global batch size:    16 | lm loss: 6.513135E+00 | grad norm: 1.344 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.043 | TFLOPs: 7.99 |
[default7]: iteration     1848/  128728 | consumed samples:        29568 | consumed tokens:     60555264 | elapsed time per iteration (s): 15.27 | learning rate: 9.689E-06 | global batch size:    16 | lm loss: 6.598331E+00 | grad norm: 1.024 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1849/  128728 | consumed samples:        29584 | consumed tokens:     60588032 | elapsed time per iteration (s): 15.23 | learning rate: 9.694E-06 | global batch size:    16 | lm loss: 6.658598E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1850/  128728 | consumed samples:        29600 | consumed tokens:     60620800 | elapsed time per iteration (s): 15.22 | learning rate: 9.699E-06 | global batch size:    16 | lm loss: 6.735951E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1851/  128728 | consumed samples:        29616 | consumed tokens:     60653568 | elapsed time per iteration (s): 15.21 | learning rate: 9.705E-06 | global batch size:    16 | lm loss: 6.589662E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1852/  128728 | consumed samples:        29632 | consumed tokens:     60686336 | elapsed time per iteration (s): 15.27 | learning rate: 9.710E-06 | global batch size:    16 | lm loss: 6.598696E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1853/  128728 | consumed samples:        29648 | consumed tokens:     60719104 | elapsed time per iteration (s): 15.27 | learning rate: 9.715E-06 | global batch size:    16 | lm loss: 6.593414E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1854/  128728 | consumed samples:        29664 | consumed tokens:     60751872 | elapsed time per iteration (s): 15.25 | learning rate: 9.720E-06 | global batch size:    16 | lm loss: 6.430328E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1855/  128728 | consumed samples:        29680 | consumed tokens:     60784640 | elapsed time per iteration (s): 15.22 | learning rate: 9.726E-06 | global batch size:    16 | lm loss: 6.661034E+00 | grad norm: 0.877 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1856/  128728 | consumed samples:        29696 | consumed tokens:     60817408 | elapsed time per iteration (s): 15.26 | learning rate: 9.731E-06 | global batch size:    16 | lm loss: 6.709377E+00 | grad norm: 1.023 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1857/  128728 | consumed samples:        29712 | consumed tokens:     60850176 | elapsed time per iteration (s): 15.21 | learning rate: 9.736E-06 | global batch size:    16 | lm loss: 6.679298E+00 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1858/  128728 | consumed samples:        29728 | consumed tokens:     60882944 | elapsed time per iteration (s): 15.25 | learning rate: 9.741E-06 | global batch size:    16 | lm loss: 6.827992E+00 | grad norm: 0.867 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1859/  128728 | consumed samples:        29744 | consumed tokens:     60915712 | elapsed time per iteration (s): 15.24 | learning rate: 9.747E-06 | global batch size:    16 | lm loss: 6.568586E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1860/  128728 | consumed samples:        29760 | consumed tokens:     60948480 | elapsed time per iteration (s): 15.21 | learning rate: 9.752E-06 | global batch size:    16 | lm loss: 6.410093E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1861/  128728 | consumed samples:        29776 | consumed tokens:     60981248 | elapsed time per iteration (s): 15.22 | learning rate: 9.757E-06 | global batch size:    16 | lm loss: 6.323568E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1862/  128728 | consumed samples:        29792 | consumed tokens:     61014016 | elapsed time per iteration (s): 15.21 | learning rate: 9.762E-06 | global batch size:    16 | lm loss: 6.819780E+00 | grad norm: 0.883 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1863/  128728 | consumed samples:        29808 | consumed tokens:     61046784 | elapsed time per iteration (s): 15.24 | learning rate: 9.768E-06 | global batch size:    16 | lm loss: 6.857122E+00 | grad norm: 1.314 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1864/  128728 | consumed samples:        29824 | consumed tokens:     61079552 | elapsed time per iteration (s): 15.27 | learning rate: 9.773E-06 | global batch size:    16 | lm loss: 6.621314E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1865/  128728 | consumed samples:        29840 | consumed tokens:     61112320 | elapsed time per iteration (s): 15.20 | learning rate: 9.778E-06 | global batch size:    16 | lm loss: 6.558571E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1866/  128728 | consumed samples:        29856 | consumed tokens:     61145088 | elapsed time per iteration (s): 15.21 | learning rate: 9.783E-06 | global batch size:    16 | lm loss: 6.498933E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1867/  128728 | consumed samples:        29872 | consumed tokens:     61177856 | elapsed time per iteration (s): 15.22 | learning rate: 9.788E-06 | global batch size:    16 | lm loss: 6.822206E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1868/  128728 | consumed samples:        29888 | consumed tokens:     61210624 | elapsed time per iteration (s): 15.18 | learning rate: 9.794E-06 | global batch size:    16 | lm loss: 6.600270E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1869/  128728 | consumed samples:        29904 | consumed tokens:     61243392 | elapsed time per iteration (s): 15.22 | learning rate: 9.799E-06 | global batch size:    16 | lm loss: 6.469594E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1870/  128728 | consumed samples:        29920 | consumed tokens:     61276160 | elapsed time per iteration (s): 15.25 | learning rate: 9.804E-06 | global batch size:    16 | lm loss: 6.446286E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1871/  128728 | consumed samples:        29936 | consumed tokens:     61308928 | elapsed time per iteration (s): 15.23 | learning rate: 9.809E-06 | global batch size:    16 | lm loss: 6.491003E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1872/  128728 | consumed samples:        29952 | consumed tokens:     61341696 | elapsed time per iteration (s): 15.23 | learning rate: 9.815E-06 | global batch size:    16 | lm loss: 6.493572E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1873/  128728 | consumed samples:        29968 | consumed tokens:     61374464 | elapsed time per iteration (s): 15.24 | learning rate: 9.820E-06 | global batch size:    16 | lm loss: 6.607419E+00 | grad norm: 1.124 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1874/  128728 | consumed samples:        29984 | consumed tokens:     61407232 | elapsed time per iteration (s): 15.20 | learning rate: 9.825E-06 | global batch size:    16 | lm loss: 6.643625E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1875/  128728 | consumed samples:        30000 | consumed tokens:     61440000 | elapsed time per iteration (s): 15.21 | learning rate: 9.830E-06 | global batch size:    16 | lm loss: 6.527872E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1876/  128728 | consumed samples:        30016 | consumed tokens:     61472768 | elapsed time per iteration (s): 15.23 | learning rate: 9.836E-06 | global batch size:    16 | lm loss: 6.579536E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1877/  128728 | consumed samples:        30032 | consumed tokens:     61505536 | elapsed time per iteration (s): 15.27 | learning rate: 9.841E-06 | global batch size:    16 | lm loss: 6.619586E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1878/  128728 | consumed samples:        30048 | consumed tokens:     61538304 | elapsed time per iteration (s): 15.19 | learning rate: 9.846E-06 | global batch size:    16 | lm loss: 6.514913E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1879/  128728 | consumed samples:        30064 | consumed tokens:     61571072 | elapsed time per iteration (s): 15.24 | learning rate: 9.851E-06 | global batch size:    16 | lm loss: 6.534479E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1880/  128728 | consumed samples:        30080 | consumed tokens:     61603840 | elapsed time per iteration (s): 15.30 | learning rate: 9.857E-06 | global batch size:    16 | lm loss: 6.383130E+00 | grad norm: 1.318 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.045 | TFLOPs: 8.00 |
[default7]: iteration     1881/  128728 | consumed samples:        30096 | consumed tokens:     61636608 | elapsed time per iteration (s): 15.27 | learning rate: 9.862E-06 | global batch size:    16 | lm loss: 6.530272E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1882/  128728 | consumed samples:        30112 | consumed tokens:     61669376 | elapsed time per iteration (s): 15.23 | learning rate: 9.867E-06 | global batch size:    16 | lm loss: 6.505867E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1883/  128728 | consumed samples:        30128 | consumed tokens:     61702144 | elapsed time per iteration (s): 15.23 | learning rate: 9.872E-06 | global batch size:    16 | lm loss: 6.482748E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1884/  128728 | consumed samples:        30144 | consumed tokens:     61734912 | elapsed time per iteration (s): 15.26 | learning rate: 9.878E-06 | global batch size:    16 | lm loss: 6.664700E+00 | grad norm: 2.959 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1885/  128728 | consumed samples:        30160 | consumed tokens:     61767680 | elapsed time per iteration (s): 15.22 | learning rate: 9.883E-06 | global batch size:    16 | lm loss: 6.515076E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1886/  128728 | consumed samples:        30176 | consumed tokens:     61800448 | elapsed time per iteration (s): 15.27 | learning rate: 9.888E-06 | global batch size:    16 | lm loss: 6.282681E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     1887/  128728 | consumed samples:        30192 | consumed tokens:     61833216 | elapsed time per iteration (s): 15.23 | learning rate: 9.893E-06 | global batch size:    16 | lm loss: 6.580127E+00 | grad norm: 1.281 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1888/  128728 | consumed samples:        30208 | consumed tokens:     61865984 | elapsed time per iteration (s): 15.22 | learning rate: 9.899E-06 | global batch size:    16 | lm loss: 6.476642E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1889/  128728 | consumed samples:        30224 | consumed tokens:     61898752 | elapsed time per iteration (s): 15.23 | learning rate: 9.904E-06 | global batch size:    16 | lm loss: 6.487213E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1890/  128728 | consumed samples:        30240 | consumed tokens:     61931520 | elapsed time per iteration (s): 15.19 | learning rate: 9.909E-06 | global batch size:    16 | lm loss: 6.532672E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1891/  128728 | consumed samples:        30256 | consumed tokens:     61964288 | elapsed time per iteration (s): 15.23 | learning rate: 9.914E-06 | global batch size:    16 | lm loss: 6.400381E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1892/  128728 | consumed samples:        30272 | consumed tokens:     61997056 | elapsed time per iteration (s): 15.25 | learning rate: 9.920E-06 | global batch size:    16 | lm loss: 6.453693E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1893/  128728 | consumed samples:        30288 | consumed tokens:     62029824 | elapsed time per iteration (s): 15.22 | learning rate: 9.925E-06 | global batch size:    16 | lm loss: 6.528496E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1894/  128728 | consumed samples:        30304 | consumed tokens:     62062592 | elapsed time per iteration (s): 15.26 | learning rate: 9.930E-06 | global batch size:    16 | lm loss: 6.691092E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1895/  128728 | consumed samples:        30320 | consumed tokens:     62095360 | elapsed time per iteration (s): 15.23 | learning rate: 9.935E-06 | global batch size:    16 | lm loss: 6.338684E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1896/  128728 | consumed samples:        30336 | consumed tokens:     62128128 | elapsed time per iteration (s): 15.28 | learning rate: 9.941E-06 | global batch size:    16 | lm loss: 6.594782E+00 | grad norm: 0.827 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1897/  128728 | consumed samples:        30352 | consumed tokens:     62160896 | elapsed time per iteration (s): 15.20 | learning rate: 9.946E-06 | global batch size:    16 | lm loss: 6.504727E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1898/  128728 | consumed samples:        30368 | consumed tokens:     62193664 | elapsed time per iteration (s): 15.16 | learning rate: 9.951E-06 | global batch size:    16 | lm loss: 6.835838E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1899/  128728 | consumed samples:        30384 | consumed tokens:     62226432 | elapsed time per iteration (s): 15.25 | learning rate: 9.956E-06 | global batch size:    16 | lm loss: 6.691212E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1900/  128728 | consumed samples:        30400 | consumed tokens:     62259200 | elapsed time per iteration (s): 15.21 | learning rate: 9.961E-06 | global batch size:    16 | lm loss: 6.594204E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1901/  128728 | consumed samples:        30416 | consumed tokens:     62291968 | elapsed time per iteration (s): 15.24 | learning rate: 9.967E-06 | global batch size:    16 | lm loss: 6.573639E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1902/  128728 | consumed samples:        30432 | consumed tokens:     62324736 | elapsed time per iteration (s): 15.20 | learning rate: 9.972E-06 | global batch size:    16 | lm loss: 6.642185E+00 | grad norm: 0.947 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1903/  128728 | consumed samples:        30448 | consumed tokens:     62357504 | elapsed time per iteration (s): 15.20 | learning rate: 9.977E-06 | global batch size:    16 | lm loss: 6.638869E+00 | grad norm: 1.003 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1904/  128728 | consumed samples:        30464 | consumed tokens:     62390272 | elapsed time per iteration (s): 15.18 | learning rate: 9.982E-06 | global batch size:    16 | lm loss: 6.439603E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1905/  128728 | consumed samples:        30480 | consumed tokens:     62423040 | elapsed time per iteration (s): 15.25 | learning rate: 9.988E-06 | global batch size:    16 | lm loss: 6.637027E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1906/  128728 | consumed samples:        30496 | consumed tokens:     62455808 | elapsed time per iteration (s): 15.15 | learning rate: 9.993E-06 | global batch size:    16 | lm loss: 6.455775E+00 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     1907/  128728 | consumed samples:        30512 | consumed tokens:     62488576 | elapsed time per iteration (s): 15.17 | learning rate: 9.998E-06 | global batch size:    16 | lm loss: 6.424469E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1908/  128728 | consumed samples:        30528 | consumed tokens:     62521344 | elapsed time per iteration (s): 15.20 | learning rate: 1.000E-05 | global batch size:    16 | lm loss: 6.547606E+00 | grad norm: 1.458 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1909/  128728 | consumed samples:        30544 | consumed tokens:     62554112 | elapsed time per iteration (s): 15.22 | learning rate: 1.001E-05 | global batch size:    16 | lm loss: 6.466846E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1910/  128728 | consumed samples:        30560 | consumed tokens:     62586880 | elapsed time per iteration (s): 15.24 | learning rate: 1.001E-05 | global batch size:    16 | lm loss: 6.650313E+00 | grad norm: 1.447 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1911/  128728 | consumed samples:        30576 | consumed tokens:     62619648 | elapsed time per iteration (s): 15.22 | learning rate: 1.002E-05 | global batch size:    16 | lm loss: 6.618893E+00 | grad norm: 1.340 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1912/  128728 | consumed samples:        30592 | consumed tokens:     62652416 | elapsed time per iteration (s): 15.21 | learning rate: 1.002E-05 | global batch size:    16 | lm loss: 6.551538E+00 | grad norm: 1.040 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1913/  128728 | consumed samples:        30608 | consumed tokens:     62685184 | elapsed time per iteration (s): 15.26 | learning rate: 1.003E-05 | global batch size:    16 | lm loss: 6.546391E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1914/  128728 | consumed samples:        30624 | consumed tokens:     62717952 | elapsed time per iteration (s): 15.25 | learning rate: 1.003E-05 | global batch size:    16 | lm loss: 6.704463E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1915/  128728 | consumed samples:        30640 | consumed tokens:     62750720 | elapsed time per iteration (s): 15.22 | learning rate: 1.004E-05 | global batch size:    16 | lm loss: 6.473845E+00 | grad norm: 1.044 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1916/  128728 | consumed samples:        30656 | consumed tokens:     62783488 | elapsed time per iteration (s): 15.23 | learning rate: 1.005E-05 | global batch size:    16 | lm loss: 6.693832E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1917/  128728 | consumed samples:        30672 | consumed tokens:     62816256 | elapsed time per iteration (s): 15.21 | learning rate: 1.005E-05 | global batch size:    16 | lm loss: 6.588843E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1918/  128728 | consumed samples:        30688 | consumed tokens:     62849024 | elapsed time per iteration (s): 15.21 | learning rate: 1.006E-05 | global batch size:    16 | lm loss: 6.421237E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1919/  128728 | consumed samples:        30704 | consumed tokens:     62881792 | elapsed time per iteration (s): 15.21 | learning rate: 1.006E-05 | global batch size:    16 | lm loss: 6.483512E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1920/  128728 | consumed samples:        30720 | consumed tokens:     62914560 | elapsed time per iteration (s): 15.19 | learning rate: 1.007E-05 | global batch size:    16 | lm loss: 6.566906E+00 | grad norm: 1.538 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1921/  128728 | consumed samples:        30736 | consumed tokens:     62947328 | elapsed time per iteration (s): 15.21 | learning rate: 1.007E-05 | global batch size:    16 | lm loss: 6.512776E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1922/  128728 | consumed samples:        30752 | consumed tokens:     62980096 | elapsed time per iteration (s): 15.22 | learning rate: 1.008E-05 | global batch size:    16 | lm loss: 6.370068E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1923/  128728 | consumed samples:        30768 | consumed tokens:     63012864 | elapsed time per iteration (s): 15.21 | learning rate: 1.008E-05 | global batch size:    16 | lm loss: 6.588835E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1924/  128728 | consumed samples:        30784 | consumed tokens:     63045632 | elapsed time per iteration (s): 15.21 | learning rate: 1.009E-05 | global batch size:    16 | lm loss: 6.359224E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1925/  128728 | consumed samples:        30800 | consumed tokens:     63078400 | elapsed time per iteration (s): 15.23 | learning rate: 1.009E-05 | global batch size:    16 | lm loss: 6.767123E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1926/  128728 | consumed samples:        30816 | consumed tokens:     63111168 | elapsed time per iteration (s): 15.22 | learning rate: 1.010E-05 | global batch size:    16 | lm loss: 6.477892E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1927/  128728 | consumed samples:        30832 | consumed tokens:     63143936 | elapsed time per iteration (s): 15.15 | learning rate: 1.010E-05 | global batch size:    16 | lm loss: 6.328223E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     1928/  128728 | consumed samples:        30848 | consumed tokens:     63176704 | elapsed time per iteration (s): 15.22 | learning rate: 1.011E-05 | global batch size:    16 | lm loss: 6.486270E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1929/  128728 | consumed samples:        30864 | consumed tokens:     63209472 | elapsed time per iteration (s): 15.21 | learning rate: 1.011E-05 | global batch size:    16 | lm loss: 6.472905E+00 | grad norm: 0.994 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1930/  128728 | consumed samples:        30880 | consumed tokens:     63242240 | elapsed time per iteration (s): 15.22 | learning rate: 1.012E-05 | global batch size:    16 | lm loss: 6.811383E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1931/  128728 | consumed samples:        30896 | consumed tokens:     63275008 | elapsed time per iteration (s): 15.25 | learning rate: 1.012E-05 | global batch size:    16 | lm loss: 6.692072E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1932/  128728 | consumed samples:        30912 | consumed tokens:     63307776 | elapsed time per iteration (s): 15.19 | learning rate: 1.013E-05 | global batch size:    16 | lm loss: 6.495762E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1933/  128728 | consumed samples:        30928 | consumed tokens:     63340544 | elapsed time per iteration (s): 15.21 | learning rate: 1.013E-05 | global batch size:    16 | lm loss: 6.484449E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1934/  128728 | consumed samples:        30944 | consumed tokens:     63373312 | elapsed time per iteration (s): 15.20 | learning rate: 1.014E-05 | global batch size:    16 | lm loss: 6.561663E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1935/  128728 | consumed samples:        30960 | consumed tokens:     63406080 | elapsed time per iteration (s): 15.22 | learning rate: 1.014E-05 | global batch size:    16 | lm loss: 6.621759E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1936/  128728 | consumed samples:        30976 | consumed tokens:     63438848 | elapsed time per iteration (s): 15.22 | learning rate: 1.015E-05 | global batch size:    16 | lm loss: 6.439867E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1937/  128728 | consumed samples:        30992 | consumed tokens:     63471616 | elapsed time per iteration (s): 15.26 | learning rate: 1.016E-05 | global batch size:    16 | lm loss: 6.363036E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1938/  128728 | consumed samples:        31008 | consumed tokens:     63504384 | elapsed time per iteration (s): 15.26 | learning rate: 1.016E-05 | global batch size:    16 | lm loss: 6.514183E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1939/  128728 | consumed samples:        31024 | consumed tokens:     63537152 | elapsed time per iteration (s): 15.26 | learning rate: 1.017E-05 | global batch size:    16 | lm loss: 6.339239E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1940/  128728 | consumed samples:        31040 | consumed tokens:     63569920 | elapsed time per iteration (s): 15.22 | learning rate: 1.017E-05 | global batch size:    16 | lm loss: 6.654146E+00 | grad norm: 0.980 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1941/  128728 | consumed samples:        31056 | consumed tokens:     63602688 | elapsed time per iteration (s): 15.23 | learning rate: 1.018E-05 | global batch size:    16 | lm loss: 6.603597E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1942/  128728 | consumed samples:        31072 | consumed tokens:     63635456 | elapsed time per iteration (s): 15.23 | learning rate: 1.018E-05 | global batch size:    16 | lm loss: 6.599665E+00 | grad norm: 3.648 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1943/  128728 | consumed samples:        31088 | consumed tokens:     63668224 | elapsed time per iteration (s): 15.21 | learning rate: 1.019E-05 | global batch size:    16 | lm loss: 6.663511E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1944/  128728 | consumed samples:        31104 | consumed tokens:     63700992 | elapsed time per iteration (s): 15.21 | learning rate: 1.019E-05 | global batch size:    16 | lm loss: 6.307026E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1945/  128728 | consumed samples:        31120 | consumed tokens:     63733760 | elapsed time per iteration (s): 15.22 | learning rate: 1.020E-05 | global batch size:    16 | lm loss: 6.489582E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1946/  128728 | consumed samples:        31136 | consumed tokens:     63766528 | elapsed time per iteration (s): 15.26 | learning rate: 1.020E-05 | global batch size:    16 | lm loss: 6.788570E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1947/  128728 | consumed samples:        31152 | consumed tokens:     63799296 | elapsed time per iteration (s): 15.24 | learning rate: 1.021E-05 | global batch size:    16 | lm loss: 6.571981E+00 | grad norm: 1.260 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1948/  128728 | consumed samples:        31168 | consumed tokens:     63832064 | elapsed time per iteration (s): 15.21 | learning rate: 1.021E-05 | global batch size:    16 | lm loss: 6.630430E+00 | grad norm: 0.971 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1949/  128728 | consumed samples:        31184 | consumed tokens:     63864832 | elapsed time per iteration (s): 15.21 | learning rate: 1.022E-05 | global batch size:    16 | lm loss: 6.470918E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1950/  128728 | consumed samples:        31200 | consumed tokens:     63897600 | elapsed time per iteration (s): 15.28 | learning rate: 1.022E-05 | global batch size:    16 | lm loss: 6.354256E+00 | grad norm: 1.022 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     1951/  128728 | consumed samples:        31216 | consumed tokens:     63930368 | elapsed time per iteration (s): 15.16 | learning rate: 1.023E-05 | global batch size:    16 | lm loss: 6.493493E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     1952/  128728 | consumed samples:        31232 | consumed tokens:     63963136 | elapsed time per iteration (s): 15.20 | learning rate: 1.023E-05 | global batch size:    16 | lm loss: 6.460168E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1953/  128728 | consumed samples:        31248 | consumed tokens:     63995904 | elapsed time per iteration (s): 15.22 | learning rate: 1.024E-05 | global batch size:    16 | lm loss: 6.540512E+00 | grad norm: 1.009 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1954/  128728 | consumed samples:        31264 | consumed tokens:     64028672 | elapsed time per iteration (s): 15.21 | learning rate: 1.024E-05 | global batch size:    16 | lm loss: 6.298806E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1955/  128728 | consumed samples:        31280 | consumed tokens:     64061440 | elapsed time per iteration (s): 15.23 | learning rate: 1.025E-05 | global batch size:    16 | lm loss: 6.592202E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1956/  128728 | consumed samples:        31296 | consumed tokens:     64094208 | elapsed time per iteration (s): 15.22 | learning rate: 1.026E-05 | global batch size:    16 | lm loss: 6.384544E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1957/  128728 | consumed samples:        31312 | consumed tokens:     64126976 | elapsed time per iteration (s): 15.23 | learning rate: 1.026E-05 | global batch size:    16 | lm loss: 6.758242E+00 | grad norm: 1.276 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1958/  128728 | consumed samples:        31328 | consumed tokens:     64159744 | elapsed time per iteration (s): 15.24 | learning rate: 1.027E-05 | global batch size:    16 | lm loss: 6.602652E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1959/  128728 | consumed samples:        31344 | consumed tokens:     64192512 | elapsed time per iteration (s): 15.20 | learning rate: 1.027E-05 | global batch size:    16 | lm loss: 6.728225E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1960/  128728 | consumed samples:        31360 | consumed tokens:     64225280 | elapsed time per iteration (s): 15.20 | learning rate: 1.028E-05 | global batch size:    16 | lm loss: 6.458584E+00 | grad norm: 0.984 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1961/  128728 | consumed samples:        31376 | consumed tokens:     64258048 | elapsed time per iteration (s): 15.21 | learning rate: 1.028E-05 | global batch size:    16 | lm loss: 6.611272E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1962/  128728 | consumed samples:        31392 | consumed tokens:     64290816 | elapsed time per iteration (s): 15.18 | learning rate: 1.029E-05 | global batch size:    16 | lm loss: 6.663339E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1963/  128728 | consumed samples:        31408 | consumed tokens:     64323584 | elapsed time per iteration (s): 15.24 | learning rate: 1.029E-05 | global batch size:    16 | lm loss: 6.305027E+00 | grad norm: 0.940 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1964/  128728 | consumed samples:        31424 | consumed tokens:     64356352 | elapsed time per iteration (s): 15.24 | learning rate: 1.030E-05 | global batch size:    16 | lm loss: 6.693589E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1965/  128728 | consumed samples:        31440 | consumed tokens:     64389120 | elapsed time per iteration (s): 15.26 | learning rate: 1.030E-05 | global batch size:    16 | lm loss: 6.589158E+00 | grad norm: 1.256 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     1966/  128728 | consumed samples:        31456 | consumed tokens:     64421888 | elapsed time per iteration (s): 15.20 | learning rate: 1.031E-05 | global batch size:    16 | lm loss: 6.519398E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1967/  128728 | consumed samples:        31472 | consumed tokens:     64454656 | elapsed time per iteration (s): 15.23 | learning rate: 1.031E-05 | global batch size:    16 | lm loss: 6.615813E+00 | grad norm: 1.512 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1968/  128728 | consumed samples:        31488 | consumed tokens:     64487424 | elapsed time per iteration (s): 15.24 | learning rate: 1.032E-05 | global batch size:    16 | lm loss: 6.581736E+00 | grad norm: 1.318 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1969/  128728 | consumed samples:        31504 | consumed tokens:     64520192 | elapsed time per iteration (s): 15.24 | learning rate: 1.032E-05 | global batch size:    16 | lm loss: 6.641015E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1970/  128728 | consumed samples:        31520 | consumed tokens:     64552960 | elapsed time per iteration (s): 15.25 | learning rate: 1.033E-05 | global batch size:    16 | lm loss: 6.500915E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1971/  128728 | consumed samples:        31536 | consumed tokens:     64585728 | elapsed time per iteration (s): 15.26 | learning rate: 1.033E-05 | global batch size:    16 | lm loss: 6.305531E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1972/  128728 | consumed samples:        31552 | consumed tokens:     64618496 | elapsed time per iteration (s): 15.23 | learning rate: 1.034E-05 | global batch size:    16 | lm loss: 6.369489E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1973/  128728 | consumed samples:        31568 | consumed tokens:     64651264 | elapsed time per iteration (s): 15.18 | learning rate: 1.034E-05 | global batch size:    16 | lm loss: 6.497954E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     1974/  128728 | consumed samples:        31584 | consumed tokens:     64684032 | elapsed time per iteration (s): 15.21 | learning rate: 1.035E-05 | global batch size:    16 | lm loss: 6.460599E+00 | grad norm: 0.879 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1975/  128728 | consumed samples:        31600 | consumed tokens:     64716800 | elapsed time per iteration (s): 15.20 | learning rate: 1.035E-05 | global batch size:    16 | lm loss: 6.474432E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1976/  128728 | consumed samples:        31616 | consumed tokens:     64749568 | elapsed time per iteration (s): 15.21 | learning rate: 1.036E-05 | global batch size:    16 | lm loss: 6.461910E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1977/  128728 | consumed samples:        31632 | consumed tokens:     64782336 | elapsed time per iteration (s): 15.22 | learning rate: 1.037E-05 | global batch size:    16 | lm loss: 6.431888E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1978/  128728 | consumed samples:        31648 | consumed tokens:     64815104 | elapsed time per iteration (s): 15.22 | learning rate: 1.037E-05 | global batch size:    16 | lm loss: 6.392217E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1979/  128728 | consumed samples:        31664 | consumed tokens:     64847872 | elapsed time per iteration (s): 15.23 | learning rate: 1.038E-05 | global batch size:    16 | lm loss: 6.331327E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1980/  128728 | consumed samples:        31680 | consumed tokens:     64880640 | elapsed time per iteration (s): 15.25 | learning rate: 1.038E-05 | global batch size:    16 | lm loss: 6.728785E+00 | grad norm: 1.497 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1981/  128728 | consumed samples:        31696 | consumed tokens:     64913408 | elapsed time per iteration (s): 15.21 | learning rate: 1.039E-05 | global batch size:    16 | lm loss: 6.497895E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1982/  128728 | consumed samples:        31712 | consumed tokens:     64946176 | elapsed time per iteration (s): 15.24 | learning rate: 1.039E-05 | global batch size:    16 | lm loss: 6.456326E+00 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1983/  128728 | consumed samples:        31728 | consumed tokens:     64978944 | elapsed time per iteration (s): 15.20 | learning rate: 1.040E-05 | global batch size:    16 | lm loss: 6.607990E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1984/  128728 | consumed samples:        31744 | consumed tokens:     65011712 | elapsed time per iteration (s): 15.22 | learning rate: 1.040E-05 | global batch size:    16 | lm loss: 6.539401E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1985/  128728 | consumed samples:        31760 | consumed tokens:     65044480 | elapsed time per iteration (s): 15.21 | learning rate: 1.041E-05 | global batch size:    16 | lm loss: 6.522558E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1986/  128728 | consumed samples:        31776 | consumed tokens:     65077248 | elapsed time per iteration (s): 15.21 | learning rate: 1.041E-05 | global batch size:    16 | lm loss: 6.358567E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     1987/  128728 | consumed samples:        31792 | consumed tokens:     65110016 | elapsed time per iteration (s): 15.22 | learning rate: 1.042E-05 | global batch size:    16 | lm loss: 6.626979E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1988/  128728 | consumed samples:        31808 | consumed tokens:     65142784 | elapsed time per iteration (s): 15.23 | learning rate: 1.042E-05 | global batch size:    16 | lm loss: 6.454780E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     1989/  128728 | consumed samples:        31824 | consumed tokens:     65175552 | elapsed time per iteration (s): 15.23 | learning rate: 1.043E-05 | global batch size:    16 | lm loss: 6.659132E+00 | grad norm: 1.133 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1990/  128728 | consumed samples:        31840 | consumed tokens:     65208320 | elapsed time per iteration (s): 15.23 | learning rate: 1.043E-05 | global batch size:    16 | lm loss: 6.639725E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1991/  128728 | consumed samples:        31856 | consumed tokens:     65241088 | elapsed time per iteration (s): 15.21 | learning rate: 1.044E-05 | global batch size:    16 | lm loss: 6.386582E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     1992/  128728 | consumed samples:        31872 | consumed tokens:     65273856 | elapsed time per iteration (s): 15.25 | learning rate: 1.044E-05 | global batch size:    16 | lm loss: 6.536162E+00 | grad norm: 1.135 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1993/  128728 | consumed samples:        31888 | consumed tokens:     65306624 | elapsed time per iteration (s): 15.20 | learning rate: 1.045E-05 | global batch size:    16 | lm loss: 6.536993E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     1994/  128728 | consumed samples:        31904 | consumed tokens:     65339392 | elapsed time per iteration (s): 15.23 | learning rate: 1.045E-05 | global batch size:    16 | lm loss: 6.499975E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1995/  128728 | consumed samples:        31920 | consumed tokens:     65372160 | elapsed time per iteration (s): 15.23 | learning rate: 1.046E-05 | global batch size:    16 | lm loss: 6.507980E+00 | grad norm: 1.094 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1996/  128728 | consumed samples:        31936 | consumed tokens:     65404928 | elapsed time per iteration (s): 15.24 | learning rate: 1.046E-05 | global batch size:    16 | lm loss: 6.813938E+00 | grad norm: 1.231 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     1997/  128728 | consumed samples:        31952 | consumed tokens:     65437696 | elapsed time per iteration (s): 15.23 | learning rate: 1.047E-05 | global batch size:    16 | lm loss: 6.574802E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     1998/  128728 | consumed samples:        31968 | consumed tokens:     65470464 | elapsed time per iteration (s): 15.25 | learning rate: 1.048E-05 | global batch size:    16 | lm loss: 6.272683E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     1999/  128728 | consumed samples:        31984 | consumed tokens:     65503232 | elapsed time per iteration (s): 15.25 | learning rate: 1.048E-05 | global batch size:    16 | lm loss: 6.425203E+00 | grad norm: 1.376 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default0]:[2022-03-03 14:24:27,588] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=0, lr=[1.0485786843614321e-05, 1.0485786843614321e-05, 1.0485786843614321e-05], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
[default0]:steps: 2000 loss: 6.4689 iter time (s): 13.947 samples/sec: 1.147
[default7]: iteration     2000/  128728 | consumed samples:        32000 | consumed tokens:     65536000 | elapsed time per iteration (s): 15.22 | learning rate: 1.049E-05 | global batch size:    16 | lm loss: 6.468902E+00 | grad norm: 0.971 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default0]:saving checkpoint at iteration    2000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default7]:------------------------------------------------------------------------------------------
[default7]:valid loss at iteration 2000 | lm loss value: 6.755413E+00 | lm loss PPL: 8.586940E+02 | 
[default7]:------------------------------------------------------------------------------------------
[default1]:[2022-03-03 14:24:56,077] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/mp_rank_01_model_states.pt
[default0]:[2022-03-03 14:24:56,084] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/mp_rank_00_model_states.pt
[default1]:[2022-03-03 14:25:04,064] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 14:25:04,138] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 14:25:04,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default0]:[2022-03-03 14:25:04,391] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default1]:[2022-03-03 14:25:04,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default5]:[2022-03-03 14:25:04,427] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default2]:[2022-03-03 14:25:04,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default1]:[2022-03-03 14:25:04,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default2]:[2022-03-03 14:25:04,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 14:25:04,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default2]:[2022-03-03 14:25:04,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default5]:[2022-03-03 14:25:04,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default3]:[2022-03-03 14:25:04,787] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 14:25:04,757] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default4]:[2022-03-03 14:25:04,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 14:25:04,783] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 14:25:04,750] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default0]:[2022-03-03 14:25:04,824] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default4]:[2022-03-03 14:25:04,819] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt
[default7]:[2022-03-03 14:25:04,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt
[default6]:[2022-03-03 14:25:04,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt
[default6]:[2022-03-03 14:25:04,926] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt
[default4]:[2022-03-03 14:25:04,927] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt
[default2]:[2022-03-03 14:25:05,016] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 14:25:05,013] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt
[default5]:[2022-03-03 14:25:05,066] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt
[default7]:[2022-03-03 14:25:05,061] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt
[default1]:[2022-03-03 14:25:05,176] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt
[default7]:[2022-03-03 14:25:05,131] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt
[default5]:[2022-03-03 14:25:05,182] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 14:25:05,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt
[default1]:[2022-03-03 14:25:05,282] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 14:25:05,306] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt
[default6]:[2022-03-03 14:25:05,332] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt
[default4]:[2022-03-03 14:25:05,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 14:25:05,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 14:25:05,687] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt
[default2]:[2022-03-03 14:25:05,712] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 14:25:05,705] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt
[default0]:[2022-03-03 14:25:05,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt
[default0]:[2022-03-03 14:25:06,466] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt
[default0]:[2022-03-03 14:25:06,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt
[default3]:[2022-03-03 14:25:06,491] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt
[default6]:[2022-03-03 14:25:06,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 14:25:06,572] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 14:25:06,624] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt
[default1]:[2022-03-03 14:25:06,808] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt
[default5]:[2022-03-03 14:25:06,764] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt
[default2]:[2022-03-03 14:25:06,889] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 14:25:06,979] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 14:25:07,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt
[default6]:[2022-03-03 14:25:07,364] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt
[default4]:[2022-03-03 14:25:07,371] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt
[default3]:[2022-03-03 14:25:07,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt
[default6]:[2022-03-03 14:25:07,536] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 14:25:07,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt
[default4]:[2022-03-03 14:25:07,516] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt
[default0]:[2022-03-03 14:25:07,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt
[default4]:[2022-03-03 14:25:07,554] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt
[default5]:[2022-03-03 14:25:07,620] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 14:25:07,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt
[default1]:[2022-03-03 14:25:07,625] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 14:25:07,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt
[default3]:[2022-03-03 14:25:07,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt
[default6]:[2022-03-03 14:25:07,750] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt
[default0]:[2022-03-03 14:25:07,845] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt
[default7]:[2022-03-03 14:25:07,849] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt
[default3]:[2022-03-03 14:25:07,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt
[default3]:[2022-03-03 14:25:08,058] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt
[default1]:[2022-03-03 14:25:08,049] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt
[default7]:[2022-03-03 14:25:08,111] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 14:25:08,136] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt
[default4]:[2022-03-03 14:25:08,213] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 14:25:08,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 14:25:08,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt
[default5]:[2022-03-03 14:25:08,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt
[default7]:[2022-03-03 14:25:08,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 14:25:08,634] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt
[default1]:[2022-03-03 14:25:08,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt
[default1]:[2022-03-03 14:25:08,793] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default2]:[2022-03-03 14:25:08,948] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt
[default5]:[2022-03-03 14:25:08,881] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt
[default6]:[2022-03-03 14:25:08,987] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt
[default0]:[2022-03-03 14:25:08,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default3]:[2022-03-03 14:25:08,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt
[default1]:[2022-03-03 14:25:09,016] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 14:25:09,220] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
[default4]:[2022-03-03 14:25:09,279] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 14:25:09,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 14:25:09,316] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 14:25:09,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt
[default0]:[2022-03-03 14:25:09,286] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
[default6]:[2022-03-03 14:25:09,413] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt
[default0]:[2022-03-03 14:25:09,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt
[default2]:[2022-03-03 14:25:09,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt
[default5]:[2022-03-03 14:25:09,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 14:25:09,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt
[default0]:[2022-03-03 14:25:09,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt
[default2]:[2022-03-03 14:25:09,665] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt
[default7]:[2022-03-03 14:25:09,723] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt
[default3]:[2022-03-03 14:25:09,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt
[default4]:[2022-03-03 14:25:09,848] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt
[default2]:[2022-03-03 14:25:09,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt
[default6]:[2022-03-03 14:25:09,961] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt
[default1]:[2022-03-03 14:25:09,929] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default0]:[2022-03-03 14:25:10,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 14:25:09,992] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt
[default2]:[2022-03-03 14:25:09,998] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt
[default0]:[2022-03-03 14:25:10,045] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default2]:[2022-03-03 14:25:10,078] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 14:25:10,119] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
[default3]:[2022-03-03 14:25:10,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt
[default3]:[2022-03-03 14:25:10,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt
[default6]:[2022-03-03 14:25:10,185] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 14:25:10,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 14:25:10,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt
[default2]:[2022-03-03 14:25:10,302] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt
[default0]:[2022-03-03 14:25:10,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt
[default0]:[2022-03-03 14:25:10,407] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt
[default4]:[2022-03-03 14:25:10,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 14:25:10,483] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default7]:[2022-03-03 14:25:10,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt
[default1]:[2022-03-03 14:25:10,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt
[default7]:[2022-03-03 14:25:10,707] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt
[default7]:[2022-03-03 14:25:10,821] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 14:25:10,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 14:25:10,876] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt
[default6]:[2022-03-03 14:25:10,851] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt
[default1]:[2022-03-03 14:25:10,965] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt
[default6]:[2022-03-03 14:25:10,913] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 14:25:10,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt
[default1]:[2022-03-03 14:25:11,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt
[default1]:[2022-03-03 14:25:11,030] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt
[default2]:[2022-03-03 14:25:11,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 14:25:11,146] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 14:25:11,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt
[default1]:[2022-03-03 14:25:11,182] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt
[default7]:[2022-03-03 14:25:11,228] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt
[default4]:[2022-03-03 14:25:11,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt
[default3]:[2022-03-03 14:25:11,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt
[default4]:[2022-03-03 14:25:11,436] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default7]:[2022-03-03 14:25:11,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt
[default5]:[2022-03-03 14:25:11,453] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default5]:[2022-03-03 14:25:11,466] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt
[default5]:[2022-03-03 14:25:11,474] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt
[default5]:[2022-03-03 14:25:11,513] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt
[default1]:[2022-03-03 14:25:11,458] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default2]:[2022-03-03 14:25:11,451] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt
[default5]:[2022-03-03 14:25:11,552] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt
[default0]:[2022-03-03 14:25:11,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default3]:[2022-03-03 14:25:11,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 14:25:11,694] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt
[default2]:[2022-03-03 14:25:11,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt
[default4]:[2022-03-03 14:25:11,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt
[default1]:[2022-03-03 14:25:11,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default1]:[2022-03-03 14:25:11,752] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 14:25:11,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt
[default1]:[2022-03-03 14:25:11,883] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt
[default7]:[2022-03-03 14:25:11,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt
[default2]:[2022-03-03 14:25:11,893] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default2]:[2022-03-03 14:25:11,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default6]:[2022-03-03 14:25:12,003] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 14:25:11,986] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt
[default3]:[2022-03-03 14:25:12,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt
[default2]:[2022-03-03 14:25:12,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt
[default4]:[2022-03-03 14:25:12,043] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 14:25:12,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt
[default0]:[2022-03-03 14:25:12,083] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
[default0]:[2022-03-03 14:25:12,118] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 14:25:12,069] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt
[default2]:[2022-03-03 14:25:12,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default7]:[2022-03-03 14:25:12,180] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt
[default2]:[2022-03-03 14:25:12,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt
[default6]:[2022-03-03 14:25:12,221] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 14:25:12,321] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default4]:[2022-03-03 14:25:12,355] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt
[default0]:[2022-03-03 14:25:12,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default3]:[2022-03-03 14:25:12,522] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 14:25:12,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default2]:[2022-03-03 14:25:12,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt
[default0]:[2022-03-03 14:25:12,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt
[default5]:[2022-03-03 14:25:12,559] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 14:25:12,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 14:25:12,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt
[default4]:[2022-03-03 14:25:12,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 14:25:12,717] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt
[default1]:[2022-03-03 14:25:12,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt
[default1]:[2022-03-03 14:25:12,734] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 14:25:12,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt
[default0]:[2022-03-03 14:25:12,819] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default2]:[2022-03-03 14:25:12,824] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 14:25:12,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt
[default3]:[2022-03-03 14:25:12,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt
[default3]:[2022-03-03 14:25:12,778] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt
[default2]:[2022-03-03 14:25:12,845] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt
[default5]:[2022-03-03 14:25:12,859] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 14:25:12,876] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt
[default2]:[2022-03-03 14:25:12,887] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt
[default3]:[2022-03-03 14:25:12,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default5]:[2022-03-03 14:25:12,944] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt
[default5]:[2022-03-03 14:25:12,998] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default0]:[2022-03-03 14:25:12,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 14:25:12,957] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default6]:[2022-03-03 14:25:13,109] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 14:25:13,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default0]:[2022-03-03 14:25:13,101] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default6]:[2022-03-03 14:25:13,131] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default6]:[2022-03-03 14:25:13,217] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt
[default1]:[2022-03-03 14:25:13,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 14:25:13,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default0]:[2022-03-03 14:25:13,277] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt
[default4]:[2022-03-03 14:25:13,231] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default7]:[2022-03-03 14:25:13,290] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default1]:[2022-03-03 14:25:13,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 14:25:13,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 14:25:13,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 14:25:13,622] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default7]:[2022-03-03 14:25:13,654] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default5]:[2022-03-03 14:25:13,639] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt
[default3]:[2022-03-03 14:25:13,734] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 14:25:13,725] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default5]:[2022-03-03 14:25:13,819] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default7]:[2022-03-03 14:25:13,783] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default6]:[2022-03-03 14:25:13,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default4]:[2022-03-03 14:25:13,884] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt
[default7]:[2022-03-03 14:25:13,870] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt
[default4]:[2022-03-03 14:25:13,823] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default2]:[2022-03-03 14:25:13,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default5]:[2022-03-03 14:25:13,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt
[default7]:[2022-03-03 14:25:13,925] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt
[default3]:[2022-03-03 14:25:13,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default2]:[2022-03-03 14:25:13,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt
[default2]:[2022-03-03 14:25:13,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 14:25:14,059] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 14:25:13,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default0]:[2022-03-03 14:25:14,020] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt
[default6]:[2022-03-03 14:25:14,056] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default2]:[2022-03-03 14:25:14,096] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default1]:[2022-03-03 14:25:14,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 14:25:14,075] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt
[default7]:[2022-03-03 14:25:14,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default3]:[2022-03-03 14:25:14,160] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default5]:[2022-03-03 14:25:14,130] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 14:25:14,216] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default2]:[2022-03-03 14:25:14,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 14:25:14,140] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 14:25:14,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 14:25:14,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt
[default1]:[2022-03-03 14:25:14,311] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt
[default0]:[2022-03-03 14:25:14,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 14:25:14,331] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt
[default6]:[2022-03-03 14:25:14,368] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default7]:[2022-03-03 14:25:14,475] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default1]:[2022-03-03 14:25:14,460] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 14:25:14,523] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 14:25:14,536] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 14:25:14,577] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 14:25:14,531] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt
[default4]:[2022-03-03 14:25:14,644] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 14:25:14,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt
[default5]:[2022-03-03 14:25:14,760] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default7]:[2022-03-03 14:25:14,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default3]:[2022-03-03 14:25:14,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt
[default4]:[2022-03-03 14:25:14,755] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 14:25:14,776] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt
[default4]:[2022-03-03 14:25:14,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default2]:[2022-03-03 14:25:14,804] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default2]:[2022-03-03 14:25:14,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 14:25:14,768] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default5]:[2022-03-03 14:25:14,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default4]:[2022-03-03 14:25:14,888] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 14:25:14,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default1]:[2022-03-03 14:25:14,879] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt
[default7]:[2022-03-03 14:25:14,919] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt
[default6]:[2022-03-03 14:25:14,929] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 14:25:14,951] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 14:25:15,004] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default3]:[2022-03-03 14:25:15,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt
[default0]:[2022-03-03 14:25:15,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default7]:[2022-03-03 14:25:15,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default5]:[2022-03-03 14:25:15,176] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt
[default0]:[2022-03-03 14:25:15,167] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default6]:[2022-03-03 14:25:15,226] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default6]:[2022-03-03 14:25:15,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default6]:[2022-03-03 14:25:15,265] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt
[default7]:[2022-03-03 14:25:15,195] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 14:25:15,289] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt
[default3]:[2022-03-03 14:25:15,325] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 14:25:15,334] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 14:25:15,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default0]:[2022-03-03 14:25:15,405] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt
[default1]:[2022-03-03 14:25:15,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default3]:[2022-03-03 14:25:15,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt
[default6]:[2022-03-03 14:25:15,453] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default2]:[2022-03-03 14:25:15,553] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default7]:[2022-03-03 14:25:15,543] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default7]:[2022-03-03 14:25:15,568] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt
[default4]:[2022-03-03 14:25:15,547] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 14:25:15,566] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt
[default3]:[2022-03-03 14:25:15,620] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 14:25:15,647] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt
[default5]:[2022-03-03 14:25:15,620] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default5]:[2022-03-03 14:25:15,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt
[default4]:[2022-03-03 14:25:15,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
[default1]:[2022-03-03 14:25:15,711] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default4]:[2022-03-03 14:25:15,744] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default4]:[2022-03-03 14:25:15,744] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default1]:[2022-03-03 14:25:15,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default7]:[2022-03-03 14:25:15,890] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt
[default0]:[2022-03-03 14:25:15,880] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default4]:[2022-03-03 14:25:15,875] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default3]:[2022-03-03 14:25:15,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default1]:[2022-03-03 14:25:15,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default3]:[2022-03-03 14:25:16,027] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default3]:[2022-03-03 14:25:15,979] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default6]:[2022-03-03 14:25:15,989] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt
[default0]:[2022-03-03 14:25:16,092] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default3]:[2022-03-03 14:25:16,058] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 14:25:16,205] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default6]:[2022-03-03 14:25:16,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default6]:[2022-03-03 14:25:16,294] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default6]:[2022-03-03 14:25:16,302] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default2]:[2022-03-03 14:25:16,286] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 14:25:16,399] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default3]:[2022-03-03 14:25:16,420] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt
[default1]:[2022-03-03 14:25:16,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default6]:[2022-03-03 14:25:16,509] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default0]:[2022-03-03 14:25:16,582] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default6]:[2022-03-03 14:25:16,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default2]:[2022-03-03 14:25:16,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 14:25:16,632] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default5]:[2022-03-03 14:25:16,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default2]:[2022-03-03 14:25:16,720] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default2]:[2022-03-03 14:25:16,786] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default5]:[2022-03-03 14:25:16,880] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 14:25:16,865] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 14:25:16,982] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default1]:[2022-03-03 14:25:17,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 14:25:17,204] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default7]:[2022-03-03 14:25:17,303] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default1]:[2022-03-03 14:25:17,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt
[default6]:[2022-03-03 14:25:17,465] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt
[default7]:[2022-03-03 14:25:17,519] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt
[default0]:[2022-03-03 14:25:17,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default4]:[2022-03-03 14:25:17,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 14:25:17,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default6]:[2022-03-03 14:25:17,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default7]:[2022-03-03 14:25:18,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default0]:[2022-03-03 14:25:18,108] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default0]:[2022-03-03 14:25:18,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt
[default1]:[2022-03-03 14:25:18,120] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default1]:[2022-03-03 14:25:18,166] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 14:25:18,915] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt
[default5]:[2022-03-03 14:25:18,914] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 14:25:18,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default6]:[2022-03-03 14:25:19,123] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 14:25:19,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default0]:[2022-03-03 14:25:19,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 14:25:19,180] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default2]:[2022-03-03 14:25:19,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 14:25:19,362] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 14:25:19,499] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default3]:[2022-03-03 14:25:19,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 14:25:19,829] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default5]:[2022-03-03 14:25:20,020] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 14:25:20,139] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default4]:[2022-03-03 14:25:20,377] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default5]:[2022-03-03 14:25:20,522] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default7]:[2022-03-03 14:25:20,535] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 14:25:20,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 14:25:20,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default2]:[2022-03-03 14:25:20,721] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default5]:[2022-03-03 14:25:21,334] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default7]:[2022-03-03 14:25:21,646] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 14:25:21,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default7]:[2022-03-03 14:25:21,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default4]:[2022-03-03 14:25:22,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default6]:[2022-03-03 14:25:22,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default4]:[2022-03-03 14:25:22,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt
[default5]:[2022-03-03 14:25:22,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 14:25:22,785] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 14:25:22,873] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default4]:[2022-03-03 14:25:26,948] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default0]:  successfully saved checkpoint at iteration    2000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default7]:time (ms) | save-checkpoint: 38895.48
[default5]:[2022-03-03 14:25:26,963] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default7]: iteration     2001/  128728 | consumed samples:        32016 | consumed tokens:     65568768 | elapsed time per iteration (s): 73.65 | learning rate: 1.049E-05 | global batch size:    16 | lm loss: 6.402064E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.217 | TFLOPs: 1.66 |
[default7]: iteration     2002/  128728 | consumed samples:        32032 | consumed tokens:     65601536 | elapsed time per iteration (s): 15.23 | learning rate: 1.050E-05 | global batch size:    16 | lm loss: 6.597949E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2003/  128728 | consumed samples:        32048 | consumed tokens:     65634304 | elapsed time per iteration (s): 15.23 | learning rate: 1.050E-05 | global batch size:    16 | lm loss: 6.418831E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2004/  128728 | consumed samples:        32064 | consumed tokens:     65667072 | elapsed time per iteration (s): 15.19 | learning rate: 1.051E-05 | global batch size:    16 | lm loss: 6.606805E+00 | grad norm: 1.369 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2005/  128728 | consumed samples:        32080 | consumed tokens:     65699840 | elapsed time per iteration (s): 15.23 | learning rate: 1.051E-05 | global batch size:    16 | lm loss: 6.370827E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2006/  128728 | consumed samples:        32096 | consumed tokens:     65732608 | elapsed time per iteration (s): 15.23 | learning rate: 1.052E-05 | global batch size:    16 | lm loss: 6.308137E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2007/  128728 | consumed samples:        32112 | consumed tokens:     65765376 | elapsed time per iteration (s): 15.21 | learning rate: 1.052E-05 | global batch size:    16 | lm loss: 6.523125E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2008/  128728 | consumed samples:        32128 | consumed tokens:     65798144 | elapsed time per iteration (s): 15.20 | learning rate: 1.053E-05 | global batch size:    16 | lm loss: 6.829843E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2009/  128728 | consumed samples:        32144 | consumed tokens:     65830912 | elapsed time per iteration (s): 15.24 | learning rate: 1.053E-05 | global batch size:    16 | lm loss: 6.465959E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2010/  128728 | consumed samples:        32160 | consumed tokens:     65863680 | elapsed time per iteration (s): 15.23 | learning rate: 1.054E-05 | global batch size:    16 | lm loss: 6.585162E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2011/  128728 | consumed samples:        32176 | consumed tokens:     65896448 | elapsed time per iteration (s): 15.16 | learning rate: 1.054E-05 | global batch size:    16 | lm loss: 6.360588E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2012/  128728 | consumed samples:        32192 | consumed tokens:     65929216 | elapsed time per iteration (s): 15.22 | learning rate: 1.055E-05 | global batch size:    16 | lm loss: 6.488918E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2013/  128728 | consumed samples:        32208 | consumed tokens:     65961984 | elapsed time per iteration (s): 15.19 | learning rate: 1.055E-05 | global batch size:    16 | lm loss: 6.635891E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2014/  128728 | consumed samples:        32224 | consumed tokens:     65994752 | elapsed time per iteration (s): 15.24 | learning rate: 1.056E-05 | global batch size:    16 | lm loss: 6.583560E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2015/  128728 | consumed samples:        32240 | consumed tokens:     66027520 | elapsed time per iteration (s): 15.21 | learning rate: 1.056E-05 | global batch size:    16 | lm loss: 6.287863E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2016/  128728 | consumed samples:        32256 | consumed tokens:     66060288 | elapsed time per iteration (s): 15.19 | learning rate: 1.057E-05 | global batch size:    16 | lm loss: 6.449275E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2017/  128728 | consumed samples:        32272 | consumed tokens:     66093056 | elapsed time per iteration (s): 15.22 | learning rate: 1.057E-05 | global batch size:    16 | lm loss: 6.813572E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2018/  128728 | consumed samples:        32288 | consumed tokens:     66125824 | elapsed time per iteration (s): 15.21 | learning rate: 1.058E-05 | global batch size:    16 | lm loss: 6.464739E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2019/  128728 | consumed samples:        32304 | consumed tokens:     66158592 | elapsed time per iteration (s): 15.20 | learning rate: 1.059E-05 | global batch size:    16 | lm loss: 6.490543E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2020/  128728 | consumed samples:        32320 | consumed tokens:     66191360 | elapsed time per iteration (s): 15.17 | learning rate: 1.059E-05 | global batch size:    16 | lm loss: 6.522612E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2021/  128728 | consumed samples:        32336 | consumed tokens:     66224128 | elapsed time per iteration (s): 15.22 | learning rate: 1.060E-05 | global batch size:    16 | lm loss: 6.463878E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2022/  128728 | consumed samples:        32352 | consumed tokens:     66256896 | elapsed time per iteration (s): 15.22 | learning rate: 1.060E-05 | global batch size:    16 | lm loss: 6.588681E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2023/  128728 | consumed samples:        32368 | consumed tokens:     66289664 | elapsed time per iteration (s): 15.26 | learning rate: 1.061E-05 | global batch size:    16 | lm loss: 6.585972E+00 | grad norm: 1.220 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2024/  128728 | consumed samples:        32384 | consumed tokens:     66322432 | elapsed time per iteration (s): 15.25 | learning rate: 1.061E-05 | global batch size:    16 | lm loss: 6.484285E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2025/  128728 | consumed samples:        32400 | consumed tokens:     66355200 | elapsed time per iteration (s): 15.21 | learning rate: 1.062E-05 | global batch size:    16 | lm loss: 6.319049E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2026/  128728 | consumed samples:        32416 | consumed tokens:     66387968 | elapsed time per iteration (s): 15.23 | learning rate: 1.062E-05 | global batch size:    16 | lm loss: 6.435322E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2027/  128728 | consumed samples:        32432 | consumed tokens:     66420736 | elapsed time per iteration (s): 15.24 | learning rate: 1.063E-05 | global batch size:    16 | lm loss: 6.357363E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2028/  128728 | consumed samples:        32448 | consumed tokens:     66453504 | elapsed time per iteration (s): 15.23 | learning rate: 1.063E-05 | global batch size:    16 | lm loss: 6.541761E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2029/  128728 | consumed samples:        32464 | consumed tokens:     66486272 | elapsed time per iteration (s): 15.23 | learning rate: 1.064E-05 | global batch size:    16 | lm loss: 6.403821E+00 | grad norm: 1.879 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2030/  128728 | consumed samples:        32480 | consumed tokens:     66519040 | elapsed time per iteration (s): 15.21 | learning rate: 1.064E-05 | global batch size:    16 | lm loss: 6.531659E+00 | grad norm: 1.442 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2031/  128728 | consumed samples:        32496 | consumed tokens:     66551808 | elapsed time per iteration (s): 15.24 | learning rate: 1.065E-05 | global batch size:    16 | lm loss: 6.443928E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2032/  128728 | consumed samples:        32512 | consumed tokens:     66584576 | elapsed time per iteration (s): 15.22 | learning rate: 1.065E-05 | global batch size:    16 | lm loss: 6.522864E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2033/  128728 | consumed samples:        32528 | consumed tokens:     66617344 | elapsed time per iteration (s): 15.22 | learning rate: 1.066E-05 | global batch size:    16 | lm loss: 6.443838E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2034/  128728 | consumed samples:        32544 | consumed tokens:     66650112 | elapsed time per iteration (s): 15.23 | learning rate: 1.066E-05 | global batch size:    16 | lm loss: 6.476315E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2035/  128728 | consumed samples:        32560 | consumed tokens:     66682880 | elapsed time per iteration (s): 15.24 | learning rate: 1.067E-05 | global batch size:    16 | lm loss: 6.310287E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2036/  128728 | consumed samples:        32576 | consumed tokens:     66715648 | elapsed time per iteration (s): 15.25 | learning rate: 1.067E-05 | global batch size:    16 | lm loss: 6.401248E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2037/  128728 | consumed samples:        32592 | consumed tokens:     66748416 | elapsed time per iteration (s): 15.23 | learning rate: 1.068E-05 | global batch size:    16 | lm loss: 6.626089E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2038/  128728 | consumed samples:        32608 | consumed tokens:     66781184 | elapsed time per iteration (s): 15.24 | learning rate: 1.069E-05 | global batch size:    16 | lm loss: 6.504237E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2039/  128728 | consumed samples:        32624 | consumed tokens:     66813952 | elapsed time per iteration (s): 15.22 | learning rate: 1.069E-05 | global batch size:    16 | lm loss: 6.728966E+00 | grad norm: 0.975 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2040/  128728 | consumed samples:        32640 | consumed tokens:     66846720 | elapsed time per iteration (s): 15.21 | learning rate: 1.070E-05 | global batch size:    16 | lm loss: 6.668674E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2041/  128728 | consumed samples:        32656 | consumed tokens:     66879488 | elapsed time per iteration (s): 15.22 | learning rate: 1.070E-05 | global batch size:    16 | lm loss: 6.519093E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2042/  128728 | consumed samples:        32672 | consumed tokens:     66912256 | elapsed time per iteration (s): 15.17 | learning rate: 1.071E-05 | global batch size:    16 | lm loss: 6.491826E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     2043/  128728 | consumed samples:        32688 | consumed tokens:     66945024 | elapsed time per iteration (s): 15.20 | learning rate: 1.071E-05 | global batch size:    16 | lm loss: 6.445816E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2044/  128728 | consumed samples:        32704 | consumed tokens:     66977792 | elapsed time per iteration (s): 15.24 | learning rate: 1.072E-05 | global batch size:    16 | lm loss: 6.649926E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2045/  128728 | consumed samples:        32720 | consumed tokens:     67010560 | elapsed time per iteration (s): 15.21 | learning rate: 1.072E-05 | global batch size:    16 | lm loss: 6.410240E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2046/  128728 | consumed samples:        32736 | consumed tokens:     67043328 | elapsed time per iteration (s): 15.25 | learning rate: 1.073E-05 | global batch size:    16 | lm loss: 6.510799E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2047/  128728 | consumed samples:        32752 | consumed tokens:     67076096 | elapsed time per iteration (s): 15.25 | learning rate: 1.073E-05 | global batch size:    16 | lm loss: 6.586518E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2048/  128728 | consumed samples:        32768 | consumed tokens:     67108864 | elapsed time per iteration (s): 15.24 | learning rate: 1.074E-05 | global batch size:    16 | lm loss: 6.675879E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2049/  128728 | consumed samples:        32784 | consumed tokens:     67141632 | elapsed time per iteration (s): 15.17 | learning rate: 1.074E-05 | global batch size:    16 | lm loss: 6.550882E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     2050/  128728 | consumed samples:        32800 | consumed tokens:     67174400 | elapsed time per iteration (s): 15.22 | learning rate: 1.075E-05 | global batch size:    16 | lm loss: 6.626620E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2051/  128728 | consumed samples:        32816 | consumed tokens:     67207168 | elapsed time per iteration (s): 15.18 | learning rate: 1.075E-05 | global batch size:    16 | lm loss: 6.336770E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2052/  128728 | consumed samples:        32832 | consumed tokens:     67239936 | elapsed time per iteration (s): 15.23 | learning rate: 1.076E-05 | global batch size:    16 | lm loss: 6.473155E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2053/  128728 | consumed samples:        32848 | consumed tokens:     67272704 | elapsed time per iteration (s): 15.22 | learning rate: 1.076E-05 | global batch size:    16 | lm loss: 6.799645E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2054/  128728 | consumed samples:        32864 | consumed tokens:     67305472 | elapsed time per iteration (s): 15.21 | learning rate: 1.077E-05 | global batch size:    16 | lm loss: 6.377295E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2055/  128728 | consumed samples:        32880 | consumed tokens:     67338240 | elapsed time per iteration (s): 15.17 | learning rate: 1.077E-05 | global batch size:    16 | lm loss: 6.436339E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     2056/  128728 | consumed samples:        32896 | consumed tokens:     67371008 | elapsed time per iteration (s): 15.21 | learning rate: 1.078E-05 | global batch size:    16 | lm loss: 6.468864E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2057/  128728 | consumed samples:        32912 | consumed tokens:     67403776 | elapsed time per iteration (s): 15.19 | learning rate: 1.078E-05 | global batch size:    16 | lm loss: 6.744889E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2058/  128728 | consumed samples:        32928 | consumed tokens:     67436544 | elapsed time per iteration (s): 15.16 | learning rate: 1.079E-05 | global batch size:    16 | lm loss: 6.324127E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2059/  128728 | consumed samples:        32944 | consumed tokens:     67469312 | elapsed time per iteration (s): 15.26 | learning rate: 1.080E-05 | global batch size:    16 | lm loss: 6.515798E+00 | grad norm: 1.601 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2060/  128728 | consumed samples:        32960 | consumed tokens:     67502080 | elapsed time per iteration (s): 15.29 | learning rate: 1.080E-05 | global batch size:    16 | lm loss: 6.469310E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     2061/  128728 | consumed samples:        32976 | consumed tokens:     67534848 | elapsed time per iteration (s): 15.15 | learning rate: 1.081E-05 | global batch size:    16 | lm loss: 6.753767E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2062/  128728 | consumed samples:        32992 | consumed tokens:     67567616 | elapsed time per iteration (s): 15.24 | learning rate: 1.081E-05 | global batch size:    16 | lm loss: 6.343292E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2063/  128728 | consumed samples:        33008 | consumed tokens:     67600384 | elapsed time per iteration (s): 15.23 | learning rate: 1.082E-05 | global batch size:    16 | lm loss: 6.468085E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2064/  128728 | consumed samples:        33024 | consumed tokens:     67633152 | elapsed time per iteration (s): 15.24 | learning rate: 1.082E-05 | global batch size:    16 | lm loss: 6.474153E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2065/  128728 | consumed samples:        33040 | consumed tokens:     67665920 | elapsed time per iteration (s): 15.24 | learning rate: 1.083E-05 | global batch size:    16 | lm loss: 6.486255E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2066/  128728 | consumed samples:        33056 | consumed tokens:     67698688 | elapsed time per iteration (s): 15.21 | learning rate: 1.083E-05 | global batch size:    16 | lm loss: 6.522897E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2067/  128728 | consumed samples:        33072 | consumed tokens:     67731456 | elapsed time per iteration (s): 15.24 | learning rate: 1.084E-05 | global batch size:    16 | lm loss: 6.500715E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2068/  128728 | consumed samples:        33088 | consumed tokens:     67764224 | elapsed time per iteration (s): 15.19 | learning rate: 1.084E-05 | global batch size:    16 | lm loss: 6.581506E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2069/  128728 | consumed samples:        33104 | consumed tokens:     67796992 | elapsed time per iteration (s): 15.24 | learning rate: 1.085E-05 | global batch size:    16 | lm loss: 6.609797E+00 | grad norm: 0.984 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2070/  128728 | consumed samples:        33120 | consumed tokens:     67829760 | elapsed time per iteration (s): 15.22 | learning rate: 1.085E-05 | global batch size:    16 | lm loss: 6.362771E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2071/  128728 | consumed samples:        33136 | consumed tokens:     67862528 | elapsed time per iteration (s): 15.22 | learning rate: 1.086E-05 | global batch size:    16 | lm loss: 6.606649E+00 | grad norm: 0.987 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2072/  128728 | consumed samples:        33152 | consumed tokens:     67895296 | elapsed time per iteration (s): 15.18 | learning rate: 1.086E-05 | global batch size:    16 | lm loss: 6.527749E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2073/  128728 | consumed samples:        33168 | consumed tokens:     67928064 | elapsed time per iteration (s): 15.16 | learning rate: 1.087E-05 | global batch size:    16 | lm loss: 6.351327E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2074/  128728 | consumed samples:        33184 | consumed tokens:     67960832 | elapsed time per iteration (s): 15.20 | learning rate: 1.087E-05 | global batch size:    16 | lm loss: 6.256629E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2075/  128728 | consumed samples:        33200 | consumed tokens:     67993600 | elapsed time per iteration (s): 15.19 | learning rate: 1.088E-05 | global batch size:    16 | lm loss: 6.681695E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2076/  128728 | consumed samples:        33216 | consumed tokens:     68026368 | elapsed time per iteration (s): 15.20 | learning rate: 1.088E-05 | global batch size:    16 | lm loss: 6.361331E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2077/  128728 | consumed samples:        33232 | consumed tokens:     68059136 | elapsed time per iteration (s): 15.18 | learning rate: 1.089E-05 | global batch size:    16 | lm loss: 6.319121E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2078/  128728 | consumed samples:        33248 | consumed tokens:     68091904 | elapsed time per iteration (s): 15.21 | learning rate: 1.089E-05 | global batch size:    16 | lm loss: 6.285818E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2079/  128728 | consumed samples:        33264 | consumed tokens:     68124672 | elapsed time per iteration (s): 15.22 | learning rate: 1.090E-05 | global batch size:    16 | lm loss: 6.337141E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2080/  128728 | consumed samples:        33280 | consumed tokens:     68157440 | elapsed time per iteration (s): 15.24 | learning rate: 1.091E-05 | global batch size:    16 | lm loss: 6.420028E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2081/  128728 | consumed samples:        33296 | consumed tokens:     68190208 | elapsed time per iteration (s): 15.15 | learning rate: 1.091E-05 | global batch size:    16 | lm loss: 6.409562E+00 | grad norm: 0.962 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2082/  128728 | consumed samples:        33312 | consumed tokens:     68222976 | elapsed time per iteration (s): 15.22 | learning rate: 1.092E-05 | global batch size:    16 | lm loss: 6.588019E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2083/  128728 | consumed samples:        33328 | consumed tokens:     68255744 | elapsed time per iteration (s): 15.23 | learning rate: 1.092E-05 | global batch size:    16 | lm loss: 6.473838E+00 | grad norm: 0.969 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2084/  128728 | consumed samples:        33344 | consumed tokens:     68288512 | elapsed time per iteration (s): 15.20 | learning rate: 1.093E-05 | global batch size:    16 | lm loss: 6.194841E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2085/  128728 | consumed samples:        33360 | consumed tokens:     68321280 | elapsed time per iteration (s): 15.22 | learning rate: 1.093E-05 | global batch size:    16 | lm loss: 6.565664E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2086/  128728 | consumed samples:        33376 | consumed tokens:     68354048 | elapsed time per iteration (s): 15.18 | learning rate: 1.094E-05 | global batch size:    16 | lm loss: 6.302047E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2087/  128728 | consumed samples:        33392 | consumed tokens:     68386816 | elapsed time per iteration (s): 15.20 | learning rate: 1.094E-05 | global batch size:    16 | lm loss: 6.493527E+00 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2088/  128728 | consumed samples:        33408 | consumed tokens:     68419584 | elapsed time per iteration (s): 15.15 | learning rate: 1.095E-05 | global batch size:    16 | lm loss: 6.456075E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2089/  128728 | consumed samples:        33424 | consumed tokens:     68452352 | elapsed time per iteration (s): 15.23 | learning rate: 1.095E-05 | global batch size:    16 | lm loss: 6.435150E+00 | grad norm: 1.070 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2090/  128728 | consumed samples:        33440 | consumed tokens:     68485120 | elapsed time per iteration (s): 15.19 | learning rate: 1.096E-05 | global batch size:    16 | lm loss: 6.466596E+00 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2091/  128728 | consumed samples:        33456 | consumed tokens:     68517888 | elapsed time per iteration (s): 15.21 | learning rate: 1.096E-05 | global batch size:    16 | lm loss: 6.540755E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2092/  128728 | consumed samples:        33472 | consumed tokens:     68550656 | elapsed time per iteration (s): 15.22 | learning rate: 1.097E-05 | global batch size:    16 | lm loss: 6.240240E+00 | grad norm: 1.037 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2093/  128728 | consumed samples:        33488 | consumed tokens:     68583424 | elapsed time per iteration (s): 15.23 | learning rate: 1.097E-05 | global batch size:    16 | lm loss: 6.645574E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2094/  128728 | consumed samples:        33504 | consumed tokens:     68616192 | elapsed time per iteration (s): 15.23 | learning rate: 1.098E-05 | global batch size:    16 | lm loss: 6.572923E+00 | grad norm: 2.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2095/  128728 | consumed samples:        33520 | consumed tokens:     68648960 | elapsed time per iteration (s): 15.24 | learning rate: 1.098E-05 | global batch size:    16 | lm loss: 6.311743E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2096/  128728 | consumed samples:        33536 | consumed tokens:     68681728 | elapsed time per iteration (s): 15.26 | learning rate: 1.099E-05 | global batch size:    16 | lm loss: 6.382287E+00 | grad norm: 1.122 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2097/  128728 | consumed samples:        33552 | consumed tokens:     68714496 | elapsed time per iteration (s): 15.22 | learning rate: 1.099E-05 | global batch size:    16 | lm loss: 6.491906E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2098/  128728 | consumed samples:        33568 | consumed tokens:     68747264 | elapsed time per iteration (s): 15.25 | learning rate: 1.100E-05 | global batch size:    16 | lm loss: 6.311732E+00 | grad norm: 0.658 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2099/  128728 | consumed samples:        33584 | consumed tokens:     68780032 | elapsed time per iteration (s): 15.23 | learning rate: 1.100E-05 | global batch size:    16 | lm loss: 6.513503E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2100/  128728 | consumed samples:        33600 | consumed tokens:     68812800 | elapsed time per iteration (s): 15.22 | learning rate: 1.101E-05 | global batch size:    16 | lm loss: 6.404696E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2101/  128728 | consumed samples:        33616 | consumed tokens:     68845568 | elapsed time per iteration (s): 15.23 | learning rate: 1.102E-05 | global batch size:    16 | lm loss: 6.420318E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2102/  128728 | consumed samples:        33632 | consumed tokens:     68878336 | elapsed time per iteration (s): 15.22 | learning rate: 1.102E-05 | global batch size:    16 | lm loss: 6.275981E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2103/  128728 | consumed samples:        33648 | consumed tokens:     68911104 | elapsed time per iteration (s): 15.22 | learning rate: 1.103E-05 | global batch size:    16 | lm loss: 6.372358E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2104/  128728 | consumed samples:        33664 | consumed tokens:     68943872 | elapsed time per iteration (s): 15.21 | learning rate: 1.103E-05 | global batch size:    16 | lm loss: 6.088941E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2105/  128728 | consumed samples:        33680 | consumed tokens:     68976640 | elapsed time per iteration (s): 15.20 | learning rate: 1.104E-05 | global batch size:    16 | lm loss: 6.542912E+00 | grad norm: 0.862 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2106/  128728 | consumed samples:        33696 | consumed tokens:     69009408 | elapsed time per iteration (s): 15.20 | learning rate: 1.104E-05 | global batch size:    16 | lm loss: 6.359058E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2107/  128728 | consumed samples:        33712 | consumed tokens:     69042176 | elapsed time per iteration (s): 15.24 | learning rate: 1.105E-05 | global batch size:    16 | lm loss: 6.501265E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2108/  128728 | consumed samples:        33728 | consumed tokens:     69074944 | elapsed time per iteration (s): 15.21 | learning rate: 1.105E-05 | global batch size:    16 | lm loss: 6.367177E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2109/  128728 | consumed samples:        33744 | consumed tokens:     69107712 | elapsed time per iteration (s): 15.22 | learning rate: 1.106E-05 | global batch size:    16 | lm loss: 6.246887E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2110/  128728 | consumed samples:        33760 | consumed tokens:     69140480 | elapsed time per iteration (s): 15.23 | learning rate: 1.106E-05 | global batch size:    16 | lm loss: 6.294720E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2111/  128728 | consumed samples:        33776 | consumed tokens:     69173248 | elapsed time per iteration (s): 15.24 | learning rate: 1.107E-05 | global batch size:    16 | lm loss: 6.356379E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2112/  128728 | consumed samples:        33792 | consumed tokens:     69206016 | elapsed time per iteration (s): 15.21 | learning rate: 1.107E-05 | global batch size:    16 | lm loss: 6.442330E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2113/  128728 | consumed samples:        33808 | consumed tokens:     69238784 | elapsed time per iteration (s): 15.22 | learning rate: 1.108E-05 | global batch size:    16 | lm loss: 6.351761E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2114/  128728 | consumed samples:        33824 | consumed tokens:     69271552 | elapsed time per iteration (s): 15.25 | learning rate: 1.108E-05 | global batch size:    16 | lm loss: 6.381479E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2115/  128728 | consumed samples:        33840 | consumed tokens:     69304320 | elapsed time per iteration (s): 15.25 | learning rate: 1.109E-05 | global batch size:    16 | lm loss: 6.759895E+00 | grad norm: 1.024 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2116/  128728 | consumed samples:        33856 | consumed tokens:     69337088 | elapsed time per iteration (s): 15.21 | learning rate: 1.109E-05 | global batch size:    16 | lm loss: 6.386426E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2117/  128728 | consumed samples:        33872 | consumed tokens:     69369856 | elapsed time per iteration (s): 15.18 | learning rate: 1.110E-05 | global batch size:    16 | lm loss: 6.215895E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2118/  128728 | consumed samples:        33888 | consumed tokens:     69402624 | elapsed time per iteration (s): 15.21 | learning rate: 1.110E-05 | global batch size:    16 | lm loss: 6.337823E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2119/  128728 | consumed samples:        33904 | consumed tokens:     69435392 | elapsed time per iteration (s): 15.19 | learning rate: 1.111E-05 | global batch size:    16 | lm loss: 6.306813E+00 | grad norm: 1.303 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2120/  128728 | consumed samples:        33920 | consumed tokens:     69468160 | elapsed time per iteration (s): 15.21 | learning rate: 1.111E-05 | global batch size:    16 | lm loss: 6.559892E+00 | grad norm: 1.107 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2121/  128728 | consumed samples:        33936 | consumed tokens:     69500928 | elapsed time per iteration (s): 15.16 | learning rate: 1.112E-05 | global batch size:    16 | lm loss: 6.418102E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2122/  128728 | consumed samples:        33952 | consumed tokens:     69533696 | elapsed time per iteration (s): 15.17 | learning rate: 1.113E-05 | global batch size:    16 | lm loss: 6.318988E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2123/  128728 | consumed samples:        33968 | consumed tokens:     69566464 | elapsed time per iteration (s): 15.20 | learning rate: 1.113E-05 | global batch size:    16 | lm loss: 6.536681E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2124/  128728 | consumed samples:        33984 | consumed tokens:     69599232 | elapsed time per iteration (s): 15.20 | learning rate: 1.114E-05 | global batch size:    16 | lm loss: 6.491345E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2125/  128728 | consumed samples:        34000 | consumed tokens:     69632000 | elapsed time per iteration (s): 15.20 | learning rate: 1.114E-05 | global batch size:    16 | lm loss: 6.226834E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2126/  128728 | consumed samples:        34016 | consumed tokens:     69664768 | elapsed time per iteration (s): 15.17 | learning rate: 1.115E-05 | global batch size:    16 | lm loss: 6.499043E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2127/  128728 | consumed samples:        34032 | consumed tokens:     69697536 | elapsed time per iteration (s): 15.14 | learning rate: 1.115E-05 | global batch size:    16 | lm loss: 6.487580E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     2128/  128728 | consumed samples:        34048 | consumed tokens:     69730304 | elapsed time per iteration (s): 15.22 | learning rate: 1.116E-05 | global batch size:    16 | lm loss: 6.466814E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2129/  128728 | consumed samples:        34064 | consumed tokens:     69763072 | elapsed time per iteration (s): 15.21 | learning rate: 1.116E-05 | global batch size:    16 | lm loss: 6.457569E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2130/  128728 | consumed samples:        34080 | consumed tokens:     69795840 | elapsed time per iteration (s): 15.16 | learning rate: 1.117E-05 | global batch size:    16 | lm loss: 6.301426E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2131/  128728 | consumed samples:        34096 | consumed tokens:     69828608 | elapsed time per iteration (s): 15.15 | learning rate: 1.117E-05 | global batch size:    16 | lm loss: 6.291666E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2132/  128728 | consumed samples:        34112 | consumed tokens:     69861376 | elapsed time per iteration (s): 15.18 | learning rate: 1.118E-05 | global batch size:    16 | lm loss: 6.416221E+00 | grad norm: 1.176 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2133/  128728 | consumed samples:        34128 | consumed tokens:     69894144 | elapsed time per iteration (s): 15.14 | learning rate: 1.118E-05 | global batch size:    16 | lm loss: 6.413527E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     2134/  128728 | consumed samples:        34144 | consumed tokens:     69926912 | elapsed time per iteration (s): 15.21 | learning rate: 1.119E-05 | global batch size:    16 | lm loss: 6.463112E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2135/  128728 | consumed samples:        34160 | consumed tokens:     69959680 | elapsed time per iteration (s): 15.22 | learning rate: 1.119E-05 | global batch size:    16 | lm loss: 6.217474E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2136/  128728 | consumed samples:        34176 | consumed tokens:     69992448 | elapsed time per iteration (s): 15.22 | learning rate: 1.120E-05 | global batch size:    16 | lm loss: 6.451793E+00 | grad norm: 0.918 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2137/  128728 | consumed samples:        34192 | consumed tokens:     70025216 | elapsed time per iteration (s): 15.21 | learning rate: 1.120E-05 | global batch size:    16 | lm loss: 6.483500E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2138/  128728 | consumed samples:        34208 | consumed tokens:     70057984 | elapsed time per iteration (s): 15.23 | learning rate: 1.121E-05 | global batch size:    16 | lm loss: 6.475822E+00 | grad norm: 1.001 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2139/  128728 | consumed samples:        34224 | consumed tokens:     70090752 | elapsed time per iteration (s): 15.24 | learning rate: 1.121E-05 | global batch size:    16 | lm loss: 6.433506E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2140/  128728 | consumed samples:        34240 | consumed tokens:     70123520 | elapsed time per iteration (s): 15.19 | learning rate: 1.122E-05 | global batch size:    16 | lm loss: 6.512136E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2141/  128728 | consumed samples:        34256 | consumed tokens:     70156288 | elapsed time per iteration (s): 15.22 | learning rate: 1.123E-05 | global batch size:    16 | lm loss: 6.240833E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2142/  128728 | consumed samples:        34272 | consumed tokens:     70189056 | elapsed time per iteration (s): 15.22 | learning rate: 1.123E-05 | global batch size:    16 | lm loss: 6.371235E+00 | grad norm: 0.952 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2143/  128728 | consumed samples:        34288 | consumed tokens:     70221824 | elapsed time per iteration (s): 15.21 | learning rate: 1.124E-05 | global batch size:    16 | lm loss: 6.270912E+00 | grad norm: 1.476 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2144/  128728 | consumed samples:        34304 | consumed tokens:     70254592 | elapsed time per iteration (s): 15.22 | learning rate: 1.124E-05 | global batch size:    16 | lm loss: 6.396859E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2145/  128728 | consumed samples:        34320 | consumed tokens:     70287360 | elapsed time per iteration (s): 15.22 | learning rate: 1.125E-05 | global batch size:    16 | lm loss: 6.418116E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2146/  128728 | consumed samples:        34336 | consumed tokens:     70320128 | elapsed time per iteration (s): 15.25 | learning rate: 1.125E-05 | global batch size:    16 | lm loss: 6.358143E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2147/  128728 | consumed samples:        34352 | consumed tokens:     70352896 | elapsed time per iteration (s): 15.23 | learning rate: 1.126E-05 | global batch size:    16 | lm loss: 6.441215E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2148/  128728 | consumed samples:        34368 | consumed tokens:     70385664 | elapsed time per iteration (s): 15.21 | learning rate: 1.126E-05 | global batch size:    16 | lm loss: 6.519576E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2149/  128728 | consumed samples:        34384 | consumed tokens:     70418432 | elapsed time per iteration (s): 15.25 | learning rate: 1.127E-05 | global batch size:    16 | lm loss: 6.501801E+00 | grad norm: 1.110 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2150/  128728 | consumed samples:        34400 | consumed tokens:     70451200 | elapsed time per iteration (s): 15.22 | learning rate: 1.127E-05 | global batch size:    16 | lm loss: 6.442147E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2151/  128728 | consumed samples:        34416 | consumed tokens:     70483968 | elapsed time per iteration (s): 15.23 | learning rate: 1.128E-05 | global batch size:    16 | lm loss: 6.503671E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2152/  128728 | consumed samples:        34432 | consumed tokens:     70516736 | elapsed time per iteration (s): 15.25 | learning rate: 1.128E-05 | global batch size:    16 | lm loss: 6.525469E+00 | grad norm: 1.080 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2153/  128728 | consumed samples:        34448 | consumed tokens:     70549504 | elapsed time per iteration (s): 15.20 | learning rate: 1.129E-05 | global batch size:    16 | lm loss: 6.287985E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2154/  128728 | consumed samples:        34464 | consumed tokens:     70582272 | elapsed time per iteration (s): 15.16 | learning rate: 1.129E-05 | global batch size:    16 | lm loss: 6.422863E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2155/  128728 | consumed samples:        34480 | consumed tokens:     70615040 | elapsed time per iteration (s): 15.19 | learning rate: 1.130E-05 | global batch size:    16 | lm loss: 6.324255E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2156/  128728 | consumed samples:        34496 | consumed tokens:     70647808 | elapsed time per iteration (s): 15.23 | learning rate: 1.130E-05 | global batch size:    16 | lm loss: 6.228876E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2157/  128728 | consumed samples:        34512 | consumed tokens:     70680576 | elapsed time per iteration (s): 15.24 | learning rate: 1.131E-05 | global batch size:    16 | lm loss: 6.399559E+00 | grad norm: 0.998 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2158/  128728 | consumed samples:        34528 | consumed tokens:     70713344 | elapsed time per iteration (s): 15.20 | learning rate: 1.131E-05 | global batch size:    16 | lm loss: 6.407733E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2159/  128728 | consumed samples:        34544 | consumed tokens:     70746112 | elapsed time per iteration (s): 15.24 | learning rate: 1.132E-05 | global batch size:    16 | lm loss: 6.550023E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2160/  128728 | consumed samples:        34560 | consumed tokens:     70778880 | elapsed time per iteration (s): 15.20 | learning rate: 1.132E-05 | global batch size:    16 | lm loss: 6.425354E+00 | grad norm: 0.911 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2161/  128728 | consumed samples:        34576 | consumed tokens:     70811648 | elapsed time per iteration (s): 15.21 | learning rate: 1.133E-05 | global batch size:    16 | lm loss: 6.268637E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2162/  128728 | consumed samples:        34592 | consumed tokens:     70844416 | elapsed time per iteration (s): 15.22 | learning rate: 1.134E-05 | global batch size:    16 | lm loss: 6.435404E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2163/  128728 | consumed samples:        34608 | consumed tokens:     70877184 | elapsed time per iteration (s): 15.23 | learning rate: 1.134E-05 | global batch size:    16 | lm loss: 6.342821E+00 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2164/  128728 | consumed samples:        34624 | consumed tokens:     70909952 | elapsed time per iteration (s): 15.23 | learning rate: 1.135E-05 | global batch size:    16 | lm loss: 6.411083E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2165/  128728 | consumed samples:        34640 | consumed tokens:     70942720 | elapsed time per iteration (s): 15.21 | learning rate: 1.135E-05 | global batch size:    16 | lm loss: 6.366318E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2166/  128728 | consumed samples:        34656 | consumed tokens:     70975488 | elapsed time per iteration (s): 15.22 | learning rate: 1.136E-05 | global batch size:    16 | lm loss: 6.391248E+00 | grad norm: 0.966 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2167/  128728 | consumed samples:        34672 | consumed tokens:     71008256 | elapsed time per iteration (s): 15.19 | learning rate: 1.136E-05 | global batch size:    16 | lm loss: 6.311316E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2168/  128728 | consumed samples:        34688 | consumed tokens:     71041024 | elapsed time per iteration (s): 15.17 | learning rate: 1.137E-05 | global batch size:    16 | lm loss: 6.601685E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2169/  128728 | consumed samples:        34704 | consumed tokens:     71073792 | elapsed time per iteration (s): 15.21 | learning rate: 1.137E-05 | global batch size:    16 | lm loss: 6.726507E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2170/  128728 | consumed samples:        34720 | consumed tokens:     71106560 | elapsed time per iteration (s): 15.23 | learning rate: 1.138E-05 | global batch size:    16 | lm loss: 6.186215E+00 | grad norm: 0.914 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2171/  128728 | consumed samples:        34736 | consumed tokens:     71139328 | elapsed time per iteration (s): 15.17 | learning rate: 1.138E-05 | global batch size:    16 | lm loss: 6.277987E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     2172/  128728 | consumed samples:        34752 | consumed tokens:     71172096 | elapsed time per iteration (s): 15.21 | learning rate: 1.139E-05 | global batch size:    16 | lm loss: 6.518786E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2173/  128728 | consumed samples:        34768 | consumed tokens:     71204864 | elapsed time per iteration (s): 15.15 | learning rate: 1.139E-05 | global batch size:    16 | lm loss: 6.218275E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2174/  128728 | consumed samples:        34784 | consumed tokens:     71237632 | elapsed time per iteration (s): 15.25 | learning rate: 1.140E-05 | global batch size:    16 | lm loss: 6.684814E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2175/  128728 | consumed samples:        34800 | consumed tokens:     71270400 | elapsed time per iteration (s): 15.22 | learning rate: 1.140E-05 | global batch size:    16 | lm loss: 6.340273E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2176/  128728 | consumed samples:        34816 | consumed tokens:     71303168 | elapsed time per iteration (s): 15.15 | learning rate: 1.141E-05 | global batch size:    16 | lm loss: 6.500647E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2177/  128728 | consumed samples:        34832 | consumed tokens:     71335936 | elapsed time per iteration (s): 15.22 | learning rate: 1.141E-05 | global batch size:    16 | lm loss: 6.369704E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2178/  128728 | consumed samples:        34848 | consumed tokens:     71368704 | elapsed time per iteration (s): 15.21 | learning rate: 1.142E-05 | global batch size:    16 | lm loss: 6.439621E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2179/  128728 | consumed samples:        34864 | consumed tokens:     71401472 | elapsed time per iteration (s): 15.21 | learning rate: 1.142E-05 | global batch size:    16 | lm loss: 6.381093E+00 | grad norm: 1.039 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2180/  128728 | consumed samples:        34880 | consumed tokens:     71434240 | elapsed time per iteration (s): 15.18 | learning rate: 1.143E-05 | global batch size:    16 | lm loss: 6.661847E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2181/  128728 | consumed samples:        34896 | consumed tokens:     71467008 | elapsed time per iteration (s): 15.25 | learning rate: 1.143E-05 | global batch size:    16 | lm loss: 6.390566E+00 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2182/  128728 | consumed samples:        34912 | consumed tokens:     71499776 | elapsed time per iteration (s): 15.23 | learning rate: 1.144E-05 | global batch size:    16 | lm loss: 6.537359E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2183/  128728 | consumed samples:        34928 | consumed tokens:     71532544 | elapsed time per iteration (s): 15.25 | learning rate: 1.145E-05 | global batch size:    16 | lm loss: 6.467527E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2184/  128728 | consumed samples:        34944 | consumed tokens:     71565312 | elapsed time per iteration (s): 15.18 | learning rate: 1.145E-05 | global batch size:    16 | lm loss: 6.535425E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2185/  128728 | consumed samples:        34960 | consumed tokens:     71598080 | elapsed time per iteration (s): 15.24 | learning rate: 1.146E-05 | global batch size:    16 | lm loss: 6.346310E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2186/  128728 | consumed samples:        34976 | consumed tokens:     71630848 | elapsed time per iteration (s): 15.23 | learning rate: 1.146E-05 | global batch size:    16 | lm loss: 6.353755E+00 | grad norm: 1.825 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2187/  128728 | consumed samples:        34992 | consumed tokens:     71663616 | elapsed time per iteration (s): 15.25 | learning rate: 1.147E-05 | global batch size:    16 | lm loss: 6.488267E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2188/  128728 | consumed samples:        35008 | consumed tokens:     71696384 | elapsed time per iteration (s): 15.19 | learning rate: 1.147E-05 | global batch size:    16 | lm loss: 6.271044E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2189/  128728 | consumed samples:        35024 | consumed tokens:     71729152 | elapsed time per iteration (s): 15.22 | learning rate: 1.148E-05 | global batch size:    16 | lm loss: 6.419786E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2190/  128728 | consumed samples:        35040 | consumed tokens:     71761920 | elapsed time per iteration (s): 15.21 | learning rate: 1.148E-05 | global batch size:    16 | lm loss: 6.286393E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2191/  128728 | consumed samples:        35056 | consumed tokens:     71794688 | elapsed time per iteration (s): 15.24 | learning rate: 1.149E-05 | global batch size:    16 | lm loss: 6.343496E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2192/  128728 | consumed samples:        35072 | consumed tokens:     71827456 | elapsed time per iteration (s): 15.21 | learning rate: 1.149E-05 | global batch size:    16 | lm loss: 6.306832E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2193/  128728 | consumed samples:        35088 | consumed tokens:     71860224 | elapsed time per iteration (s): 15.20 | learning rate: 1.150E-05 | global batch size:    16 | lm loss: 6.315264E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2194/  128728 | consumed samples:        35104 | consumed tokens:     71892992 | elapsed time per iteration (s): 15.24 | learning rate: 1.150E-05 | global batch size:    16 | lm loss: 6.208344E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2195/  128728 | consumed samples:        35120 | consumed tokens:     71925760 | elapsed time per iteration (s): 15.22 | learning rate: 1.151E-05 | global batch size:    16 | lm loss: 6.307616E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2196/  128728 | consumed samples:        35136 | consumed tokens:     71958528 | elapsed time per iteration (s): 15.22 | learning rate: 1.151E-05 | global batch size:    16 | lm loss: 6.192717E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2197/  128728 | consumed samples:        35152 | consumed tokens:     71991296 | elapsed time per iteration (s): 15.20 | learning rate: 1.152E-05 | global batch size:    16 | lm loss: 6.418719E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2198/  128728 | consumed samples:        35168 | consumed tokens:     72024064 | elapsed time per iteration (s): 15.19 | learning rate: 1.152E-05 | global batch size:    16 | lm loss: 6.245737E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2199/  128728 | consumed samples:        35184 | consumed tokens:     72056832 | elapsed time per iteration (s): 15.21 | learning rate: 1.153E-05 | global batch size:    16 | lm loss: 6.310443E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2200/  128728 | consumed samples:        35200 | consumed tokens:     72089600 | elapsed time per iteration (s): 15.21 | learning rate: 1.153E-05 | global batch size:    16 | lm loss: 6.745331E+00 | grad norm: 0.852 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2201/  128728 | consumed samples:        35216 | consumed tokens:     72122368 | elapsed time per iteration (s): 15.19 | learning rate: 1.154E-05 | global batch size:    16 | lm loss: 6.420246E+00 | grad norm: 1.406 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2202/  128728 | consumed samples:        35232 | consumed tokens:     72155136 | elapsed time per iteration (s): 15.22 | learning rate: 1.154E-05 | global batch size:    16 | lm loss: 6.487600E+00 | grad norm: 0.883 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2203/  128728 | consumed samples:        35248 | consumed tokens:     72187904 | elapsed time per iteration (s): 15.20 | learning rate: 1.155E-05 | global batch size:    16 | lm loss: 6.501083E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2204/  128728 | consumed samples:        35264 | consumed tokens:     72220672 | elapsed time per iteration (s): 15.22 | learning rate: 1.156E-05 | global batch size:    16 | lm loss: 6.380270E+00 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2205/  128728 | consumed samples:        35280 | consumed tokens:     72253440 | elapsed time per iteration (s): 15.21 | learning rate: 1.156E-05 | global batch size:    16 | lm loss: 6.324718E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2206/  128728 | consumed samples:        35296 | consumed tokens:     72286208 | elapsed time per iteration (s): 15.23 | learning rate: 1.157E-05 | global batch size:    16 | lm loss: 6.390339E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2207/  128728 | consumed samples:        35312 | consumed tokens:     72318976 | elapsed time per iteration (s): 15.21 | learning rate: 1.157E-05 | global batch size:    16 | lm loss: 6.343199E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2208/  128728 | consumed samples:        35328 | consumed tokens:     72351744 | elapsed time per iteration (s): 15.20 | learning rate: 1.158E-05 | global batch size:    16 | lm loss: 6.292582E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2209/  128728 | consumed samples:        35344 | consumed tokens:     72384512 | elapsed time per iteration (s): 15.21 | learning rate: 1.158E-05 | global batch size:    16 | lm loss: 6.351970E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2210/  128728 | consumed samples:        35360 | consumed tokens:     72417280 | elapsed time per iteration (s): 15.26 | learning rate: 1.159E-05 | global batch size:    16 | lm loss: 6.413839E+00 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2211/  128728 | consumed samples:        35376 | consumed tokens:     72450048 | elapsed time per iteration (s): 15.21 | learning rate: 1.159E-05 | global batch size:    16 | lm loss: 6.559430E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2212/  128728 | consumed samples:        35392 | consumed tokens:     72482816 | elapsed time per iteration (s): 15.22 | learning rate: 1.160E-05 | global batch size:    16 | lm loss: 6.109778E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2213/  128728 | consumed samples:        35408 | consumed tokens:     72515584 | elapsed time per iteration (s): 15.25 | learning rate: 1.160E-05 | global batch size:    16 | lm loss: 6.061421E+00 | grad norm: 1.314 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2214/  128728 | consumed samples:        35424 | consumed tokens:     72548352 | elapsed time per iteration (s): 15.22 | learning rate: 1.161E-05 | global batch size:    16 | lm loss: 6.424275E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2215/  128728 | consumed samples:        35440 | consumed tokens:     72581120 | elapsed time per iteration (s): 15.24 | learning rate: 1.161E-05 | global batch size:    16 | lm loss: 6.570379E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2216/  128728 | consumed samples:        35456 | consumed tokens:     72613888 | elapsed time per iteration (s): 15.19 | learning rate: 1.162E-05 | global batch size:    16 | lm loss: 6.441628E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2217/  128728 | consumed samples:        35472 | consumed tokens:     72646656 | elapsed time per iteration (s): 15.23 | learning rate: 1.162E-05 | global batch size:    16 | lm loss: 6.402570E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2218/  128728 | consumed samples:        35488 | consumed tokens:     72679424 | elapsed time per iteration (s): 15.21 | learning rate: 1.163E-05 | global batch size:    16 | lm loss: 6.482116E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2219/  128728 | consumed samples:        35504 | consumed tokens:     72712192 | elapsed time per iteration (s): 15.23 | learning rate: 1.163E-05 | global batch size:    16 | lm loss: 6.316390E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2220/  128728 | consumed samples:        35520 | consumed tokens:     72744960 | elapsed time per iteration (s): 15.24 | learning rate: 1.164E-05 | global batch size:    16 | lm loss: 6.419680E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2221/  128728 | consumed samples:        35536 | consumed tokens:     72777728 | elapsed time per iteration (s): 15.23 | learning rate: 1.164E-05 | global batch size:    16 | lm loss: 6.395838E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2222/  128728 | consumed samples:        35552 | consumed tokens:     72810496 | elapsed time per iteration (s): 15.19 | learning rate: 1.165E-05 | global batch size:    16 | lm loss: 6.283474E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2223/  128728 | consumed samples:        35568 | consumed tokens:     72843264 | elapsed time per iteration (s): 15.21 | learning rate: 1.165E-05 | global batch size:    16 | lm loss: 6.431798E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2224/  128728 | consumed samples:        35584 | consumed tokens:     72876032 | elapsed time per iteration (s): 15.16 | learning rate: 1.166E-05 | global batch size:    16 | lm loss: 6.408734E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2225/  128728 | consumed samples:        35600 | consumed tokens:     72908800 | elapsed time per iteration (s): 15.19 | learning rate: 1.167E-05 | global batch size:    16 | lm loss: 6.471613E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2226/  128728 | consumed samples:        35616 | consumed tokens:     72941568 | elapsed time per iteration (s): 15.22 | learning rate: 1.167E-05 | global batch size:    16 | lm loss: 6.622155E+00 | grad norm: 1.155 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2227/  128728 | consumed samples:        35632 | consumed tokens:     72974336 | elapsed time per iteration (s): 15.19 | learning rate: 1.168E-05 | global batch size:    16 | lm loss: 6.308994E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2228/  128728 | consumed samples:        35648 | consumed tokens:     73007104 | elapsed time per iteration (s): 15.23 | learning rate: 1.168E-05 | global batch size:    16 | lm loss: 6.676260E+00 | grad norm: 1.552 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2229/  128728 | consumed samples:        35664 | consumed tokens:     73039872 | elapsed time per iteration (s): 15.19 | learning rate: 1.169E-05 | global batch size:    16 | lm loss: 6.296388E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2230/  128728 | consumed samples:        35680 | consumed tokens:     73072640 | elapsed time per iteration (s): 15.20 | learning rate: 1.169E-05 | global batch size:    16 | lm loss: 6.460938E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2231/  128728 | consumed samples:        35696 | consumed tokens:     73105408 | elapsed time per iteration (s): 15.21 | learning rate: 1.170E-05 | global batch size:    16 | lm loss: 5.970007E+00 | grad norm: 1.151 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2232/  128728 | consumed samples:        35712 | consumed tokens:     73138176 | elapsed time per iteration (s): 15.21 | learning rate: 1.170E-05 | global batch size:    16 | lm loss: 6.361232E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2233/  128728 | consumed samples:        35728 | consumed tokens:     73170944 | elapsed time per iteration (s): 15.19 | learning rate: 1.171E-05 | global batch size:    16 | lm loss: 6.304492E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2234/  128728 | consumed samples:        35744 | consumed tokens:     73203712 | elapsed time per iteration (s): 15.19 | learning rate: 1.171E-05 | global batch size:    16 | lm loss: 6.355456E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2235/  128728 | consumed samples:        35760 | consumed tokens:     73236480 | elapsed time per iteration (s): 15.21 | learning rate: 1.172E-05 | global batch size:    16 | lm loss: 6.381365E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2236/  128728 | consumed samples:        35776 | consumed tokens:     73269248 | elapsed time per iteration (s): 15.23 | learning rate: 1.172E-05 | global batch size:    16 | lm loss: 6.291452E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2237/  128728 | consumed samples:        35792 | consumed tokens:     73302016 | elapsed time per iteration (s): 15.21 | learning rate: 1.173E-05 | global batch size:    16 | lm loss: 6.277006E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2238/  128728 | consumed samples:        35808 | consumed tokens:     73334784 | elapsed time per iteration (s): 15.22 | learning rate: 1.173E-05 | global batch size:    16 | lm loss: 6.583983E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2239/  128728 | consumed samples:        35824 | consumed tokens:     73367552 | elapsed time per iteration (s): 15.24 | learning rate: 1.174E-05 | global batch size:    16 | lm loss: 6.101068E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2240/  128728 | consumed samples:        35840 | consumed tokens:     73400320 | elapsed time per iteration (s): 15.23 | learning rate: 1.174E-05 | global batch size:    16 | lm loss: 6.559378E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2241/  128728 | consumed samples:        35856 | consumed tokens:     73433088 | elapsed time per iteration (s): 15.23 | learning rate: 1.175E-05 | global batch size:    16 | lm loss: 6.321910E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2242/  128728 | consumed samples:        35872 | consumed tokens:     73465856 | elapsed time per iteration (s): 15.23 | learning rate: 1.175E-05 | global batch size:    16 | lm loss: 6.386539E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2243/  128728 | consumed samples:        35888 | consumed tokens:     73498624 | elapsed time per iteration (s): 15.20 | learning rate: 1.176E-05 | global batch size:    16 | lm loss: 6.276346E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2244/  128728 | consumed samples:        35904 | consumed tokens:     73531392 | elapsed time per iteration (s): 15.19 | learning rate: 1.177E-05 | global batch size:    16 | lm loss: 6.231655E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2245/  128728 | consumed samples:        35920 | consumed tokens:     73564160 | elapsed time per iteration (s): 15.22 | learning rate: 1.177E-05 | global batch size:    16 | lm loss: 6.294347E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2246/  128728 | consumed samples:        35936 | consumed tokens:     73596928 | elapsed time per iteration (s): 15.17 | learning rate: 1.178E-05 | global batch size:    16 | lm loss: 6.297725E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2247/  128728 | consumed samples:        35952 | consumed tokens:     73629696 | elapsed time per iteration (s): 15.20 | learning rate: 1.178E-05 | global batch size:    16 | lm loss: 6.356587E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2248/  128728 | consumed samples:        35968 | consumed tokens:     73662464 | elapsed time per iteration (s): 15.21 | learning rate: 1.179E-05 | global batch size:    16 | lm loss: 6.241110E+00 | grad norm: 1.062 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2249/  128728 | consumed samples:        35984 | consumed tokens:     73695232 | elapsed time per iteration (s): 15.21 | learning rate: 1.179E-05 | global batch size:    16 | lm loss: 6.421347E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2250/  128728 | consumed samples:        36000 | consumed tokens:     73728000 | elapsed time per iteration (s): 15.24 | learning rate: 1.180E-05 | global batch size:    16 | lm loss: 6.559862E+00 | grad norm: 0.926 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2251/  128728 | consumed samples:        36016 | consumed tokens:     73760768 | elapsed time per iteration (s): 15.24 | learning rate: 1.180E-05 | global batch size:    16 | lm loss: 6.585839E+00 | grad norm: 1.510 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2252/  128728 | consumed samples:        36032 | consumed tokens:     73793536 | elapsed time per iteration (s): 15.16 | learning rate: 1.181E-05 | global batch size:    16 | lm loss: 6.288719E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2253/  128728 | consumed samples:        36048 | consumed tokens:     73826304 | elapsed time per iteration (s): 15.18 | learning rate: 1.181E-05 | global batch size:    16 | lm loss: 6.560059E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2254/  128728 | consumed samples:        36064 | consumed tokens:     73859072 | elapsed time per iteration (s): 15.21 | learning rate: 1.182E-05 | global batch size:    16 | lm loss: 6.443803E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2255/  128728 | consumed samples:        36080 | consumed tokens:     73891840 | elapsed time per iteration (s): 15.22 | learning rate: 1.182E-05 | global batch size:    16 | lm loss: 6.219304E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2256/  128728 | consumed samples:        36096 | consumed tokens:     73924608 | elapsed time per iteration (s): 15.21 | learning rate: 1.183E-05 | global batch size:    16 | lm loss: 6.347414E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2257/  128728 | consumed samples:        36112 | consumed tokens:     73957376 | elapsed time per iteration (s): 15.22 | learning rate: 1.183E-05 | global batch size:    16 | lm loss: 6.342593E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2258/  128728 | consumed samples:        36128 | consumed tokens:     73990144 | elapsed time per iteration (s): 15.21 | learning rate: 1.184E-05 | global batch size:    16 | lm loss: 6.316047E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2259/  128728 | consumed samples:        36144 | consumed tokens:     74022912 | elapsed time per iteration (s): 15.21 | learning rate: 1.184E-05 | global batch size:    16 | lm loss: 6.370636E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2260/  128728 | consumed samples:        36160 | consumed tokens:     74055680 | elapsed time per iteration (s): 15.19 | learning rate: 1.185E-05 | global batch size:    16 | lm loss: 6.101759E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2261/  128728 | consumed samples:        36176 | consumed tokens:     74088448 | elapsed time per iteration (s): 15.18 | learning rate: 1.185E-05 | global batch size:    16 | lm loss: 6.264756E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2262/  128728 | consumed samples:        36192 | consumed tokens:     74121216 | elapsed time per iteration (s): 15.23 | learning rate: 1.186E-05 | global batch size:    16 | lm loss: 6.437723E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2263/  128728 | consumed samples:        36208 | consumed tokens:     74153984 | elapsed time per iteration (s): 15.17 | learning rate: 1.186E-05 | global batch size:    16 | lm loss: 6.398685E+00 | grad norm: 1.106 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     2264/  128728 | consumed samples:        36224 | consumed tokens:     74186752 | elapsed time per iteration (s): 15.21 | learning rate: 1.187E-05 | global batch size:    16 | lm loss: 6.381065E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2265/  128728 | consumed samples:        36240 | consumed tokens:     74219520 | elapsed time per iteration (s): 15.25 | learning rate: 1.188E-05 | global batch size:    16 | lm loss: 6.362085E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2266/  128728 | consumed samples:        36256 | consumed tokens:     74252288 | elapsed time per iteration (s): 15.23 | learning rate: 1.188E-05 | global batch size:    16 | lm loss: 6.612569E+00 | grad norm: 1.086 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2267/  128728 | consumed samples:        36272 | consumed tokens:     74285056 | elapsed time per iteration (s): 15.25 | learning rate: 1.189E-05 | global batch size:    16 | lm loss: 6.538249E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2268/  128728 | consumed samples:        36288 | consumed tokens:     74317824 | elapsed time per iteration (s): 15.19 | learning rate: 1.189E-05 | global batch size:    16 | lm loss: 6.117253E+00 | grad norm: 1.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2269/  128728 | consumed samples:        36304 | consumed tokens:     74350592 | elapsed time per iteration (s): 15.16 | learning rate: 1.190E-05 | global batch size:    16 | lm loss: 6.429029E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2270/  128728 | consumed samples:        36320 | consumed tokens:     74383360 | elapsed time per iteration (s): 15.19 | learning rate: 1.190E-05 | global batch size:    16 | lm loss: 6.362271E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2271/  128728 | consumed samples:        36336 | consumed tokens:     74416128 | elapsed time per iteration (s): 15.20 | learning rate: 1.191E-05 | global batch size:    16 | lm loss: 6.516022E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2272/  128728 | consumed samples:        36352 | consumed tokens:     74448896 | elapsed time per iteration (s): 15.17 | learning rate: 1.191E-05 | global batch size:    16 | lm loss: 6.428764E+00 | grad norm: 1.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     2273/  128728 | consumed samples:        36368 | consumed tokens:     74481664 | elapsed time per iteration (s): 15.20 | learning rate: 1.192E-05 | global batch size:    16 | lm loss: 6.327567E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2274/  128728 | consumed samples:        36384 | consumed tokens:     74514432 | elapsed time per iteration (s): 15.16 | learning rate: 1.192E-05 | global batch size:    16 | lm loss: 6.334872E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2275/  128728 | consumed samples:        36400 | consumed tokens:     74547200 | elapsed time per iteration (s): 15.24 | learning rate: 1.193E-05 | global batch size:    16 | lm loss: 6.308464E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2276/  128728 | consumed samples:        36416 | consumed tokens:     74579968 | elapsed time per iteration (s): 15.21 | learning rate: 1.193E-05 | global batch size:    16 | lm loss: 6.263940E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2277/  128728 | consumed samples:        36432 | consumed tokens:     74612736 | elapsed time per iteration (s): 15.25 | learning rate: 1.194E-05 | global batch size:    16 | lm loss: 6.259884E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2278/  128728 | consumed samples:        36448 | consumed tokens:     74645504 | elapsed time per iteration (s): 15.20 | learning rate: 1.194E-05 | global batch size:    16 | lm loss: 6.369345E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2279/  128728 | consumed samples:        36464 | consumed tokens:     74678272 | elapsed time per iteration (s): 15.21 | learning rate: 1.195E-05 | global batch size:    16 | lm loss: 6.319073E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2280/  128728 | consumed samples:        36480 | consumed tokens:     74711040 | elapsed time per iteration (s): 15.15 | learning rate: 1.195E-05 | global batch size:    16 | lm loss: 6.353582E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2281/  128728 | consumed samples:        36496 | consumed tokens:     74743808 | elapsed time per iteration (s): 15.24 | learning rate: 1.196E-05 | global batch size:    16 | lm loss: 6.437267E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2282/  128728 | consumed samples:        36512 | consumed tokens:     74776576 | elapsed time per iteration (s): 15.22 | learning rate: 1.196E-05 | global batch size:    16 | lm loss: 6.291619E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2283/  128728 | consumed samples:        36528 | consumed tokens:     74809344 | elapsed time per iteration (s): 15.23 | learning rate: 1.197E-05 | global batch size:    16 | lm loss: 6.131433E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2284/  128728 | consumed samples:        36544 | consumed tokens:     74842112 | elapsed time per iteration (s): 15.21 | learning rate: 1.197E-05 | global batch size:    16 | lm loss: 6.666663E+00 | grad norm: 0.935 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2285/  128728 | consumed samples:        36560 | consumed tokens:     74874880 | elapsed time per iteration (s): 15.22 | learning rate: 1.198E-05 | global batch size:    16 | lm loss: 6.291386E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2286/  128728 | consumed samples:        36576 | consumed tokens:     74907648 | elapsed time per iteration (s): 15.22 | learning rate: 1.199E-05 | global batch size:    16 | lm loss: 6.249954E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2287/  128728 | consumed samples:        36592 | consumed tokens:     74940416 | elapsed time per iteration (s): 15.23 | learning rate: 1.199E-05 | global batch size:    16 | lm loss: 6.303566E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2288/  128728 | consumed samples:        36608 | consumed tokens:     74973184 | elapsed time per iteration (s): 15.18 | learning rate: 1.200E-05 | global batch size:    16 | lm loss: 6.470012E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2289/  128728 | consumed samples:        36624 | consumed tokens:     75005952 | elapsed time per iteration (s): 15.22 | learning rate: 1.200E-05 | global batch size:    16 | lm loss: 6.342841E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2290/  128728 | consumed samples:        36640 | consumed tokens:     75038720 | elapsed time per iteration (s): 15.20 | learning rate: 1.201E-05 | global batch size:    16 | lm loss: 6.390831E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2291/  128728 | consumed samples:        36656 | consumed tokens:     75071488 | elapsed time per iteration (s): 15.21 | learning rate: 1.201E-05 | global batch size:    16 | lm loss: 6.325696E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2292/  128728 | consumed samples:        36672 | consumed tokens:     75104256 | elapsed time per iteration (s): 15.19 | learning rate: 1.202E-05 | global batch size:    16 | lm loss: 6.365410E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2293/  128728 | consumed samples:        36688 | consumed tokens:     75137024 | elapsed time per iteration (s): 15.24 | learning rate: 1.202E-05 | global batch size:    16 | lm loss: 6.209689E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2294/  128728 | consumed samples:        36704 | consumed tokens:     75169792 | elapsed time per iteration (s): 15.23 | learning rate: 1.203E-05 | global batch size:    16 | lm loss: 6.228543E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2295/  128728 | consumed samples:        36720 | consumed tokens:     75202560 | elapsed time per iteration (s): 15.22 | learning rate: 1.203E-05 | global batch size:    16 | lm loss: 6.475939E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2296/  128728 | consumed samples:        36736 | consumed tokens:     75235328 | elapsed time per iteration (s): 15.16 | learning rate: 1.204E-05 | global batch size:    16 | lm loss: 6.304015E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2297/  128728 | consumed samples:        36752 | consumed tokens:     75268096 | elapsed time per iteration (s): 15.20 | learning rate: 1.204E-05 | global batch size:    16 | lm loss: 6.079097E+00 | grad norm: 0.947 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2298/  128728 | consumed samples:        36768 | consumed tokens:     75300864 | elapsed time per iteration (s): 15.20 | learning rate: 1.205E-05 | global batch size:    16 | lm loss: 6.455893E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2299/  128728 | consumed samples:        36784 | consumed tokens:     75333632 | elapsed time per iteration (s): 15.22 | learning rate: 1.205E-05 | global batch size:    16 | lm loss: 6.587708E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2300/  128728 | consumed samples:        36800 | consumed tokens:     75366400 | elapsed time per iteration (s): 15.17 | learning rate: 1.206E-05 | global batch size:    16 | lm loss: 6.351872E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     2301/  128728 | consumed samples:        36816 | consumed tokens:     75399168 | elapsed time per iteration (s): 15.22 | learning rate: 1.206E-05 | global batch size:    16 | lm loss: 6.254686E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2302/  128728 | consumed samples:        36832 | consumed tokens:     75431936 | elapsed time per iteration (s): 15.26 | learning rate: 1.207E-05 | global batch size:    16 | lm loss: 6.451199E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2303/  128728 | consumed samples:        36848 | consumed tokens:     75464704 | elapsed time per iteration (s): 15.22 | learning rate: 1.207E-05 | global batch size:    16 | lm loss: 6.339918E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2304/  128728 | consumed samples:        36864 | consumed tokens:     75497472 | elapsed time per iteration (s): 15.20 | learning rate: 1.208E-05 | global batch size:    16 | lm loss: 6.407025E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2305/  128728 | consumed samples:        36880 | consumed tokens:     75530240 | elapsed time per iteration (s): 15.20 | learning rate: 1.208E-05 | global batch size:    16 | lm loss: 6.282629E+00 | grad norm: 1.452 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2306/  128728 | consumed samples:        36896 | consumed tokens:     75563008 | elapsed time per iteration (s): 15.18 | learning rate: 1.209E-05 | global batch size:    16 | lm loss: 6.182392E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2307/  128728 | consumed samples:        36912 | consumed tokens:     75595776 | elapsed time per iteration (s): 15.20 | learning rate: 1.210E-05 | global batch size:    16 | lm loss: 6.384602E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2308/  128728 | consumed samples:        36928 | consumed tokens:     75628544 | elapsed time per iteration (s): 15.27 | learning rate: 1.210E-05 | global batch size:    16 | lm loss: 6.367940E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     2309/  128728 | consumed samples:        36944 | consumed tokens:     75661312 | elapsed time per iteration (s): 15.26 | learning rate: 1.211E-05 | global batch size:    16 | lm loss: 6.468973E+00 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2310/  128728 | consumed samples:        36960 | consumed tokens:     75694080 | elapsed time per iteration (s): 15.17 | learning rate: 1.211E-05 | global batch size:    16 | lm loss: 6.521853E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2311/  128728 | consumed samples:        36976 | consumed tokens:     75726848 | elapsed time per iteration (s): 15.25 | learning rate: 1.212E-05 | global batch size:    16 | lm loss: 6.229692E+00 | grad norm: 1.296 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2312/  128728 | consumed samples:        36992 | consumed tokens:     75759616 | elapsed time per iteration (s): 15.24 | learning rate: 1.212E-05 | global batch size:    16 | lm loss: 6.317523E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2313/  128728 | consumed samples:        37008 | consumed tokens:     75792384 | elapsed time per iteration (s): 15.19 | learning rate: 1.213E-05 | global batch size:    16 | lm loss: 6.391045E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2314/  128728 | consumed samples:        37024 | consumed tokens:     75825152 | elapsed time per iteration (s): 15.24 | learning rate: 1.213E-05 | global batch size:    16 | lm loss: 6.241301E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2315/  128728 | consumed samples:        37040 | consumed tokens:     75857920 | elapsed time per iteration (s): 15.23 | learning rate: 1.214E-05 | global batch size:    16 | lm loss: 6.358777E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2316/  128728 | consumed samples:        37056 | consumed tokens:     75890688 | elapsed time per iteration (s): 15.21 | learning rate: 1.214E-05 | global batch size:    16 | lm loss: 5.995783E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2317/  128728 | consumed samples:        37072 | consumed tokens:     75923456 | elapsed time per iteration (s): 15.22 | learning rate: 1.215E-05 | global batch size:    16 | lm loss: 6.135524E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2318/  128728 | consumed samples:        37088 | consumed tokens:     75956224 | elapsed time per iteration (s): 15.26 | learning rate: 1.215E-05 | global batch size:    16 | lm loss: 6.258219E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2319/  128728 | consumed samples:        37104 | consumed tokens:     75988992 | elapsed time per iteration (s): 15.21 | learning rate: 1.216E-05 | global batch size:    16 | lm loss: 6.189133E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2320/  128728 | consumed samples:        37120 | consumed tokens:     76021760 | elapsed time per iteration (s): 15.24 | learning rate: 1.216E-05 | global batch size:    16 | lm loss: 6.054904E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2321/  128728 | consumed samples:        37136 | consumed tokens:     76054528 | elapsed time per iteration (s): 15.23 | learning rate: 1.217E-05 | global batch size:    16 | lm loss: 6.221347E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2322/  128728 | consumed samples:        37152 | consumed tokens:     76087296 | elapsed time per iteration (s): 15.22 | learning rate: 1.217E-05 | global batch size:    16 | lm loss: 6.249143E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2323/  128728 | consumed samples:        37168 | consumed tokens:     76120064 | elapsed time per iteration (s): 15.14 | learning rate: 1.218E-05 | global batch size:    16 | lm loss: 6.319250E+00 | grad norm: 0.901 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2324/  128728 | consumed samples:        37184 | consumed tokens:     76152832 | elapsed time per iteration (s): 15.16 | learning rate: 1.218E-05 | global batch size:    16 | lm loss: 6.291220E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2325/  128728 | consumed samples:        37200 | consumed tokens:     76185600 | elapsed time per iteration (s): 15.27 | learning rate: 1.219E-05 | global batch size:    16 | lm loss: 6.504467E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     2326/  128728 | consumed samples:        37216 | consumed tokens:     76218368 | elapsed time per iteration (s): 15.22 | learning rate: 1.219E-05 | global batch size:    16 | lm loss: 6.271231E+00 | grad norm: 0.978 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2327/  128728 | consumed samples:        37232 | consumed tokens:     76251136 | elapsed time per iteration (s): 15.23 | learning rate: 1.220E-05 | global batch size:    16 | lm loss: 6.325077E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2328/  128728 | consumed samples:        37248 | consumed tokens:     76283904 | elapsed time per iteration (s): 15.19 | learning rate: 1.221E-05 | global batch size:    16 | lm loss: 6.327701E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2329/  128728 | consumed samples:        37264 | consumed tokens:     76316672 | elapsed time per iteration (s): 15.24 | learning rate: 1.221E-05 | global batch size:    16 | lm loss: 6.261258E+00 | grad norm: 1.054 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2330/  128728 | consumed samples:        37280 | consumed tokens:     76349440 | elapsed time per iteration (s): 15.23 | learning rate: 1.222E-05 | global batch size:    16 | lm loss: 6.183227E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2331/  128728 | consumed samples:        37296 | consumed tokens:     76382208 | elapsed time per iteration (s): 15.22 | learning rate: 1.222E-05 | global batch size:    16 | lm loss: 6.479833E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2332/  128728 | consumed samples:        37312 | consumed tokens:     76414976 | elapsed time per iteration (s): 15.21 | learning rate: 1.223E-05 | global batch size:    16 | lm loss: 6.230041E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2333/  128728 | consumed samples:        37328 | consumed tokens:     76447744 | elapsed time per iteration (s): 15.21 | learning rate: 1.223E-05 | global batch size:    16 | lm loss: 6.235174E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2334/  128728 | consumed samples:        37344 | consumed tokens:     76480512 | elapsed time per iteration (s): 15.23 | learning rate: 1.224E-05 | global batch size:    16 | lm loss: 6.161546E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2335/  128728 | consumed samples:        37360 | consumed tokens:     76513280 | elapsed time per iteration (s): 15.18 | learning rate: 1.224E-05 | global batch size:    16 | lm loss: 6.356600E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2336/  128728 | consumed samples:        37376 | consumed tokens:     76546048 | elapsed time per iteration (s): 15.30 | learning rate: 1.225E-05 | global batch size:    16 | lm loss: 6.117655E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     2337/  128728 | consumed samples:        37392 | consumed tokens:     76578816 | elapsed time per iteration (s): 15.25 | learning rate: 1.225E-05 | global batch size:    16 | lm loss: 6.201807E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2338/  128728 | consumed samples:        37408 | consumed tokens:     76611584 | elapsed time per iteration (s): 15.25 | learning rate: 1.226E-05 | global batch size:    16 | lm loss: 6.379524E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2339/  128728 | consumed samples:        37424 | consumed tokens:     76644352 | elapsed time per iteration (s): 15.19 | learning rate: 1.226E-05 | global batch size:    16 | lm loss: 6.343836E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2340/  128728 | consumed samples:        37440 | consumed tokens:     76677120 | elapsed time per iteration (s): 15.24 | learning rate: 1.227E-05 | global batch size:    16 | lm loss: 6.259268E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2341/  128728 | consumed samples:        37456 | consumed tokens:     76709888 | elapsed time per iteration (s): 15.27 | learning rate: 1.227E-05 | global batch size:    16 | lm loss: 6.352734E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     2342/  128728 | consumed samples:        37472 | consumed tokens:     76742656 | elapsed time per iteration (s): 15.26 | learning rate: 1.228E-05 | global batch size:    16 | lm loss: 6.273996E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2343/  128728 | consumed samples:        37488 | consumed tokens:     76775424 | elapsed time per iteration (s): 15.21 | learning rate: 1.228E-05 | global batch size:    16 | lm loss: 6.341601E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2344/  128728 | consumed samples:        37504 | consumed tokens:     76808192 | elapsed time per iteration (s): 15.23 | learning rate: 1.229E-05 | global batch size:    16 | lm loss: 6.245977E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2345/  128728 | consumed samples:        37520 | consumed tokens:     76840960 | elapsed time per iteration (s): 15.24 | learning rate: 1.229E-05 | global batch size:    16 | lm loss: 6.364078E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2346/  128728 | consumed samples:        37536 | consumed tokens:     76873728 | elapsed time per iteration (s): 15.27 | learning rate: 1.230E-05 | global batch size:    16 | lm loss: 6.206321E+00 | grad norm: 1.079 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     2347/  128728 | consumed samples:        37552 | consumed tokens:     76906496 | elapsed time per iteration (s): 15.24 | learning rate: 1.231E-05 | global batch size:    16 | lm loss: 6.211255E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2348/  128728 | consumed samples:        37568 | consumed tokens:     76939264 | elapsed time per iteration (s): 15.17 | learning rate: 1.231E-05 | global batch size:    16 | lm loss: 6.149254E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2349/  128728 | consumed samples:        37584 | consumed tokens:     76972032 | elapsed time per iteration (s): 15.22 | learning rate: 1.232E-05 | global batch size:    16 | lm loss: 6.326467E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2350/  128728 | consumed samples:        37600 | consumed tokens:     77004800 | elapsed time per iteration (s): 15.21 | learning rate: 1.232E-05 | global batch size:    16 | lm loss: 6.407173E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2351/  128728 | consumed samples:        37616 | consumed tokens:     77037568 | elapsed time per iteration (s): 15.24 | learning rate: 1.233E-05 | global batch size:    16 | lm loss: 6.484642E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2352/  128728 | consumed samples:        37632 | consumed tokens:     77070336 | elapsed time per iteration (s): 15.23 | learning rate: 1.233E-05 | global batch size:    16 | lm loss: 6.147301E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2353/  128728 | consumed samples:        37648 | consumed tokens:     77103104 | elapsed time per iteration (s): 15.16 | learning rate: 1.234E-05 | global batch size:    16 | lm loss: 6.507727E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2354/  128728 | consumed samples:        37664 | consumed tokens:     77135872 | elapsed time per iteration (s): 15.25 | learning rate: 1.234E-05 | global batch size:    16 | lm loss: 6.197268E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2355/  128728 | consumed samples:        37680 | consumed tokens:     77168640 | elapsed time per iteration (s): 15.22 | learning rate: 1.235E-05 | global batch size:    16 | lm loss: 6.132536E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2356/  128728 | consumed samples:        37696 | consumed tokens:     77201408 | elapsed time per iteration (s): 15.25 | learning rate: 1.235E-05 | global batch size:    16 | lm loss: 6.288426E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2357/  128728 | consumed samples:        37712 | consumed tokens:     77234176 | elapsed time per iteration (s): 15.23 | learning rate: 1.236E-05 | global batch size:    16 | lm loss: 6.204188E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2358/  128728 | consumed samples:        37728 | consumed tokens:     77266944 | elapsed time per iteration (s): 15.20 | learning rate: 1.236E-05 | global batch size:    16 | lm loss: 6.382045E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2359/  128728 | consumed samples:        37744 | consumed tokens:     77299712 | elapsed time per iteration (s): 15.19 | learning rate: 1.237E-05 | global batch size:    16 | lm loss: 6.236710E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2360/  128728 | consumed samples:        37760 | consumed tokens:     77332480 | elapsed time per iteration (s): 15.23 | learning rate: 1.237E-05 | global batch size:    16 | lm loss: 6.214093E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2361/  128728 | consumed samples:        37776 | consumed tokens:     77365248 | elapsed time per iteration (s): 15.24 | learning rate: 1.238E-05 | global batch size:    16 | lm loss: 6.367528E+00 | grad norm: 1.105 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2362/  128728 | consumed samples:        37792 | consumed tokens:     77398016 | elapsed time per iteration (s): 15.22 | learning rate: 1.238E-05 | global batch size:    16 | lm loss: 6.165552E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2363/  128728 | consumed samples:        37808 | consumed tokens:     77430784 | elapsed time per iteration (s): 15.20 | learning rate: 1.239E-05 | global batch size:    16 | lm loss: 6.096678E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2364/  128728 | consumed samples:        37824 | consumed tokens:     77463552 | elapsed time per iteration (s): 15.21 | learning rate: 1.239E-05 | global batch size:    16 | lm loss: 6.199354E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2365/  128728 | consumed samples:        37840 | consumed tokens:     77496320 | elapsed time per iteration (s): 15.20 | learning rate: 1.240E-05 | global batch size:    16 | lm loss: 6.324061E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2366/  128728 | consumed samples:        37856 | consumed tokens:     77529088 | elapsed time per iteration (s): 15.25 | learning rate: 1.240E-05 | global batch size:    16 | lm loss: 6.271080E+00 | grad norm: 1.050 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2367/  128728 | consumed samples:        37872 | consumed tokens:     77561856 | elapsed time per iteration (s): 15.24 | learning rate: 1.241E-05 | global batch size:    16 | lm loss: 6.355011E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2368/  128728 | consumed samples:        37888 | consumed tokens:     77594624 | elapsed time per iteration (s): 15.23 | learning rate: 1.242E-05 | global batch size:    16 | lm loss: 6.325652E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2369/  128728 | consumed samples:        37904 | consumed tokens:     77627392 | elapsed time per iteration (s): 15.25 | learning rate: 1.242E-05 | global batch size:    16 | lm loss: 6.347246E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2370/  128728 | consumed samples:        37920 | consumed tokens:     77660160 | elapsed time per iteration (s): 15.23 | learning rate: 1.243E-05 | global batch size:    16 | lm loss: 6.249259E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2371/  128728 | consumed samples:        37936 | consumed tokens:     77692928 | elapsed time per iteration (s): 15.22 | learning rate: 1.243E-05 | global batch size:    16 | lm loss: 6.194085E+00 | grad norm: 0.663 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2372/  128728 | consumed samples:        37952 | consumed tokens:     77725696 | elapsed time per iteration (s): 15.25 | learning rate: 1.244E-05 | global batch size:    16 | lm loss: 6.347694E+00 | grad norm: 1.007 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2373/  128728 | consumed samples:        37968 | consumed tokens:     77758464 | elapsed time per iteration (s): 15.24 | learning rate: 1.244E-05 | global batch size:    16 | lm loss: 6.205462E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2374/  128728 | consumed samples:        37984 | consumed tokens:     77791232 | elapsed time per iteration (s): 15.18 | learning rate: 1.245E-05 | global batch size:    16 | lm loss: 6.502585E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2375/  128728 | consumed samples:        38000 | consumed tokens:     77824000 | elapsed time per iteration (s): 15.14 | learning rate: 1.245E-05 | global batch size:    16 | lm loss: 6.250050E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2376/  128728 | consumed samples:        38016 | consumed tokens:     77856768 | elapsed time per iteration (s): 15.22 | learning rate: 1.246E-05 | global batch size:    16 | lm loss: 6.314306E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2377/  128728 | consumed samples:        38032 | consumed tokens:     77889536 | elapsed time per iteration (s): 15.19 | learning rate: 1.246E-05 | global batch size:    16 | lm loss: 6.403317E+00 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2378/  128728 | consumed samples:        38048 | consumed tokens:     77922304 | elapsed time per iteration (s): 15.17 | learning rate: 1.247E-05 | global batch size:    16 | lm loss: 6.022367E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2379/  128728 | consumed samples:        38064 | consumed tokens:     77955072 | elapsed time per iteration (s): 15.25 | learning rate: 1.247E-05 | global batch size:    16 | lm loss: 6.283350E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2380/  128728 | consumed samples:        38080 | consumed tokens:     77987840 | elapsed time per iteration (s): 15.20 | learning rate: 1.248E-05 | global batch size:    16 | lm loss: 6.473180E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2381/  128728 | consumed samples:        38096 | consumed tokens:     78020608 | elapsed time per iteration (s): 15.23 | learning rate: 1.248E-05 | global batch size:    16 | lm loss: 6.159327E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2382/  128728 | consumed samples:        38112 | consumed tokens:     78053376 | elapsed time per iteration (s): 15.26 | learning rate: 1.249E-05 | global batch size:    16 | lm loss: 6.267680E+00 | grad norm: 2.109 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2383/  128728 | consumed samples:        38128 | consumed tokens:     78086144 | elapsed time per iteration (s): 15.25 | learning rate: 1.249E-05 | global batch size:    16 | lm loss: 6.295687E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2384/  128728 | consumed samples:        38144 | consumed tokens:     78118912 | elapsed time per iteration (s): 15.21 | learning rate: 1.250E-05 | global batch size:    16 | lm loss: 6.494872E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2385/  128728 | consumed samples:        38160 | consumed tokens:     78151680 | elapsed time per iteration (s): 15.23 | learning rate: 1.250E-05 | global batch size:    16 | lm loss: 6.360726E+00 | grad norm: 1.138 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2386/  128728 | consumed samples:        38176 | consumed tokens:     78184448 | elapsed time per iteration (s): 15.25 | learning rate: 1.251E-05 | global batch size:    16 | lm loss: 6.168993E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     2387/  128728 | consumed samples:        38192 | consumed tokens:     78217216 | elapsed time per iteration (s): 15.24 | learning rate: 1.251E-05 | global batch size:    16 | lm loss: 6.293113E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2388/  128728 | consumed samples:        38208 | consumed tokens:     78249984 | elapsed time per iteration (s): 15.28 | learning rate: 1.252E-05 | global batch size:    16 | lm loss: 6.249125E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     2389/  128728 | consumed samples:        38224 | consumed tokens:     78282752 | elapsed time per iteration (s): 15.25 | learning rate: 1.253E-05 | global batch size:    16 | lm loss: 6.292615E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2390/  128728 | consumed samples:        38240 | consumed tokens:     78315520 | elapsed time per iteration (s): 15.19 | learning rate: 1.253E-05 | global batch size:    16 | lm loss: 6.254043E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2391/  128728 | consumed samples:        38256 | consumed tokens:     78348288 | elapsed time per iteration (s): 15.26 | learning rate: 1.254E-05 | global batch size:    16 | lm loss: 6.146667E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2392/  128728 | consumed samples:        38272 | consumed tokens:     78381056 | elapsed time per iteration (s): 15.25 | learning rate: 1.254E-05 | global batch size:    16 | lm loss: 6.178244E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2393/  128728 | consumed samples:        38288 | consumed tokens:     78413824 | elapsed time per iteration (s): 15.22 | learning rate: 1.255E-05 | global batch size:    16 | lm loss: 6.237836E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2394/  128728 | consumed samples:        38304 | consumed tokens:     78446592 | elapsed time per iteration (s): 15.21 | learning rate: 1.255E-05 | global batch size:    16 | lm loss: 6.309330E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2395/  128728 | consumed samples:        38320 | consumed tokens:     78479360 | elapsed time per iteration (s): 15.19 | learning rate: 1.256E-05 | global batch size:    16 | lm loss: 6.187210E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2396/  128728 | consumed samples:        38336 | consumed tokens:     78512128 | elapsed time per iteration (s): 15.24 | learning rate: 1.256E-05 | global batch size:    16 | lm loss: 6.128231E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2397/  128728 | consumed samples:        38352 | consumed tokens:     78544896 | elapsed time per iteration (s): 15.25 | learning rate: 1.257E-05 | global batch size:    16 | lm loss: 6.318326E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2398/  128728 | consumed samples:        38368 | consumed tokens:     78577664 | elapsed time per iteration (s): 15.20 | learning rate: 1.257E-05 | global batch size:    16 | lm loss: 6.380693E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2399/  128728 | consumed samples:        38384 | consumed tokens:     78610432 | elapsed time per iteration (s): 15.23 | learning rate: 1.258E-05 | global batch size:    16 | lm loss: 6.091791E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2400/  128728 | consumed samples:        38400 | consumed tokens:     78643200 | elapsed time per iteration (s): 15.19 | learning rate: 1.258E-05 | global batch size:    16 | lm loss: 6.328229E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2401/  128728 | consumed samples:        38416 | consumed tokens:     78675968 | elapsed time per iteration (s): 15.21 | learning rate: 1.259E-05 | global batch size:    16 | lm loss: 6.066146E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2402/  128728 | consumed samples:        38432 | consumed tokens:     78708736 | elapsed time per iteration (s): 15.23 | learning rate: 1.259E-05 | global batch size:    16 | lm loss: 6.242963E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2403/  128728 | consumed samples:        38448 | consumed tokens:     78741504 | elapsed time per iteration (s): 15.22 | learning rate: 1.260E-05 | global batch size:    16 | lm loss: 6.153259E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2404/  128728 | consumed samples:        38464 | consumed tokens:     78774272 | elapsed time per iteration (s): 15.21 | learning rate: 1.260E-05 | global batch size:    16 | lm loss: 6.031731E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2405/  128728 | consumed samples:        38480 | consumed tokens:     78807040 | elapsed time per iteration (s): 15.28 | learning rate: 1.261E-05 | global batch size:    16 | lm loss: 6.312222E+00 | grad norm: 1.150 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     2406/  128728 | consumed samples:        38496 | consumed tokens:     78839808 | elapsed time per iteration (s): 15.25 | learning rate: 1.261E-05 | global batch size:    16 | lm loss: 6.096965E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2407/  128728 | consumed samples:        38512 | consumed tokens:     78872576 | elapsed time per iteration (s): 15.22 | learning rate: 1.262E-05 | global batch size:    16 | lm loss: 6.310668E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2408/  128728 | consumed samples:        38528 | consumed tokens:     78905344 | elapsed time per iteration (s): 15.22 | learning rate: 1.262E-05 | global batch size:    16 | lm loss: 6.245684E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2409/  128728 | consumed samples:        38544 | consumed tokens:     78938112 | elapsed time per iteration (s): 15.25 | learning rate: 1.263E-05 | global batch size:    16 | lm loss: 6.369366E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2410/  128728 | consumed samples:        38560 | consumed tokens:     78970880 | elapsed time per iteration (s): 15.18 | learning rate: 1.264E-05 | global batch size:    16 | lm loss: 6.178571E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2411/  128728 | consumed samples:        38576 | consumed tokens:     79003648 | elapsed time per iteration (s): 15.18 | learning rate: 1.264E-05 | global batch size:    16 | lm loss: 6.309995E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2412/  128728 | consumed samples:        38592 | consumed tokens:     79036416 | elapsed time per iteration (s): 15.24 | learning rate: 1.265E-05 | global batch size:    16 | lm loss: 6.512557E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2413/  128728 | consumed samples:        38608 | consumed tokens:     79069184 | elapsed time per iteration (s): 15.26 | learning rate: 1.265E-05 | global batch size:    16 | lm loss: 6.460606E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2414/  128728 | consumed samples:        38624 | consumed tokens:     79101952 | elapsed time per iteration (s): 15.22 | learning rate: 1.266E-05 | global batch size:    16 | lm loss: 6.191257E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2415/  128728 | consumed samples:        38640 | consumed tokens:     79134720 | elapsed time per iteration (s): 15.23 | learning rate: 1.266E-05 | global batch size:    16 | lm loss: 6.224933E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2416/  128728 | consumed samples:        38656 | consumed tokens:     79167488 | elapsed time per iteration (s): 15.20 | learning rate: 1.267E-05 | global batch size:    16 | lm loss: 6.085470E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2417/  128728 | consumed samples:        38672 | consumed tokens:     79200256 | elapsed time per iteration (s): 15.21 | learning rate: 1.267E-05 | global batch size:    16 | lm loss: 6.211289E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2418/  128728 | consumed samples:        38688 | consumed tokens:     79233024 | elapsed time per iteration (s): 15.23 | learning rate: 1.268E-05 | global batch size:    16 | lm loss: 6.217436E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2419/  128728 | consumed samples:        38704 | consumed tokens:     79265792 | elapsed time per iteration (s): 15.24 | learning rate: 1.268E-05 | global batch size:    16 | lm loss: 6.188772E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2420/  128728 | consumed samples:        38720 | consumed tokens:     79298560 | elapsed time per iteration (s): 15.16 | learning rate: 1.269E-05 | global batch size:    16 | lm loss: 6.116030E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2421/  128728 | consumed samples:        38736 | consumed tokens:     79331328 | elapsed time per iteration (s): 15.18 | learning rate: 1.269E-05 | global batch size:    16 | lm loss: 6.088687E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2422/  128728 | consumed samples:        38752 | consumed tokens:     79364096 | elapsed time per iteration (s): 15.20 | learning rate: 1.270E-05 | global batch size:    16 | lm loss: 6.145873E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2423/  128728 | consumed samples:        38768 | consumed tokens:     79396864 | elapsed time per iteration (s): 15.23 | learning rate: 1.270E-05 | global batch size:    16 | lm loss: 6.244073E+00 | grad norm: 1.080 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2424/  128728 | consumed samples:        38784 | consumed tokens:     79429632 | elapsed time per iteration (s): 15.19 | learning rate: 1.271E-05 | global batch size:    16 | lm loss: 6.340772E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2425/  128728 | consumed samples:        38800 | consumed tokens:     79462400 | elapsed time per iteration (s): 15.16 | learning rate: 1.271E-05 | global batch size:    16 | lm loss: 6.090518E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2426/  128728 | consumed samples:        38816 | consumed tokens:     79495168 | elapsed time per iteration (s): 15.23 | learning rate: 1.272E-05 | global batch size:    16 | lm loss: 6.469316E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2427/  128728 | consumed samples:        38832 | consumed tokens:     79527936 | elapsed time per iteration (s): 15.16 | learning rate: 1.272E-05 | global batch size:    16 | lm loss: 6.311891E+00 | grad norm: 0.883 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2428/  128728 | consumed samples:        38848 | consumed tokens:     79560704 | elapsed time per iteration (s): 15.16 | learning rate: 1.273E-05 | global batch size:    16 | lm loss: 6.165546E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2429/  128728 | consumed samples:        38864 | consumed tokens:     79593472 | elapsed time per iteration (s): 15.24 | learning rate: 1.273E-05 | global batch size:    16 | lm loss: 6.298426E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2430/  128728 | consumed samples:        38880 | consumed tokens:     79626240 | elapsed time per iteration (s): 15.22 | learning rate: 1.274E-05 | global batch size:    16 | lm loss: 6.042541E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2431/  128728 | consumed samples:        38896 | consumed tokens:     79659008 | elapsed time per iteration (s): 15.19 | learning rate: 1.275E-05 | global batch size:    16 | lm loss: 6.275948E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2432/  128728 | consumed samples:        38912 | consumed tokens:     79691776 | elapsed time per iteration (s): 15.24 | learning rate: 1.275E-05 | global batch size:    16 | lm loss: 6.168737E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2433/  128728 | consumed samples:        38928 | consumed tokens:     79724544 | elapsed time per iteration (s): 15.21 | learning rate: 1.276E-05 | global batch size:    16 | lm loss: 6.469221E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2434/  128728 | consumed samples:        38944 | consumed tokens:     79757312 | elapsed time per iteration (s): 15.23 | learning rate: 1.276E-05 | global batch size:    16 | lm loss: 6.241963E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2435/  128728 | consumed samples:        38960 | consumed tokens:     79790080 | elapsed time per iteration (s): 15.23 | learning rate: 1.277E-05 | global batch size:    16 | lm loss: 6.322588E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2436/  128728 | consumed samples:        38976 | consumed tokens:     79822848 | elapsed time per iteration (s): 15.21 | learning rate: 1.277E-05 | global batch size:    16 | lm loss: 6.185337E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2437/  128728 | consumed samples:        38992 | consumed tokens:     79855616 | elapsed time per iteration (s): 15.22 | learning rate: 1.278E-05 | global batch size:    16 | lm loss: 6.192573E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2438/  128728 | consumed samples:        39008 | consumed tokens:     79888384 | elapsed time per iteration (s): 15.25 | learning rate: 1.278E-05 | global batch size:    16 | lm loss: 6.097382E+00 | grad norm: 1.021 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2439/  128728 | consumed samples:        39024 | consumed tokens:     79921152 | elapsed time per iteration (s): 15.22 | learning rate: 1.279E-05 | global batch size:    16 | lm loss: 6.090995E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2440/  128728 | consumed samples:        39040 | consumed tokens:     79953920 | elapsed time per iteration (s): 15.24 | learning rate: 1.279E-05 | global batch size:    16 | lm loss: 6.367899E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2441/  128728 | consumed samples:        39056 | consumed tokens:     79986688 | elapsed time per iteration (s): 15.22 | learning rate: 1.280E-05 | global batch size:    16 | lm loss: 6.321862E+00 | grad norm: 1.128 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2442/  128728 | consumed samples:        39072 | consumed tokens:     80019456 | elapsed time per iteration (s): 15.24 | learning rate: 1.280E-05 | global batch size:    16 | lm loss: 6.289917E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2443/  128728 | consumed samples:        39088 | consumed tokens:     80052224 | elapsed time per iteration (s): 15.22 | learning rate: 1.281E-05 | global batch size:    16 | lm loss: 6.321412E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2444/  128728 | consumed samples:        39104 | consumed tokens:     80084992 | elapsed time per iteration (s): 15.25 | learning rate: 1.281E-05 | global batch size:    16 | lm loss: 6.268347E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2445/  128728 | consumed samples:        39120 | consumed tokens:     80117760 | elapsed time per iteration (s): 15.26 | learning rate: 1.282E-05 | global batch size:    16 | lm loss: 6.359283E+00 | grad norm: 1.130 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2446/  128728 | consumed samples:        39136 | consumed tokens:     80150528 | elapsed time per iteration (s): 15.21 | learning rate: 1.282E-05 | global batch size:    16 | lm loss: 6.297738E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2447/  128728 | consumed samples:        39152 | consumed tokens:     80183296 | elapsed time per iteration (s): 15.21 | learning rate: 1.283E-05 | global batch size:    16 | lm loss: 6.223781E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2448/  128728 | consumed samples:        39168 | consumed tokens:     80216064 | elapsed time per iteration (s): 15.22 | learning rate: 1.283E-05 | global batch size:    16 | lm loss: 6.075761E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2449/  128728 | consumed samples:        39184 | consumed tokens:     80248832 | elapsed time per iteration (s): 15.23 | learning rate: 1.284E-05 | global batch size:    16 | lm loss: 6.331431E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2450/  128728 | consumed samples:        39200 | consumed tokens:     80281600 | elapsed time per iteration (s): 15.21 | learning rate: 1.285E-05 | global batch size:    16 | lm loss: 6.386670E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2451/  128728 | consumed samples:        39216 | consumed tokens:     80314368 | elapsed time per iteration (s): 15.24 | learning rate: 1.285E-05 | global batch size:    16 | lm loss: 5.956208E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2452/  128728 | consumed samples:        39232 | consumed tokens:     80347136 | elapsed time per iteration (s): 15.19 | learning rate: 1.286E-05 | global batch size:    16 | lm loss: 6.244837E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2453/  128728 | consumed samples:        39248 | consumed tokens:     80379904 | elapsed time per iteration (s): 15.21 | learning rate: 1.286E-05 | global batch size:    16 | lm loss: 6.260391E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2454/  128728 | consumed samples:        39264 | consumed tokens:     80412672 | elapsed time per iteration (s): 15.19 | learning rate: 1.287E-05 | global batch size:    16 | lm loss: 6.223280E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2455/  128728 | consumed samples:        39280 | consumed tokens:     80445440 | elapsed time per iteration (s): 15.21 | learning rate: 1.287E-05 | global batch size:    16 | lm loss: 6.260831E+00 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2456/  128728 | consumed samples:        39296 | consumed tokens:     80478208 | elapsed time per iteration (s): 15.19 | learning rate: 1.288E-05 | global batch size:    16 | lm loss: 6.206467E+00 | grad norm: 0.935 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2457/  128728 | consumed samples:        39312 | consumed tokens:     80510976 | elapsed time per iteration (s): 15.22 | learning rate: 1.288E-05 | global batch size:    16 | lm loss: 6.305583E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2458/  128728 | consumed samples:        39328 | consumed tokens:     80543744 | elapsed time per iteration (s): 15.25 | learning rate: 1.289E-05 | global batch size:    16 | lm loss: 6.003456E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2459/  128728 | consumed samples:        39344 | consumed tokens:     80576512 | elapsed time per iteration (s): 15.16 | learning rate: 1.289E-05 | global batch size:    16 | lm loss: 6.305748E+00 | grad norm: 1.191 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2460/  128728 | consumed samples:        39360 | consumed tokens:     80609280 | elapsed time per iteration (s): 15.25 | learning rate: 1.290E-05 | global batch size:    16 | lm loss: 6.356349E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2461/  128728 | consumed samples:        39376 | consumed tokens:     80642048 | elapsed time per iteration (s): 15.24 | learning rate: 1.290E-05 | global batch size:    16 | lm loss: 6.082371E+00 | grad norm: 1.122 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2462/  128728 | consumed samples:        39392 | consumed tokens:     80674816 | elapsed time per iteration (s): 15.21 | learning rate: 1.291E-05 | global batch size:    16 | lm loss: 6.293061E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2463/  128728 | consumed samples:        39408 | consumed tokens:     80707584 | elapsed time per iteration (s): 15.20 | learning rate: 1.291E-05 | global batch size:    16 | lm loss: 6.216317E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2464/  128728 | consumed samples:        39424 | consumed tokens:     80740352 | elapsed time per iteration (s): 15.22 | learning rate: 1.292E-05 | global batch size:    16 | lm loss: 6.274666E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2465/  128728 | consumed samples:        39440 | consumed tokens:     80773120 | elapsed time per iteration (s): 15.28 | learning rate: 1.292E-05 | global batch size:    16 | lm loss: 6.314239E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     2466/  128728 | consumed samples:        39456 | consumed tokens:     80805888 | elapsed time per iteration (s): 15.21 | learning rate: 1.293E-05 | global batch size:    16 | lm loss: 6.179266E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2467/  128728 | consumed samples:        39472 | consumed tokens:     80838656 | elapsed time per iteration (s): 15.22 | learning rate: 1.293E-05 | global batch size:    16 | lm loss: 6.121453E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2468/  128728 | consumed samples:        39488 | consumed tokens:     80871424 | elapsed time per iteration (s): 15.24 | learning rate: 1.294E-05 | global batch size:    16 | lm loss: 6.419597E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2469/  128728 | consumed samples:        39504 | consumed tokens:     80904192 | elapsed time per iteration (s): 15.27 | learning rate: 1.294E-05 | global batch size:    16 | lm loss: 6.172673E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     2470/  128728 | consumed samples:        39520 | consumed tokens:     80936960 | elapsed time per iteration (s): 15.24 | learning rate: 1.295E-05 | global batch size:    16 | lm loss: 6.166053E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2471/  128728 | consumed samples:        39536 | consumed tokens:     80969728 | elapsed time per iteration (s): 15.26 | learning rate: 1.296E-05 | global batch size:    16 | lm loss: 6.552093E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2472/  128728 | consumed samples:        39552 | consumed tokens:     81002496 | elapsed time per iteration (s): 15.26 | learning rate: 1.296E-05 | global batch size:    16 | lm loss: 6.085385E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2473/  128728 | consumed samples:        39568 | consumed tokens:     81035264 | elapsed time per iteration (s): 15.20 | learning rate: 1.297E-05 | global batch size:    16 | lm loss: 6.246649E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2474/  128728 | consumed samples:        39584 | consumed tokens:     81068032 | elapsed time per iteration (s): 15.25 | learning rate: 1.297E-05 | global batch size:    16 | lm loss: 6.106105E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     2475/  128728 | consumed samples:        39600 | consumed tokens:     81100800 | elapsed time per iteration (s): 15.16 | learning rate: 1.298E-05 | global batch size:    16 | lm loss: 5.814936E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2476/  128728 | consumed samples:        39616 | consumed tokens:     81133568 | elapsed time per iteration (s): 15.21 | learning rate: 1.298E-05 | global batch size:    16 | lm loss: 6.232026E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2477/  128728 | consumed samples:        39632 | consumed tokens:     81166336 | elapsed time per iteration (s): 15.21 | learning rate: 1.299E-05 | global batch size:    16 | lm loss: 6.282386E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2478/  128728 | consumed samples:        39648 | consumed tokens:     81199104 | elapsed time per iteration (s): 15.22 | learning rate: 1.299E-05 | global batch size:    16 | lm loss: 6.110389E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2479/  128728 | consumed samples:        39664 | consumed tokens:     81231872 | elapsed time per iteration (s): 15.20 | learning rate: 1.300E-05 | global batch size:    16 | lm loss: 6.111573E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2480/  128728 | consumed samples:        39680 | consumed tokens:     81264640 | elapsed time per iteration (s): 15.27 | learning rate: 1.300E-05 | global batch size:    16 | lm loss: 6.483891E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     2481/  128728 | consumed samples:        39696 | consumed tokens:     81297408 | elapsed time per iteration (s): 15.17 | learning rate: 1.301E-05 | global batch size:    16 | lm loss: 6.348729E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2482/  128728 | consumed samples:        39712 | consumed tokens:     81330176 | elapsed time per iteration (s): 15.18 | learning rate: 1.301E-05 | global batch size:    16 | lm loss: 6.445699E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2483/  128728 | consumed samples:        39728 | consumed tokens:     81362944 | elapsed time per iteration (s): 15.23 | learning rate: 1.302E-05 | global batch size:    16 | lm loss: 6.384290E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2484/  128728 | consumed samples:        39744 | consumed tokens:     81395712 | elapsed time per iteration (s): 15.20 | learning rate: 1.302E-05 | global batch size:    16 | lm loss: 6.514880E+00 | grad norm: 0.913 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2485/  128728 | consumed samples:        39760 | consumed tokens:     81428480 | elapsed time per iteration (s): 15.18 | learning rate: 1.303E-05 | global batch size:    16 | lm loss: 6.243723E+00 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2486/  128728 | consumed samples:        39776 | consumed tokens:     81461248 | elapsed time per iteration (s): 15.24 | learning rate: 1.303E-05 | global batch size:    16 | lm loss: 6.220292E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2487/  128728 | consumed samples:        39792 | consumed tokens:     81494016 | elapsed time per iteration (s): 15.22 | learning rate: 1.304E-05 | global batch size:    16 | lm loss: 6.380357E+00 | grad norm: 1.296 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2488/  128728 | consumed samples:        39808 | consumed tokens:     81526784 | elapsed time per iteration (s): 15.23 | learning rate: 1.304E-05 | global batch size:    16 | lm loss: 6.065780E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2489/  128728 | consumed samples:        39824 | consumed tokens:     81559552 | elapsed time per iteration (s): 15.26 | learning rate: 1.305E-05 | global batch size:    16 | lm loss: 6.013194E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2490/  128728 | consumed samples:        39840 | consumed tokens:     81592320 | elapsed time per iteration (s): 15.18 | learning rate: 1.305E-05 | global batch size:    16 | lm loss: 6.132867E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2491/  128728 | consumed samples:        39856 | consumed tokens:     81625088 | elapsed time per iteration (s): 15.21 | learning rate: 1.306E-05 | global batch size:    16 | lm loss: 6.028798E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2492/  128728 | consumed samples:        39872 | consumed tokens:     81657856 | elapsed time per iteration (s): 15.17 | learning rate: 1.307E-05 | global batch size:    16 | lm loss: 6.127688E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     2493/  128728 | consumed samples:        39888 | consumed tokens:     81690624 | elapsed time per iteration (s): 15.27 | learning rate: 1.307E-05 | global batch size:    16 | lm loss: 6.248683E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     2494/  128728 | consumed samples:        39904 | consumed tokens:     81723392 | elapsed time per iteration (s): 15.23 | learning rate: 1.308E-05 | global batch size:    16 | lm loss: 6.398225E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2495/  128728 | consumed samples:        39920 | consumed tokens:     81756160 | elapsed time per iteration (s): 15.21 | learning rate: 1.308E-05 | global batch size:    16 | lm loss: 6.293244E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2496/  128728 | consumed samples:        39936 | consumed tokens:     81788928 | elapsed time per iteration (s): 15.24 | learning rate: 1.309E-05 | global batch size:    16 | lm loss: 6.195220E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2497/  128728 | consumed samples:        39952 | consumed tokens:     81821696 | elapsed time per iteration (s): 15.23 | learning rate: 1.309E-05 | global batch size:    16 | lm loss: 6.319185E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2498/  128728 | consumed samples:        39968 | consumed tokens:     81854464 | elapsed time per iteration (s): 15.23 | learning rate: 1.310E-05 | global batch size:    16 | lm loss: 6.121705E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2499/  128728 | consumed samples:        39984 | consumed tokens:     81887232 | elapsed time per iteration (s): 15.28 | learning rate: 1.310E-05 | global batch size:    16 | lm loss: 6.574756E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     2500/  128728 | consumed samples:        40000 | consumed tokens:     81920000 | elapsed time per iteration (s): 15.24 | learning rate: 1.311E-05 | global batch size:    16 | lm loss: 5.883192E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default0]:saving checkpoint at iteration    2500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:[2022-03-03 16:32:24,121] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/mp_rank_00_model_states.pt
[default1]:[2022-03-03 16:32:24,906] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/mp_rank_01_model_states.pt
[default1]:[2022-03-03 16:32:34,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt
[default2]:[2022-03-03 16:32:35,588] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 16:32:35,832] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt
[default1]:[2022-03-03 16:32:36,265] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default0]:[2022-03-03 16:32:36,251] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt
[default1]:[2022-03-03 16:32:36,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt
[default7]:[2022-03-03 16:32:36,508] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt
[default4]:[2022-03-03 16:32:36,388] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt
[default5]:[2022-03-03 16:32:36,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt
[default6]:[2022-03-03 16:32:36,381] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 16:32:36,520] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default0]:[2022-03-03 16:32:36,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt
[default1]:[2022-03-03 16:32:36,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt
[default5]:[2022-03-03 16:32:37,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 16:32:36,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt
[default7]:[2022-03-03 16:32:37,114] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt
[default4]:[2022-03-03 16:32:37,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default6]:[2022-03-03 16:32:37,226] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt
[default2]:[2022-03-03 16:32:37,264] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt
[default6]:[2022-03-03 16:32:37,347] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt
[default2]:[2022-03-03 16:32:37,535] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 16:32:37,590] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default0]:[2022-03-03 16:32:37,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 16:32:37,840] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 16:32:37,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt
[default3]:[2022-03-03 16:32:37,819] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt
[default5]:[2022-03-03 16:32:37,944] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default0]:[2022-03-03 16:32:38,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 16:32:38,517] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 16:32:38,645] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt
[default6]:[2022-03-03 16:32:38,636] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default6]:[2022-03-03 16:32:38,876] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default0]:[2022-03-03 16:32:38,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default4]:[2022-03-03 16:32:38,927] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 16:32:38,934] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt
[default5]:[2022-03-03 16:32:38,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default1]:[2022-03-03 16:32:39,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 16:32:39,028] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default0]:[2022-03-03 16:32:39,036] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default5]:[2022-03-03 16:32:39,105] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default3]:[2022-03-03 16:32:39,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt
[default2]:[2022-03-03 16:32:39,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default0]:[2022-03-03 16:32:39,230] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 16:32:39,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default3]:[2022-03-03 16:32:39,264] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default7]:[2022-03-03 16:32:39,389] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt
[default4]:[2022-03-03 16:32:39,397] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt
[default1]:[2022-03-03 16:32:39,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt
[default2]:[2022-03-03 16:32:39,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 16:32:39,438] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 16:32:39,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt
[default7]:[2022-03-03 16:32:39,453] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt
[default5]:[2022-03-03 16:32:39,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 16:32:39,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt
[default6]:[2022-03-03 16:32:39,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt
[default3]:[2022-03-03 16:32:39,539] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default5]:[2022-03-03 16:32:39,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt
[default5]:[2022-03-03 16:32:39,544] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt
[default5]:[2022-03-03 16:32:39,596] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 16:32:39,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt
[default1]:[2022-03-03 16:32:39,636] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 16:32:39,660] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt
[default3]:[2022-03-03 16:32:39,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default4]:[2022-03-03 16:32:39,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 16:32:39,982] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default5]:[2022-03-03 16:32:40,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 16:32:40,074] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default1]:[2022-03-03 16:32:40,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt
[default4]:[2022-03-03 16:32:40,262] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 16:32:40,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 16:32:40,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt
[default7]:[2022-03-03 16:32:40,350] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default2]:[2022-03-03 16:32:40,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default5]:[2022-03-03 16:32:40,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default2]:[2022-03-03 16:32:40,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default7]:[2022-03-03 16:32:40,584] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt
[default1]:[2022-03-03 16:32:40,563] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt
[default1]:[2022-03-03 16:32:40,613] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt
[default6]:[2022-03-03 16:32:40,621] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt
[default0]:[2022-03-03 16:32:40,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default1]:[2022-03-03 16:32:40,588] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default2]:[2022-03-03 16:32:40,578] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt
[default2]:[2022-03-03 16:32:40,617] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default0]:[2022-03-03 16:32:40,687] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
[default1]:[2022-03-03 16:32:40,679] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 16:32:40,696] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default1]:[2022-03-03 16:32:40,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 16:32:40,796] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
[default5]:[2022-03-03 16:32:40,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt
[default3]:[2022-03-03 16:32:40,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 16:32:40,832] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 16:32:40,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt
[default1]:[2022-03-03 16:32:40,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default3]:[2022-03-03 16:32:40,859] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt
[default4]:[2022-03-03 16:32:40,823] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
[default4]:[2022-03-03 16:32:40,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default5]:[2022-03-03 16:32:40,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 16:32:40,955] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt
[default6]:[2022-03-03 16:32:41,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 16:32:40,937] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 16:32:40,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default3]:[2022-03-03 16:32:41,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 16:32:41,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt
[default2]:[2022-03-03 16:32:41,198] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt
[default0]:[2022-03-03 16:32:41,228] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt
[default4]:[2022-03-03 16:32:41,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt
[default7]:[2022-03-03 16:32:41,316] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt
[default2]:[2022-03-03 16:32:41,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default4]:[2022-03-03 16:32:41,401] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 16:32:41,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt
[default0]:[2022-03-03 16:32:41,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 16:32:41,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt
[default0]:[2022-03-03 16:32:41,471] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt
[default4]:[2022-03-03 16:32:41,495] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt
[default7]:[2022-03-03 16:32:41,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 16:32:41,507] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt
[default1]:[2022-03-03 16:32:41,635] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt
[default6]:[2022-03-03 16:32:41,651] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 16:32:41,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt
[default2]:[2022-03-03 16:32:41,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 16:32:41,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default7]:[2022-03-03 16:32:41,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt
[default3]:[2022-03-03 16:32:41,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 16:32:41,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default4]:[2022-03-03 16:32:41,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 16:32:41,694] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt
[default7]:[2022-03-03 16:32:41,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt
[default6]:[2022-03-03 16:32:41,862] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt
[default2]:[2022-03-03 16:32:41,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 16:32:41,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 16:32:41,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt
[default4]:[2022-03-03 16:32:41,815] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 16:32:42,019] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt
[default5]:[2022-03-03 16:32:42,047] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt
[default3]:[2022-03-03 16:32:42,168] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt
[default3]:[2022-03-03 16:32:42,189] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default2]:[2022-03-03 16:32:42,154] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt
[default7]:[2022-03-03 16:32:42,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt
[default1]:[2022-03-03 16:32:42,415] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt
[default7]:[2022-03-03 16:32:42,499] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default1]:[2022-03-03 16:32:42,544] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt
[default0]:[2022-03-03 16:32:42,469] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt
[default6]:[2022-03-03 16:32:42,466] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt
[default2]:[2022-03-03 16:32:42,543] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt
[default0]:[2022-03-03 16:32:42,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt
[default2]:[2022-03-03 16:32:42,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt
[default5]:[2022-03-03 16:32:42,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt
[default4]:[2022-03-03 16:32:42,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 16:32:42,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt
[default5]:[2022-03-03 16:32:42,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default1]:[2022-03-03 16:32:42,691] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt
[default0]:[2022-03-03 16:32:42,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 16:32:42,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default6]:[2022-03-03 16:32:42,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt
[default2]:[2022-03-03 16:32:42,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt
[default1]:[2022-03-03 16:32:42,983] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 16:32:42,948] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt
[default1]:[2022-03-03 16:32:42,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt
[default7]:[2022-03-03 16:32:43,031] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt
[default3]:[2022-03-03 16:32:43,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt
[default0]:[2022-03-03 16:32:43,048] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt
[default3]:[2022-03-03 16:32:42,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt
[default4]:[2022-03-03 16:32:43,082] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt
[default0]:[2022-03-03 16:32:43,088] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default0]:[2022-03-03 16:32:43,049] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt
[default2]:[2022-03-03 16:32:43,076] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default7]:[2022-03-03 16:32:43,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt
[default5]:[2022-03-03 16:32:43,121] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt
[default1]:[2022-03-03 16:32:43,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt
[default3]:[2022-03-03 16:32:43,094] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt
[default3]:[2022-03-03 16:32:43,156] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt
[default0]:[2022-03-03 16:32:43,142] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt
[default7]:[2022-03-03 16:32:43,162] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 16:32:43,166] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt
[default4]:[2022-03-03 16:32:43,184] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default3]:[2022-03-03 16:32:43,193] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 16:32:43,208] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt
[default1]:[2022-03-03 16:32:43,291] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt
[default6]:[2022-03-03 16:32:43,248] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 16:32:43,312] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 16:32:43,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default3]:[2022-03-03 16:32:43,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt
[default7]:[2022-03-03 16:32:43,318] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default3]:[2022-03-03 16:32:43,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 16:32:43,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 16:32:43,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt
[default6]:[2022-03-03 16:32:43,435] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default5]:[2022-03-03 16:32:43,411] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default2]:[2022-03-03 16:32:43,398] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 16:32:43,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt
[default3]:[2022-03-03 16:32:43,474] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default4]:[2022-03-03 16:32:43,481] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 16:32:43,455] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt
[default3]:[2022-03-03 16:32:43,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default3]:[2022-03-03 16:32:43,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 16:32:43,507] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 16:32:43,541] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 16:32:43,579] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt
[default1]:[2022-03-03 16:32:43,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt
[default2]:[2022-03-03 16:32:43,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default6]:[2022-03-03 16:32:43,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default1]:[2022-03-03 16:32:43,687] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 16:32:43,727] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt
[default0]:[2022-03-03 16:32:43,724] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
[default0]:[2022-03-03 16:32:43,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 16:32:43,812] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt
[default1]:[2022-03-03 16:32:43,862] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt
[default0]:[2022-03-03 16:32:43,858] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt
[default1]:[2022-03-03 16:32:43,853] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 16:32:43,941] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default0]:[2022-03-03 16:32:43,892] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt
[default0]:[2022-03-03 16:32:43,963] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 16:32:43,945] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt
[default6]:[2022-03-03 16:32:44,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt
[default1]:[2022-03-03 16:32:44,029] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt
[default7]:[2022-03-03 16:32:44,030] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default5]:[2022-03-03 16:32:44,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt
[default0]:[2022-03-03 16:32:44,061] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default7]:[2022-03-03 16:32:44,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt
[default0]:[2022-03-03 16:32:44,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt
[default1]:[2022-03-03 16:32:44,124] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default6]:[2022-03-03 16:32:44,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default7]:[2022-03-03 16:32:44,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default4]:[2022-03-03 16:32:44,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt
[default2]:[2022-03-03 16:32:44,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default6]:[2022-03-03 16:32:44,365] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 16:32:44,348] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default7]:[2022-03-03 16:32:44,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt
[default6]:[2022-03-03 16:32:44,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt
[default5]:[2022-03-03 16:32:44,364] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt
[default7]:[2022-03-03 16:32:44,512] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt
[default1]:[2022-03-03 16:32:44,536] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default2]:[2022-03-03 16:32:44,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt
[default3]:[2022-03-03 16:32:44,590] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 16:32:44,581] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt
[default7]:[2022-03-03 16:32:44,676] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt
[default4]:[2022-03-03 16:32:44,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 16:32:44,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt
[default6]:[2022-03-03 16:32:44,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default6]:[2022-03-03 16:32:44,670] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default4]:[2022-03-03 16:32:44,644] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt
[default2]:[2022-03-03 16:32:44,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 16:32:44,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default7]:[2022-03-03 16:32:44,715] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 16:32:44,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default4]:[2022-03-03 16:32:44,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 16:32:44,865] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 16:32:44,934] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 16:32:44,990] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default2]:[2022-03-03 16:32:44,962] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt
[default5]:[2022-03-03 16:32:45,013] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default4]:[2022-03-03 16:32:45,095] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default5]:[2022-03-03 16:32:45,134] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 16:32:45,138] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default5]:[2022-03-03 16:32:45,108] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default2]:[2022-03-03 16:32:45,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt
[default1]:[2022-03-03 16:32:45,180] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default3]:[2022-03-03 16:32:45,272] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 16:32:45,272] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt
[default4]:[2022-03-03 16:32:45,202] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt
[default7]:[2022-03-03 16:32:45,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt
[default2]:[2022-03-03 16:32:45,326] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 16:32:45,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt
[default6]:[2022-03-03 16:32:45,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 16:32:45,445] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 16:32:45,463] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default2]:[2022-03-03 16:32:45,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt
[default7]:[2022-03-03 16:32:45,607] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default1]:[2022-03-03 16:32:45,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 16:32:45,467] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default2]:[2022-03-03 16:32:45,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt
[default1]:[2022-03-03 16:32:45,809] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default7]:[2022-03-03 16:32:45,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default5]:[2022-03-03 16:32:45,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default6]:[2022-03-03 16:32:45,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default1]:[2022-03-03 16:32:45,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt
[default7]:[2022-03-03 16:32:45,989] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt
[default0]:[2022-03-03 16:32:46,072] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 16:32:46,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt
[default5]:[2022-03-03 16:32:46,201] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default2]:[2022-03-03 16:32:46,266] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt
[default3]:[2022-03-03 16:32:46,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt
[default1]:[2022-03-03 16:32:46,232] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default6]:[2022-03-03 16:32:46,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt
[default3]:[2022-03-03 16:32:46,302] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default3]:[2022-03-03 16:32:46,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt
[default3]:[2022-03-03 16:32:46,259] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt
[default4]:[2022-03-03 16:32:46,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt
[default3]:[2022-03-03 16:32:46,233] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 16:32:46,305] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default7]:[2022-03-03 16:32:46,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 16:32:46,232] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default0]:[2022-03-03 16:32:46,366] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 16:32:46,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 16:32:46,380] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt
[default6]:[2022-03-03 16:32:46,372] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default5]:[2022-03-03 16:32:46,463] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt
[default0]:[2022-03-03 16:32:46,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default3]:[2022-03-03 16:32:46,486] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 16:32:46,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 16:32:46,563] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default4]:[2022-03-03 16:32:46,484] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt
[default1]:[2022-03-03 16:32:46,559] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 16:32:46,584] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default0]:[2022-03-03 16:32:46,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default6]:[2022-03-03 16:32:46,629] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default7]:[2022-03-03 16:32:46,584] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 16:32:46,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt
[default3]:[2022-03-03 16:32:46,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default6]:[2022-03-03 16:32:46,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default2]:[2022-03-03 16:32:46,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default2]:[2022-03-03 16:32:46,714] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt
[default6]:[2022-03-03 16:32:46,678] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt
[default1]:[2022-03-03 16:32:46,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default2]:[2022-03-03 16:32:46,843] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt
[default4]:[2022-03-03 16:32:46,887] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 16:32:46,855] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt
[default3]:[2022-03-03 16:32:46,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt
[default0]:[2022-03-03 16:32:46,986] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default0]:[2022-03-03 16:32:46,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt
[default4]:[2022-03-03 16:32:47,009] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 16:32:46,970] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 16:32:46,934] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt
[default3]:[2022-03-03 16:32:46,987] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 16:32:47,044] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt
[default1]:[2022-03-03 16:32:46,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt
[default5]:[2022-03-03 16:32:47,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 16:32:47,119] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default7]:[2022-03-03 16:32:47,462] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 16:32:47,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default2]:[2022-03-03 16:32:47,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default1]:[2022-03-03 16:32:47,629] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default0]:[2022-03-03 16:32:47,757] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default1]:[2022-03-03 16:32:47,708] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default5]:[2022-03-03 16:32:47,669] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default7]:[2022-03-03 16:32:47,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default7]:[2022-03-03 16:32:47,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 16:32:47,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 16:32:47,822] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default6]:[2022-03-03 16:32:47,840] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default6]:[2022-03-03 16:32:47,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default0]:[2022-03-03 16:32:47,889] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default0]:[2022-03-03 16:32:47,940] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default4]:[2022-03-03 16:32:48,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default6]:[2022-03-03 16:32:48,137] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default5]:[2022-03-03 16:32:48,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 16:32:48,175] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt
[default7]:[2022-03-03 16:32:48,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 16:32:48,471] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt
[default4]:[2022-03-03 16:32:48,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 16:32:48,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt
[default2]:[2022-03-03 16:32:48,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default4]:[2022-03-03 16:32:48,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default6]:[2022-03-03 16:32:48,746] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 16:32:48,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default5]:[2022-03-03 16:32:48,837] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default6]:[2022-03-03 16:32:48,941] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default0]:[2022-03-03 16:32:48,993] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 16:32:49,456] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default5]:[2022-03-03 16:32:49,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 16:32:49,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default6]:[2022-03-03 16:32:49,534] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default1]:[2022-03-03 16:32:49,616] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt
[default1]:[2022-03-03 16:32:49,850] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default6]:[2022-03-03 16:32:50,223] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default1]:[2022-03-03 16:32:50,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default0]:[2022-03-03 16:32:51,368] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default0]:[2022-03-03 16:32:52,306] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default2]:[2022-03-03 16:32:52,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default6]:[2022-03-03 16:32:52,429] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 16:32:52,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default3]:[2022-03-03 16:32:52,455] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default4]:[2022-03-03 16:32:53,038] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default1]:[2022-03-03 16:32:53,106] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default5]:[2022-03-03 16:32:53,199] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default7]:[2022-03-03 16:32:53,272] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default4]:[2022-03-03 16:32:53,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 16:32:53,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default4]:[2022-03-03 16:32:53,489] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default4]:[2022-03-03 16:32:53,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default4]:[2022-03-03 16:32:53,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 16:32:53,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 16:32:53,529] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default0]:  successfully saved checkpoint at iteration    2500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default7]:time (ms) | save-checkpoint: 38545.52
[default5]:[2022-03-03 16:32:53,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default7]: iteration     2501/  128728 | consumed samples:        40016 | consumed tokens:     81952768 | elapsed time per iteration (s): 53.79 | learning rate: 1.311E-05 | global batch size:    16 | lm loss: 6.252749E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.297 | TFLOPs: 2.28 |
[default7]: iteration     2502/  128728 | consumed samples:        40032 | consumed tokens:     81985536 | elapsed time per iteration (s): 15.24 | learning rate: 1.312E-05 | global batch size:    16 | lm loss: 5.942217E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2503/  128728 | consumed samples:        40048 | consumed tokens:     82018304 | elapsed time per iteration (s): 15.22 | learning rate: 1.312E-05 | global batch size:    16 | lm loss: 6.333421E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2504/  128728 | consumed samples:        40064 | consumed tokens:     82051072 | elapsed time per iteration (s): 15.24 | learning rate: 1.313E-05 | global batch size:    16 | lm loss: 6.306670E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2505/  128728 | consumed samples:        40080 | consumed tokens:     82083840 | elapsed time per iteration (s): 15.24 | learning rate: 1.313E-05 | global batch size:    16 | lm loss: 6.183002E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2506/  128728 | consumed samples:        40096 | consumed tokens:     82116608 | elapsed time per iteration (s): 15.25 | learning rate: 1.314E-05 | global batch size:    16 | lm loss: 6.207052E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2507/  128728 | consumed samples:        40112 | consumed tokens:     82149376 | elapsed time per iteration (s): 15.25 | learning rate: 1.314E-05 | global batch size:    16 | lm loss: 6.162314E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2508/  128728 | consumed samples:        40128 | consumed tokens:     82182144 | elapsed time per iteration (s): 15.25 | learning rate: 1.315E-05 | global batch size:    16 | lm loss: 6.242827E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2509/  128728 | consumed samples:        40144 | consumed tokens:     82214912 | elapsed time per iteration (s): 15.21 | learning rate: 1.315E-05 | global batch size:    16 | lm loss: 6.144494E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2510/  128728 | consumed samples:        40160 | consumed tokens:     82247680 | elapsed time per iteration (s): 15.21 | learning rate: 1.316E-05 | global batch size:    16 | lm loss: 6.119376E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2511/  128728 | consumed samples:        40176 | consumed tokens:     82280448 | elapsed time per iteration (s): 15.18 | learning rate: 1.316E-05 | global batch size:    16 | lm loss: 6.218392E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2512/  128728 | consumed samples:        40192 | consumed tokens:     82313216 | elapsed time per iteration (s): 15.19 | learning rate: 1.317E-05 | global batch size:    16 | lm loss: 6.246577E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2513/  128728 | consumed samples:        40208 | consumed tokens:     82345984 | elapsed time per iteration (s): 15.23 | learning rate: 1.318E-05 | global batch size:    16 | lm loss: 6.041477E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2514/  128728 | consumed samples:        40224 | consumed tokens:     82378752 | elapsed time per iteration (s): 15.17 | learning rate: 1.318E-05 | global batch size:    16 | lm loss: 6.023715E+00 | grad norm: 0.856 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2515/  128728 | consumed samples:        40240 | consumed tokens:     82411520 | elapsed time per iteration (s): 15.24 | learning rate: 1.319E-05 | global batch size:    16 | lm loss: 6.201522E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2516/  128728 | consumed samples:        40256 | consumed tokens:     82444288 | elapsed time per iteration (s): 15.26 | learning rate: 1.319E-05 | global batch size:    16 | lm loss: 6.286212E+00 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2517/  128728 | consumed samples:        40272 | consumed tokens:     82477056 | elapsed time per iteration (s): 15.23 | learning rate: 1.320E-05 | global batch size:    16 | lm loss: 6.275428E+00 | grad norm: 1.433 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2518/  128728 | consumed samples:        40288 | consumed tokens:     82509824 | elapsed time per iteration (s): 15.22 | learning rate: 1.320E-05 | global batch size:    16 | lm loss: 6.296135E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2519/  128728 | consumed samples:        40304 | consumed tokens:     82542592 | elapsed time per iteration (s): 15.22 | learning rate: 1.321E-05 | global batch size:    16 | lm loss: 6.135454E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2520/  128728 | consumed samples:        40320 | consumed tokens:     82575360 | elapsed time per iteration (s): 15.23 | learning rate: 1.321E-05 | global batch size:    16 | lm loss: 6.092723E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2521/  128728 | consumed samples:        40336 | consumed tokens:     82608128 | elapsed time per iteration (s): 15.24 | learning rate: 1.322E-05 | global batch size:    16 | lm loss: 5.971050E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2522/  128728 | consumed samples:        40352 | consumed tokens:     82640896 | elapsed time per iteration (s): 15.22 | learning rate: 1.322E-05 | global batch size:    16 | lm loss: 6.361732E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2523/  128728 | consumed samples:        40368 | consumed tokens:     82673664 | elapsed time per iteration (s): 15.21 | learning rate: 1.323E-05 | global batch size:    16 | lm loss: 6.271525E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2524/  128728 | consumed samples:        40384 | consumed tokens:     82706432 | elapsed time per iteration (s): 15.24 | learning rate: 1.323E-05 | global batch size:    16 | lm loss: 5.950109E+00 | grad norm: 0.658 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2525/  128728 | consumed samples:        40400 | consumed tokens:     82739200 | elapsed time per iteration (s): 15.23 | learning rate: 1.324E-05 | global batch size:    16 | lm loss: 6.177237E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2526/  128728 | consumed samples:        40416 | consumed tokens:     82771968 | elapsed time per iteration (s): 15.21 | learning rate: 1.324E-05 | global batch size:    16 | lm loss: 6.452248E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2527/  128728 | consumed samples:        40432 | consumed tokens:     82804736 | elapsed time per iteration (s): 15.23 | learning rate: 1.325E-05 | global batch size:    16 | lm loss: 6.125947E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2528/  128728 | consumed samples:        40448 | consumed tokens:     82837504 | elapsed time per iteration (s): 15.23 | learning rate: 1.325E-05 | global batch size:    16 | lm loss: 6.275990E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2529/  128728 | consumed samples:        40464 | consumed tokens:     82870272 | elapsed time per iteration (s): 15.18 | learning rate: 1.326E-05 | global batch size:    16 | lm loss: 6.224532E+00 | grad norm: 1.093 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2530/  128728 | consumed samples:        40480 | consumed tokens:     82903040 | elapsed time per iteration (s): 15.22 | learning rate: 1.326E-05 | global batch size:    16 | lm loss: 6.188871E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2531/  128728 | consumed samples:        40496 | consumed tokens:     82935808 | elapsed time per iteration (s): 15.19 | learning rate: 1.327E-05 | global batch size:    16 | lm loss: 6.316185E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2532/  128728 | consumed samples:        40512 | consumed tokens:     82968576 | elapsed time per iteration (s): 15.17 | learning rate: 1.328E-05 | global batch size:    16 | lm loss: 6.173674E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2533/  128728 | consumed samples:        40528 | consumed tokens:     83001344 | elapsed time per iteration (s): 15.22 | learning rate: 1.328E-05 | global batch size:    16 | lm loss: 6.066485E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2534/  128728 | consumed samples:        40544 | consumed tokens:     83034112 | elapsed time per iteration (s): 15.23 | learning rate: 1.329E-05 | global batch size:    16 | lm loss: 5.854393E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2535/  128728 | consumed samples:        40560 | consumed tokens:     83066880 | elapsed time per iteration (s): 15.22 | learning rate: 1.329E-05 | global batch size:    16 | lm loss: 6.168399E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2536/  128728 | consumed samples:        40576 | consumed tokens:     83099648 | elapsed time per iteration (s): 15.21 | learning rate: 1.330E-05 | global batch size:    16 | lm loss: 6.101472E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2537/  128728 | consumed samples:        40592 | consumed tokens:     83132416 | elapsed time per iteration (s): 15.20 | learning rate: 1.330E-05 | global batch size:    16 | lm loss: 6.186237E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2538/  128728 | consumed samples:        40608 | consumed tokens:     83165184 | elapsed time per iteration (s): 15.26 | learning rate: 1.331E-05 | global batch size:    16 | lm loss: 6.074104E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2539/  128728 | consumed samples:        40624 | consumed tokens:     83197952 | elapsed time per iteration (s): 15.22 | learning rate: 1.331E-05 | global batch size:    16 | lm loss: 6.332224E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2540/  128728 | consumed samples:        40640 | consumed tokens:     83230720 | elapsed time per iteration (s): 15.23 | learning rate: 1.332E-05 | global batch size:    16 | lm loss: 6.297553E+00 | grad norm: 0.874 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2541/  128728 | consumed samples:        40656 | consumed tokens:     83263488 | elapsed time per iteration (s): 15.22 | learning rate: 1.332E-05 | global batch size:    16 | lm loss: 6.340025E+00 | grad norm: 1.212 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2542/  128728 | consumed samples:        40672 | consumed tokens:     83296256 | elapsed time per iteration (s): 15.22 | learning rate: 1.333E-05 | global batch size:    16 | lm loss: 6.225410E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2543/  128728 | consumed samples:        40688 | consumed tokens:     83329024 | elapsed time per iteration (s): 15.20 | learning rate: 1.333E-05 | global batch size:    16 | lm loss: 6.225701E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2544/  128728 | consumed samples:        40704 | consumed tokens:     83361792 | elapsed time per iteration (s): 15.19 | learning rate: 1.334E-05 | global batch size:    16 | lm loss: 6.324197E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2545/  128728 | consumed samples:        40720 | consumed tokens:     83394560 | elapsed time per iteration (s): 15.21 | learning rate: 1.334E-05 | global batch size:    16 | lm loss: 6.328512E+00 | grad norm: 1.342 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2546/  128728 | consumed samples:        40736 | consumed tokens:     83427328 | elapsed time per iteration (s): 15.15 | learning rate: 1.335E-05 | global batch size:    16 | lm loss: 6.272426E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2547/  128728 | consumed samples:        40752 | consumed tokens:     83460096 | elapsed time per iteration (s): 15.21 | learning rate: 1.335E-05 | global batch size:    16 | lm loss: 6.139767E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2548/  128728 | consumed samples:        40768 | consumed tokens:     83492864 | elapsed time per iteration (s): 15.19 | learning rate: 1.336E-05 | global batch size:    16 | lm loss: 5.918054E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2549/  128728 | consumed samples:        40784 | consumed tokens:     83525632 | elapsed time per iteration (s): 15.20 | learning rate: 1.336E-05 | global batch size:    16 | lm loss: 6.227513E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2550/  128728 | consumed samples:        40800 | consumed tokens:     83558400 | elapsed time per iteration (s): 15.20 | learning rate: 1.337E-05 | global batch size:    16 | lm loss: 6.322637E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2551/  128728 | consumed samples:        40816 | consumed tokens:     83591168 | elapsed time per iteration (s): 15.20 | learning rate: 1.337E-05 | global batch size:    16 | lm loss: 6.055058E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2552/  128728 | consumed samples:        40832 | consumed tokens:     83623936 | elapsed time per iteration (s): 15.20 | learning rate: 1.338E-05 | global batch size:    16 | lm loss: 6.212307E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2553/  128728 | consumed samples:        40848 | consumed tokens:     83656704 | elapsed time per iteration (s): 15.16 | learning rate: 1.339E-05 | global batch size:    16 | lm loss: 6.109908E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2554/  128728 | consumed samples:        40864 | consumed tokens:     83689472 | elapsed time per iteration (s): 15.21 | learning rate: 1.339E-05 | global batch size:    16 | lm loss: 6.312997E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2555/  128728 | consumed samples:        40880 | consumed tokens:     83722240 | elapsed time per iteration (s): 15.20 | learning rate: 1.340E-05 | global batch size:    16 | lm loss: 6.015300E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2556/  128728 | consumed samples:        40896 | consumed tokens:     83755008 | elapsed time per iteration (s): 15.21 | learning rate: 1.340E-05 | global batch size:    16 | lm loss: 6.201309E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2557/  128728 | consumed samples:        40912 | consumed tokens:     83787776 | elapsed time per iteration (s): 15.17 | learning rate: 1.341E-05 | global batch size:    16 | lm loss: 6.085812E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2558/  128728 | consumed samples:        40928 | consumed tokens:     83820544 | elapsed time per iteration (s): 15.24 | learning rate: 1.341E-05 | global batch size:    16 | lm loss: 6.176955E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2559/  128728 | consumed samples:        40944 | consumed tokens:     83853312 | elapsed time per iteration (s): 15.22 | learning rate: 1.342E-05 | global batch size:    16 | lm loss: 6.113753E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2560/  128728 | consumed samples:        40960 | consumed tokens:     83886080 | elapsed time per iteration (s): 15.23 | learning rate: 1.342E-05 | global batch size:    16 | lm loss: 6.127898E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2561/  128728 | consumed samples:        40976 | consumed tokens:     83918848 | elapsed time per iteration (s): 15.21 | learning rate: 1.343E-05 | global batch size:    16 | lm loss: 5.811277E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2562/  128728 | consumed samples:        40992 | consumed tokens:     83951616 | elapsed time per iteration (s): 15.25 | learning rate: 1.343E-05 | global batch size:    16 | lm loss: 5.929381E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2563/  128728 | consumed samples:        41008 | consumed tokens:     83984384 | elapsed time per iteration (s): 15.28 | learning rate: 1.344E-05 | global batch size:    16 | lm loss: 6.315426E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     2564/  128728 | consumed samples:        41024 | consumed tokens:     84017152 | elapsed time per iteration (s): 15.23 | learning rate: 1.344E-05 | global batch size:    16 | lm loss: 6.300920E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2565/  128728 | consumed samples:        41040 | consumed tokens:     84049920 | elapsed time per iteration (s): 15.24 | learning rate: 1.345E-05 | global batch size:    16 | lm loss: 6.197340E+00 | grad norm: 1.105 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2566/  128728 | consumed samples:        41056 | consumed tokens:     84082688 | elapsed time per iteration (s): 15.21 | learning rate: 1.345E-05 | global batch size:    16 | lm loss: 6.466086E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2567/  128728 | consumed samples:        41072 | consumed tokens:     84115456 | elapsed time per iteration (s): 15.21 | learning rate: 1.346E-05 | global batch size:    16 | lm loss: 6.310643E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2568/  128728 | consumed samples:        41088 | consumed tokens:     84148224 | elapsed time per iteration (s): 15.22 | learning rate: 1.346E-05 | global batch size:    16 | lm loss: 5.983983E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2569/  128728 | consumed samples:        41104 | consumed tokens:     84180992 | elapsed time per iteration (s): 15.25 | learning rate: 1.347E-05 | global batch size:    16 | lm loss: 6.170852E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2570/  128728 | consumed samples:        41120 | consumed tokens:     84213760 | elapsed time per iteration (s): 15.22 | learning rate: 1.347E-05 | global batch size:    16 | lm loss: 6.011130E+00 | grad norm: 1.987 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2571/  128728 | consumed samples:        41136 | consumed tokens:     84246528 | elapsed time per iteration (s): 15.28 | learning rate: 1.348E-05 | global batch size:    16 | lm loss: 6.224883E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     2572/  128728 | consumed samples:        41152 | consumed tokens:     84279296 | elapsed time per iteration (s): 15.23 | learning rate: 1.348E-05 | global batch size:    16 | lm loss: 6.098954E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2573/  128728 | consumed samples:        41168 | consumed tokens:     84312064 | elapsed time per iteration (s): 15.24 | learning rate: 1.349E-05 | global batch size:    16 | lm loss: 6.223440E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2574/  128728 | consumed samples:        41184 | consumed tokens:     84344832 | elapsed time per iteration (s): 15.23 | learning rate: 1.350E-05 | global batch size:    16 | lm loss: 6.194892E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2575/  128728 | consumed samples:        41200 | consumed tokens:     84377600 | elapsed time per iteration (s): 15.25 | learning rate: 1.350E-05 | global batch size:    16 | lm loss: 5.966484E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2576/  128728 | consumed samples:        41216 | consumed tokens:     84410368 | elapsed time per iteration (s): 15.17 | learning rate: 1.351E-05 | global batch size:    16 | lm loss: 6.128808E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2577/  128728 | consumed samples:        41232 | consumed tokens:     84443136 | elapsed time per iteration (s): 15.22 | learning rate: 1.351E-05 | global batch size:    16 | lm loss: 6.020306E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2578/  128728 | consumed samples:        41248 | consumed tokens:     84475904 | elapsed time per iteration (s): 15.24 | learning rate: 1.352E-05 | global batch size:    16 | lm loss: 6.008765E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2579/  128728 | consumed samples:        41264 | consumed tokens:     84508672 | elapsed time per iteration (s): 15.23 | learning rate: 1.352E-05 | global batch size:    16 | lm loss: 6.182366E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2580/  128728 | consumed samples:        41280 | consumed tokens:     84541440 | elapsed time per iteration (s): 15.19 | learning rate: 1.353E-05 | global batch size:    16 | lm loss: 6.212614E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2581/  128728 | consumed samples:        41296 | consumed tokens:     84574208 | elapsed time per iteration (s): 15.20 | learning rate: 1.353E-05 | global batch size:    16 | lm loss: 6.101485E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2582/  128728 | consumed samples:        41312 | consumed tokens:     84606976 | elapsed time per iteration (s): 15.28 | learning rate: 1.354E-05 | global batch size:    16 | lm loss: 5.973782E+00 | grad norm: 1.128 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     2583/  128728 | consumed samples:        41328 | consumed tokens:     84639744 | elapsed time per iteration (s): 15.15 | learning rate: 1.354E-05 | global batch size:    16 | lm loss: 6.166084E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2584/  128728 | consumed samples:        41344 | consumed tokens:     84672512 | elapsed time per iteration (s): 15.62 | learning rate: 1.355E-05 | global batch size:    16 | lm loss: 6.170146E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.024 | TFLOPs: 7.84 |
[default7]: iteration     2585/  128728 | consumed samples:        41360 | consumed tokens:     84705280 | elapsed time per iteration (s): 15.23 | learning rate: 1.355E-05 | global batch size:    16 | lm loss: 6.140182E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2586/  128728 | consumed samples:        41376 | consumed tokens:     84738048 | elapsed time per iteration (s): 15.19 | learning rate: 1.356E-05 | global batch size:    16 | lm loss: 6.219534E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2587/  128728 | consumed samples:        41392 | consumed tokens:     84770816 | elapsed time per iteration (s): 15.23 | learning rate: 1.356E-05 | global batch size:    16 | lm loss: 6.216126E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2588/  128728 | consumed samples:        41408 | consumed tokens:     84803584 | elapsed time per iteration (s): 15.26 | learning rate: 1.357E-05 | global batch size:    16 | lm loss: 6.328304E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2589/  128728 | consumed samples:        41424 | consumed tokens:     84836352 | elapsed time per iteration (s): 14.78 | learning rate: 1.357E-05 | global batch size:    16 | lm loss: 6.339481E+00 | grad norm: 7.573 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.082 | TFLOPs: 8.29 |
[default7]: iteration     2590/  128728 | consumed samples:        41440 | consumed tokens:     84869120 | elapsed time per iteration (s): 15.68 | learning rate: 1.358E-05 | global batch size:    16 | lm loss: 6.163667E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.020 | TFLOPs: 7.81 |
[default7]: iteration     2591/  128728 | consumed samples:        41456 | consumed tokens:     84901888 | elapsed time per iteration (s): 15.23 | learning rate: 1.358E-05 | global batch size:    16 | lm loss: 6.277731E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2592/  128728 | consumed samples:        41472 | consumed tokens:     84934656 | elapsed time per iteration (s): 15.20 | learning rate: 1.359E-05 | global batch size:    16 | lm loss: 6.341899E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2593/  128728 | consumed samples:        41488 | consumed tokens:     84967424 | elapsed time per iteration (s): 15.21 | learning rate: 1.359E-05 | global batch size:    16 | lm loss: 6.278369E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2594/  128728 | consumed samples:        41504 | consumed tokens:     85000192 | elapsed time per iteration (s): 15.18 | learning rate: 1.360E-05 | global batch size:    16 | lm loss: 6.110337E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2595/  128728 | consumed samples:        41520 | consumed tokens:     85032960 | elapsed time per iteration (s): 15.15 | learning rate: 1.361E-05 | global batch size:    16 | lm loss: 6.144503E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2596/  128728 | consumed samples:        41536 | consumed tokens:     85065728 | elapsed time per iteration (s): 15.23 | learning rate: 1.361E-05 | global batch size:    16 | lm loss: 5.943759E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2597/  128728 | consumed samples:        41552 | consumed tokens:     85098496 | elapsed time per iteration (s): 15.04 | learning rate: 1.362E-05 | global batch size:    16 | lm loss: 6.147716E+00 | grad norm: 1.372 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.064 | TFLOPs: 8.14 |
[default7]: iteration     2598/  128728 | consumed samples:        41568 | consumed tokens:     85131264 | elapsed time per iteration (s): 14.94 | learning rate: 1.362E-05 | global batch size:    16 | lm loss: 6.098436E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.071 | TFLOPs: 8.20 |
[default7]: iteration     2599/  128728 | consumed samples:        41584 | consumed tokens:     85164032 | elapsed time per iteration (s): 15.26 | learning rate: 1.363E-05 | global batch size:    16 | lm loss: 6.340265E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2600/  128728 | consumed samples:        41600 | consumed tokens:     85196800 | elapsed time per iteration (s): 15.19 | learning rate: 1.363E-05 | global batch size:    16 | lm loss: 6.089229E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2601/  128728 | consumed samples:        41616 | consumed tokens:     85229568 | elapsed time per iteration (s): 15.26 | learning rate: 1.364E-05 | global batch size:    16 | lm loss: 6.258206E+00 | grad norm: 0.967 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2602/  128728 | consumed samples:        41632 | consumed tokens:     85262336 | elapsed time per iteration (s): 15.24 | learning rate: 1.364E-05 | global batch size:    16 | lm loss: 6.186719E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2603/  128728 | consumed samples:        41648 | consumed tokens:     85295104 | elapsed time per iteration (s): 15.21 | learning rate: 1.365E-05 | global batch size:    16 | lm loss: 6.095049E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2604/  128728 | consumed samples:        41664 | consumed tokens:     85327872 | elapsed time per iteration (s): 15.23 | learning rate: 1.365E-05 | global batch size:    16 | lm loss: 6.124999E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2605/  128728 | consumed samples:        41680 | consumed tokens:     85360640 | elapsed time per iteration (s): 15.23 | learning rate: 1.366E-05 | global batch size:    16 | lm loss: 5.955814E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2606/  128728 | consumed samples:        41696 | consumed tokens:     85393408 | elapsed time per iteration (s): 15.22 | learning rate: 1.366E-05 | global batch size:    16 | lm loss: 5.977965E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2607/  128728 | consumed samples:        41712 | consumed tokens:     85426176 | elapsed time per iteration (s): 15.22 | learning rate: 1.367E-05 | global batch size:    16 | lm loss: 6.389388E+00 | grad norm: 0.888 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2608/  128728 | consumed samples:        41728 | consumed tokens:     85458944 | elapsed time per iteration (s): 15.13 | learning rate: 1.367E-05 | global batch size:    16 | lm loss: 5.978179E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.058 | TFLOPs: 8.10 |
[default7]: iteration     2609/  128728 | consumed samples:        41744 | consumed tokens:     85491712 | elapsed time per iteration (s): 15.16 | learning rate: 1.368E-05 | global batch size:    16 | lm loss: 6.004305E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2610/  128728 | consumed samples:        41760 | consumed tokens:     85524480 | elapsed time per iteration (s): 15.21 | learning rate: 1.368E-05 | global batch size:    16 | lm loss: 6.251733E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2611/  128728 | consumed samples:        41776 | consumed tokens:     85557248 | elapsed time per iteration (s): 15.21 | learning rate: 1.369E-05 | global batch size:    16 | lm loss: 6.239475E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2612/  128728 | consumed samples:        41792 | consumed tokens:     85590016 | elapsed time per iteration (s): 15.23 | learning rate: 1.369E-05 | global batch size:    16 | lm loss: 6.155906E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2613/  128728 | consumed samples:        41808 | consumed tokens:     85622784 | elapsed time per iteration (s): 15.22 | learning rate: 1.370E-05 | global batch size:    16 | lm loss: 6.004362E+00 | grad norm: 1.437 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2614/  128728 | consumed samples:        41824 | consumed tokens:     85655552 | elapsed time per iteration (s): 15.23 | learning rate: 1.370E-05 | global batch size:    16 | lm loss: 6.035411E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2615/  128728 | consumed samples:        41840 | consumed tokens:     85688320 | elapsed time per iteration (s): 15.22 | learning rate: 1.371E-05 | global batch size:    16 | lm loss: 6.144122E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2616/  128728 | consumed samples:        41856 | consumed tokens:     85721088 | elapsed time per iteration (s): 15.19 | learning rate: 1.372E-05 | global batch size:    16 | lm loss: 5.992271E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2617/  128728 | consumed samples:        41872 | consumed tokens:     85753856 | elapsed time per iteration (s): 15.15 | learning rate: 1.372E-05 | global batch size:    16 | lm loss: 6.198568E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2618/  128728 | consumed samples:        41888 | consumed tokens:     85786624 | elapsed time per iteration (s): 15.26 | learning rate: 1.373E-05 | global batch size:    16 | lm loss: 6.315387E+00 | grad norm: 1.224 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2619/  128728 | consumed samples:        41904 | consumed tokens:     85819392 | elapsed time per iteration (s): 15.21 | learning rate: 1.373E-05 | global batch size:    16 | lm loss: 6.252526E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2620/  128728 | consumed samples:        41920 | consumed tokens:     85852160 | elapsed time per iteration (s): 15.26 | learning rate: 1.374E-05 | global batch size:    16 | lm loss: 6.126781E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2621/  128728 | consumed samples:        41936 | consumed tokens:     85884928 | elapsed time per iteration (s): 15.24 | learning rate: 1.374E-05 | global batch size:    16 | lm loss: 6.050614E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2622/  128728 | consumed samples:        41952 | consumed tokens:     85917696 | elapsed time per iteration (s): 15.22 | learning rate: 1.375E-05 | global batch size:    16 | lm loss: 6.248853E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2623/  128728 | consumed samples:        41968 | consumed tokens:     85950464 | elapsed time per iteration (s): 15.24 | learning rate: 1.375E-05 | global batch size:    16 | lm loss: 5.849868E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2624/  128728 | consumed samples:        41984 | consumed tokens:     85983232 | elapsed time per iteration (s): 15.21 | learning rate: 1.376E-05 | global batch size:    16 | lm loss: 6.024261E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2625/  128728 | consumed samples:        42000 | consumed tokens:     86016000 | elapsed time per iteration (s): 15.21 | learning rate: 1.376E-05 | global batch size:    16 | lm loss: 6.284721E+00 | grad norm: 0.891 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2626/  128728 | consumed samples:        42016 | consumed tokens:     86048768 | elapsed time per iteration (s): 15.21 | learning rate: 1.377E-05 | global batch size:    16 | lm loss: 6.214346E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2627/  128728 | consumed samples:        42032 | consumed tokens:     86081536 | elapsed time per iteration (s): 15.23 | learning rate: 1.377E-05 | global batch size:    16 | lm loss: 6.019969E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2628/  128728 | consumed samples:        42048 | consumed tokens:     86114304 | elapsed time per iteration (s): 15.23 | learning rate: 1.378E-05 | global batch size:    16 | lm loss: 6.116952E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2629/  128728 | consumed samples:        42064 | consumed tokens:     86147072 | elapsed time per iteration (s): 15.28 | learning rate: 1.378E-05 | global batch size:    16 | lm loss: 6.207554E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     2630/  128728 | consumed samples:        42080 | consumed tokens:     86179840 | elapsed time per iteration (s): 15.17 | learning rate: 1.379E-05 | global batch size:    16 | lm loss: 6.012637E+00 | grad norm: 0.908 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2631/  128728 | consumed samples:        42096 | consumed tokens:     86212608 | elapsed time per iteration (s): 15.21 | learning rate: 1.379E-05 | global batch size:    16 | lm loss: 6.151033E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2632/  128728 | consumed samples:        42112 | consumed tokens:     86245376 | elapsed time per iteration (s): 15.25 | learning rate: 1.380E-05 | global batch size:    16 | lm loss: 5.952607E+00 | grad norm: 1.420 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2633/  128728 | consumed samples:        42128 | consumed tokens:     86278144 | elapsed time per iteration (s): 15.21 | learning rate: 1.380E-05 | global batch size:    16 | lm loss: 6.403392E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2634/  128728 | consumed samples:        42144 | consumed tokens:     86310912 | elapsed time per iteration (s): 15.24 | learning rate: 1.381E-05 | global batch size:    16 | lm loss: 6.004301E+00 | grad norm: 0.635 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2635/  128728 | consumed samples:        42160 | consumed tokens:     86343680 | elapsed time per iteration (s): 15.23 | learning rate: 1.382E-05 | global batch size:    16 | lm loss: 6.076480E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2636/  128728 | consumed samples:        42176 | consumed tokens:     86376448 | elapsed time per iteration (s): 15.20 | learning rate: 1.382E-05 | global batch size:    16 | lm loss: 5.914468E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2637/  128728 | consumed samples:        42192 | consumed tokens:     86409216 | elapsed time per iteration (s): 15.24 | learning rate: 1.383E-05 | global batch size:    16 | lm loss: 6.257906E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2638/  128728 | consumed samples:        42208 | consumed tokens:     86441984 | elapsed time per iteration (s): 15.23 | learning rate: 1.383E-05 | global batch size:    16 | lm loss: 6.093236E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2639/  128728 | consumed samples:        42224 | consumed tokens:     86474752 | elapsed time per iteration (s): 15.23 | learning rate: 1.384E-05 | global batch size:    16 | lm loss: 6.195016E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2640/  128728 | consumed samples:        42240 | consumed tokens:     86507520 | elapsed time per iteration (s): 15.22 | learning rate: 1.384E-05 | global batch size:    16 | lm loss: 6.247684E+00 | grad norm: 0.985 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2641/  128728 | consumed samples:        42256 | consumed tokens:     86540288 | elapsed time per iteration (s): 15.24 | learning rate: 1.385E-05 | global batch size:    16 | lm loss: 6.176681E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2642/  128728 | consumed samples:        42272 | consumed tokens:     86573056 | elapsed time per iteration (s): 15.16 | learning rate: 1.385E-05 | global batch size:    16 | lm loss: 6.208982E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2643/  128728 | consumed samples:        42288 | consumed tokens:     86605824 | elapsed time per iteration (s): 15.21 | learning rate: 1.386E-05 | global batch size:    16 | lm loss: 5.945809E+00 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2644/  128728 | consumed samples:        42304 | consumed tokens:     86638592 | elapsed time per iteration (s): 15.24 | learning rate: 1.386E-05 | global batch size:    16 | lm loss: 6.031917E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2645/  128728 | consumed samples:        42320 | consumed tokens:     86671360 | elapsed time per iteration (s): 15.24 | learning rate: 1.387E-05 | global batch size:    16 | lm loss: 6.110291E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2646/  128728 | consumed samples:        42336 | consumed tokens:     86704128 | elapsed time per iteration (s): 15.20 | learning rate: 1.387E-05 | global batch size:    16 | lm loss: 6.099847E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2647/  128728 | consumed samples:        42352 | consumed tokens:     86736896 | elapsed time per iteration (s): 15.21 | learning rate: 1.388E-05 | global batch size:    16 | lm loss: 5.954161E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2648/  128728 | consumed samples:        42368 | consumed tokens:     86769664 | elapsed time per iteration (s): 15.20 | learning rate: 1.388E-05 | global batch size:    16 | lm loss: 6.157164E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2649/  128728 | consumed samples:        42384 | consumed tokens:     86802432 | elapsed time per iteration (s): 15.19 | learning rate: 1.389E-05 | global batch size:    16 | lm loss: 6.200693E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2650/  128728 | consumed samples:        42400 | consumed tokens:     86835200 | elapsed time per iteration (s): 15.19 | learning rate: 1.389E-05 | global batch size:    16 | lm loss: 6.027765E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2651/  128728 | consumed samples:        42416 | consumed tokens:     86867968 | elapsed time per iteration (s): 15.21 | learning rate: 1.390E-05 | global batch size:    16 | lm loss: 6.053953E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2652/  128728 | consumed samples:        42432 | consumed tokens:     86900736 | elapsed time per iteration (s): 15.22 | learning rate: 1.390E-05 | global batch size:    16 | lm loss: 6.035723E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2653/  128728 | consumed samples:        42448 | consumed tokens:     86933504 | elapsed time per iteration (s): 15.24 | learning rate: 1.391E-05 | global batch size:    16 | lm loss: 6.109938E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2654/  128728 | consumed samples:        42464 | consumed tokens:     86966272 | elapsed time per iteration (s): 15.24 | learning rate: 1.391E-05 | global batch size:    16 | lm loss: 6.273432E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2655/  128728 | consumed samples:        42480 | consumed tokens:     86999040 | elapsed time per iteration (s): 15.23 | learning rate: 1.392E-05 | global batch size:    16 | lm loss: 6.122913E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2656/  128728 | consumed samples:        42496 | consumed tokens:     87031808 | elapsed time per iteration (s): 15.21 | learning rate: 1.393E-05 | global batch size:    16 | lm loss: 6.010287E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2657/  128728 | consumed samples:        42512 | consumed tokens:     87064576 | elapsed time per iteration (s): 15.26 | learning rate: 1.393E-05 | global batch size:    16 | lm loss: 6.254018E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2658/  128728 | consumed samples:        42528 | consumed tokens:     87097344 | elapsed time per iteration (s): 15.21 | learning rate: 1.394E-05 | global batch size:    16 | lm loss: 6.063126E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2659/  128728 | consumed samples:        42544 | consumed tokens:     87130112 | elapsed time per iteration (s): 15.25 | learning rate: 1.394E-05 | global batch size:    16 | lm loss: 6.102057E+00 | grad norm: 1.325 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2660/  128728 | consumed samples:        42560 | consumed tokens:     87162880 | elapsed time per iteration (s): 15.17 | learning rate: 1.395E-05 | global batch size:    16 | lm loss: 5.900523E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2661/  128728 | consumed samples:        42576 | consumed tokens:     87195648 | elapsed time per iteration (s): 15.27 | learning rate: 1.395E-05 | global batch size:    16 | lm loss: 6.192137E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     2662/  128728 | consumed samples:        42592 | consumed tokens:     87228416 | elapsed time per iteration (s): 15.19 | learning rate: 1.396E-05 | global batch size:    16 | lm loss: 6.078798E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2663/  128728 | consumed samples:        42608 | consumed tokens:     87261184 | elapsed time per iteration (s): 15.14 | learning rate: 1.396E-05 | global batch size:    16 | lm loss: 6.099391E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2664/  128728 | consumed samples:        42624 | consumed tokens:     87293952 | elapsed time per iteration (s): 15.22 | learning rate: 1.397E-05 | global batch size:    16 | lm loss: 6.029523E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2665/  128728 | consumed samples:        42640 | consumed tokens:     87326720 | elapsed time per iteration (s): 15.27 | learning rate: 1.397E-05 | global batch size:    16 | lm loss: 5.808035E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     2666/  128728 | consumed samples:        42656 | consumed tokens:     87359488 | elapsed time per iteration (s): 15.24 | learning rate: 1.398E-05 | global batch size:    16 | lm loss: 6.260103E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2667/  128728 | consumed samples:        42672 | consumed tokens:     87392256 | elapsed time per iteration (s): 15.18 | learning rate: 1.398E-05 | global batch size:    16 | lm loss: 6.125736E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2668/  128728 | consumed samples:        42688 | consumed tokens:     87425024 | elapsed time per iteration (s): 15.23 | learning rate: 1.399E-05 | global batch size:    16 | lm loss: 5.999493E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2669/  128728 | consumed samples:        42704 | consumed tokens:     87457792 | elapsed time per iteration (s): 15.23 | learning rate: 1.399E-05 | global batch size:    16 | lm loss: 6.127405E+00 | grad norm: 0.638 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2670/  128728 | consumed samples:        42720 | consumed tokens:     87490560 | elapsed time per iteration (s): 15.26 | learning rate: 1.400E-05 | global batch size:    16 | lm loss: 6.203554E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2671/  128728 | consumed samples:        42736 | consumed tokens:     87523328 | elapsed time per iteration (s): 15.26 | learning rate: 1.400E-05 | global batch size:    16 | lm loss: 6.156468E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2672/  128728 | consumed samples:        42752 | consumed tokens:     87556096 | elapsed time per iteration (s): 15.21 | learning rate: 1.401E-05 | global batch size:    16 | lm loss: 6.088578E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2673/  128728 | consumed samples:        42768 | consumed tokens:     87588864 | elapsed time per iteration (s): 15.22 | learning rate: 1.401E-05 | global batch size:    16 | lm loss: 6.113354E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2674/  128728 | consumed samples:        42784 | consumed tokens:     87621632 | elapsed time per iteration (s): 15.27 | learning rate: 1.402E-05 | global batch size:    16 | lm loss: 6.172616E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     2675/  128728 | consumed samples:        42800 | consumed tokens:     87654400 | elapsed time per iteration (s): 15.22 | learning rate: 1.402E-05 | global batch size:    16 | lm loss: 6.198242E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2676/  128728 | consumed samples:        42816 | consumed tokens:     87687168 | elapsed time per iteration (s): 15.25 | learning rate: 1.403E-05 | global batch size:    16 | lm loss: 5.941981E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2677/  128728 | consumed samples:        42832 | consumed tokens:     87719936 | elapsed time per iteration (s): 15.23 | learning rate: 1.404E-05 | global batch size:    16 | lm loss: 5.984716E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2678/  128728 | consumed samples:        42848 | consumed tokens:     87752704 | elapsed time per iteration (s): 15.22 | learning rate: 1.404E-05 | global batch size:    16 | lm loss: 6.288304E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2679/  128728 | consumed samples:        42864 | consumed tokens:     87785472 | elapsed time per iteration (s): 15.26 | learning rate: 1.405E-05 | global batch size:    16 | lm loss: 5.836905E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2680/  128728 | consumed samples:        42880 | consumed tokens:     87818240 | elapsed time per iteration (s): 15.22 | learning rate: 1.405E-05 | global batch size:    16 | lm loss: 5.946983E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2681/  128728 | consumed samples:        42896 | consumed tokens:     87851008 | elapsed time per iteration (s): 15.26 | learning rate: 1.406E-05 | global batch size:    16 | lm loss: 5.952541E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2682/  128728 | consumed samples:        42912 | consumed tokens:     87883776 | elapsed time per iteration (s): 15.20 | learning rate: 1.406E-05 | global batch size:    16 | lm loss: 6.209697E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2683/  128728 | consumed samples:        42928 | consumed tokens:     87916544 | elapsed time per iteration (s): 15.22 | learning rate: 1.407E-05 | global batch size:    16 | lm loss: 6.055411E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2684/  128728 | consumed samples:        42944 | consumed tokens:     87949312 | elapsed time per iteration (s): 15.18 | learning rate: 1.407E-05 | global batch size:    16 | lm loss: 6.116272E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2685/  128728 | consumed samples:        42960 | consumed tokens:     87982080 | elapsed time per iteration (s): 15.20 | learning rate: 1.408E-05 | global batch size:    16 | lm loss: 6.151689E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2686/  128728 | consumed samples:        42976 | consumed tokens:     88014848 | elapsed time per iteration (s): 15.21 | learning rate: 1.408E-05 | global batch size:    16 | lm loss: 6.154226E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2687/  128728 | consumed samples:        42992 | consumed tokens:     88047616 | elapsed time per iteration (s): 15.26 | learning rate: 1.409E-05 | global batch size:    16 | lm loss: 5.946739E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2688/  128728 | consumed samples:        43008 | consumed tokens:     88080384 | elapsed time per iteration (s): 15.23 | learning rate: 1.409E-05 | global batch size:    16 | lm loss: 5.993872E+00 | grad norm: 1.039 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2689/  128728 | consumed samples:        43024 | consumed tokens:     88113152 | elapsed time per iteration (s): 15.25 | learning rate: 1.410E-05 | global batch size:    16 | lm loss: 6.235291E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     2690/  128728 | consumed samples:        43040 | consumed tokens:     88145920 | elapsed time per iteration (s): 15.23 | learning rate: 1.410E-05 | global batch size:    16 | lm loss: 6.016863E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2691/  128728 | consumed samples:        43056 | consumed tokens:     88178688 | elapsed time per iteration (s): 15.21 | learning rate: 1.411E-05 | global batch size:    16 | lm loss: 6.055921E+00 | grad norm: 0.923 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2692/  128728 | consumed samples:        43072 | consumed tokens:     88211456 | elapsed time per iteration (s): 15.29 | learning rate: 1.411E-05 | global batch size:    16 | lm loss: 6.147404E+00 | grad norm: 1.075 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration     2693/  128728 | consumed samples:        43088 | consumed tokens:     88244224 | elapsed time per iteration (s): 15.25 | learning rate: 1.412E-05 | global batch size:    16 | lm loss: 6.023476E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2694/  128728 | consumed samples:        43104 | consumed tokens:     88276992 | elapsed time per iteration (s): 15.21 | learning rate: 1.412E-05 | global batch size:    16 | lm loss: 6.106614E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2695/  128728 | consumed samples:        43120 | consumed tokens:     88309760 | elapsed time per iteration (s): 15.19 | learning rate: 1.413E-05 | global batch size:    16 | lm loss: 6.147112E+00 | grad norm: 0.985 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2696/  128728 | consumed samples:        43136 | consumed tokens:     88342528 | elapsed time per iteration (s): 15.26 | learning rate: 1.413E-05 | global batch size:    16 | lm loss: 6.314603E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2697/  128728 | consumed samples:        43152 | consumed tokens:     88375296 | elapsed time per iteration (s): 15.22 | learning rate: 1.414E-05 | global batch size:    16 | lm loss: 6.222948E+00 | grad norm: 0.918 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2698/  128728 | consumed samples:        43168 | consumed tokens:     88408064 | elapsed time per iteration (s): 15.22 | learning rate: 1.415E-05 | global batch size:    16 | lm loss: 6.301141E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2699/  128728 | consumed samples:        43184 | consumed tokens:     88440832 | elapsed time per iteration (s): 15.21 | learning rate: 1.415E-05 | global batch size:    16 | lm loss: 6.171608E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2700/  128728 | consumed samples:        43200 | consumed tokens:     88473600 | elapsed time per iteration (s): 15.23 | learning rate: 1.416E-05 | global batch size:    16 | lm loss: 6.046288E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2701/  128728 | consumed samples:        43216 | consumed tokens:     88506368 | elapsed time per iteration (s): 15.23 | learning rate: 1.416E-05 | global batch size:    16 | lm loss: 6.242400E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2702/  128728 | consumed samples:        43232 | consumed tokens:     88539136 | elapsed time per iteration (s): 15.18 | learning rate: 1.417E-05 | global batch size:    16 | lm loss: 6.125425E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2703/  128728 | consumed samples:        43248 | consumed tokens:     88571904 | elapsed time per iteration (s): 15.22 | learning rate: 1.417E-05 | global batch size:    16 | lm loss: 6.006852E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2704/  128728 | consumed samples:        43264 | consumed tokens:     88604672 | elapsed time per iteration (s): 15.23 | learning rate: 1.418E-05 | global batch size:    16 | lm loss: 5.967431E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2705/  128728 | consumed samples:        43280 | consumed tokens:     88637440 | elapsed time per iteration (s): 15.19 | learning rate: 1.418E-05 | global batch size:    16 | lm loss: 5.898385E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2706/  128728 | consumed samples:        43296 | consumed tokens:     88670208 | elapsed time per iteration (s): 15.25 | learning rate: 1.419E-05 | global batch size:    16 | lm loss: 6.021958E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2707/  128728 | consumed samples:        43312 | consumed tokens:     88702976 | elapsed time per iteration (s): 15.20 | learning rate: 1.419E-05 | global batch size:    16 | lm loss: 6.094849E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2708/  128728 | consumed samples:        43328 | consumed tokens:     88735744 | elapsed time per iteration (s): 15.22 | learning rate: 1.420E-05 | global batch size:    16 | lm loss: 6.081100E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2709/  128728 | consumed samples:        43344 | consumed tokens:     88768512 | elapsed time per iteration (s): 15.22 | learning rate: 1.420E-05 | global batch size:    16 | lm loss: 6.196400E+00 | grad norm: 0.849 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2710/  128728 | consumed samples:        43360 | consumed tokens:     88801280 | elapsed time per iteration (s): 15.21 | learning rate: 1.421E-05 | global batch size:    16 | lm loss: 5.977609E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2711/  128728 | consumed samples:        43376 | consumed tokens:     88834048 | elapsed time per iteration (s): 15.21 | learning rate: 1.421E-05 | global batch size:    16 | lm loss: 6.154242E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2712/  128728 | consumed samples:        43392 | consumed tokens:     88866816 | elapsed time per iteration (s): 15.20 | learning rate: 1.422E-05 | global batch size:    16 | lm loss: 6.021749E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2713/  128728 | consumed samples:        43408 | consumed tokens:     88899584 | elapsed time per iteration (s): 15.19 | learning rate: 1.422E-05 | global batch size:    16 | lm loss: 6.182495E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2714/  128728 | consumed samples:        43424 | consumed tokens:     88932352 | elapsed time per iteration (s): 15.21 | learning rate: 1.423E-05 | global batch size:    16 | lm loss: 5.947534E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2715/  128728 | consumed samples:        43440 | consumed tokens:     88965120 | elapsed time per iteration (s): 15.24 | learning rate: 1.423E-05 | global batch size:    16 | lm loss: 5.916839E+00 | grad norm: 0.629 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2716/  128728 | consumed samples:        43456 | consumed tokens:     88997888 | elapsed time per iteration (s): 15.21 | learning rate: 1.424E-05 | global batch size:    16 | lm loss: 5.961128E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2717/  128728 | consumed samples:        43472 | consumed tokens:     89030656 | elapsed time per iteration (s): 15.27 | learning rate: 1.424E-05 | global batch size:    16 | lm loss: 6.250119E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     2718/  128728 | consumed samples:        43488 | consumed tokens:     89063424 | elapsed time per iteration (s): 15.19 | learning rate: 1.425E-05 | global batch size:    16 | lm loss: 6.063711E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2719/  128728 | consumed samples:        43504 | consumed tokens:     89096192 | elapsed time per iteration (s): 15.25 | learning rate: 1.426E-05 | global batch size:    16 | lm loss: 5.790985E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2720/  128728 | consumed samples:        43520 | consumed tokens:     89128960 | elapsed time per iteration (s): 15.22 | learning rate: 1.426E-05 | global batch size:    16 | lm loss: 6.230259E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2721/  128728 | consumed samples:        43536 | consumed tokens:     89161728 | elapsed time per iteration (s): 15.20 | learning rate: 1.427E-05 | global batch size:    16 | lm loss: 6.079679E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2722/  128728 | consumed samples:        43552 | consumed tokens:     89194496 | elapsed time per iteration (s): 15.22 | learning rate: 1.427E-05 | global batch size:    16 | lm loss: 6.003428E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2723/  128728 | consumed samples:        43568 | consumed tokens:     89227264 | elapsed time per iteration (s): 15.17 | learning rate: 1.428E-05 | global batch size:    16 | lm loss: 6.202793E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2724/  128728 | consumed samples:        43584 | consumed tokens:     89260032 | elapsed time per iteration (s): 15.22 | learning rate: 1.428E-05 | global batch size:    16 | lm loss: 5.948997E+00 | grad norm: 0.636 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2725/  128728 | consumed samples:        43600 | consumed tokens:     89292800 | elapsed time per iteration (s): 15.17 | learning rate: 1.429E-05 | global batch size:    16 | lm loss: 6.143008E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2726/  128728 | consumed samples:        43616 | consumed tokens:     89325568 | elapsed time per iteration (s): 15.23 | learning rate: 1.429E-05 | global batch size:    16 | lm loss: 6.032366E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2727/  128728 | consumed samples:        43632 | consumed tokens:     89358336 | elapsed time per iteration (s): 15.13 | learning rate: 1.430E-05 | global batch size:    16 | lm loss: 6.206609E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     2728/  128728 | consumed samples:        43648 | consumed tokens:     89391104 | elapsed time per iteration (s): 15.15 | learning rate: 1.430E-05 | global batch size:    16 | lm loss: 5.929503E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2729/  128728 | consumed samples:        43664 | consumed tokens:     89423872 | elapsed time per iteration (s): 15.19 | learning rate: 1.431E-05 | global batch size:    16 | lm loss: 6.076304E+00 | grad norm: 0.654 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2730/  128728 | consumed samples:        43680 | consumed tokens:     89456640 | elapsed time per iteration (s): 15.15 | learning rate: 1.431E-05 | global batch size:    16 | lm loss: 6.175723E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2731/  128728 | consumed samples:        43696 | consumed tokens:     89489408 | elapsed time per iteration (s): 15.25 | learning rate: 1.432E-05 | global batch size:    16 | lm loss: 6.105374E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2732/  128728 | consumed samples:        43712 | consumed tokens:     89522176 | elapsed time per iteration (s): 15.21 | learning rate: 1.432E-05 | global batch size:    16 | lm loss: 6.372894E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2733/  128728 | consumed samples:        43728 | consumed tokens:     89554944 | elapsed time per iteration (s): 15.23 | learning rate: 1.433E-05 | global batch size:    16 | lm loss: 6.022964E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2734/  128728 | consumed samples:        43744 | consumed tokens:     89587712 | elapsed time per iteration (s): 15.24 | learning rate: 1.433E-05 | global batch size:    16 | lm loss: 5.931406E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2735/  128728 | consumed samples:        43760 | consumed tokens:     89620480 | elapsed time per iteration (s): 15.23 | learning rate: 1.434E-05 | global batch size:    16 | lm loss: 6.318775E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2736/  128728 | consumed samples:        43776 | consumed tokens:     89653248 | elapsed time per iteration (s): 15.20 | learning rate: 1.434E-05 | global batch size:    16 | lm loss: 5.932520E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2737/  128728 | consumed samples:        43792 | consumed tokens:     89686016 | elapsed time per iteration (s): 15.25 | learning rate: 1.435E-05 | global batch size:    16 | lm loss: 5.937093E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2738/  128728 | consumed samples:        43808 | consumed tokens:     89718784 | elapsed time per iteration (s): 15.20 | learning rate: 1.436E-05 | global batch size:    16 | lm loss: 6.135614E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2739/  128728 | consumed samples:        43824 | consumed tokens:     89751552 | elapsed time per iteration (s): 15.24 | learning rate: 1.436E-05 | global batch size:    16 | lm loss: 6.076690E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2740/  128728 | consumed samples:        43840 | consumed tokens:     89784320 | elapsed time per iteration (s): 15.25 | learning rate: 1.437E-05 | global batch size:    16 | lm loss: 5.929999E+00 | grad norm: 0.923 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2741/  128728 | consumed samples:        43856 | consumed tokens:     89817088 | elapsed time per iteration (s): 15.20 | learning rate: 1.437E-05 | global batch size:    16 | lm loss: 6.232322E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2742/  128728 | consumed samples:        43872 | consumed tokens:     89849856 | elapsed time per iteration (s): 15.21 | learning rate: 1.438E-05 | global batch size:    16 | lm loss: 6.364085E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2743/  128728 | consumed samples:        43888 | consumed tokens:     89882624 | elapsed time per iteration (s): 15.26 | learning rate: 1.438E-05 | global batch size:    16 | lm loss: 5.733549E+00 | grad norm: 1.389 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2744/  128728 | consumed samples:        43904 | consumed tokens:     89915392 | elapsed time per iteration (s): 15.21 | learning rate: 1.439E-05 | global batch size:    16 | lm loss: 5.822972E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2745/  128728 | consumed samples:        43920 | consumed tokens:     89948160 | elapsed time per iteration (s): 15.24 | learning rate: 1.439E-05 | global batch size:    16 | lm loss: 6.177995E+00 | grad norm: 0.978 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2746/  128728 | consumed samples:        43936 | consumed tokens:     89980928 | elapsed time per iteration (s): 15.15 | learning rate: 1.440E-05 | global batch size:    16 | lm loss: 6.296174E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     2747/  128728 | consumed samples:        43952 | consumed tokens:     90013696 | elapsed time per iteration (s): 15.23 | learning rate: 1.440E-05 | global batch size:    16 | lm loss: 6.298337E+00 | grad norm: 1.130 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2748/  128728 | consumed samples:        43968 | consumed tokens:     90046464 | elapsed time per iteration (s): 15.25 | learning rate: 1.441E-05 | global batch size:    16 | lm loss: 6.149353E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2749/  128728 | consumed samples:        43984 | consumed tokens:     90079232 | elapsed time per iteration (s): 15.22 | learning rate: 1.441E-05 | global batch size:    16 | lm loss: 5.981178E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2750/  128728 | consumed samples:        44000 | consumed tokens:     90112000 | elapsed time per iteration (s): 15.19 | learning rate: 1.442E-05 | global batch size:    16 | lm loss: 6.031982E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2751/  128728 | consumed samples:        44016 | consumed tokens:     90144768 | elapsed time per iteration (s): 15.16 | learning rate: 1.442E-05 | global batch size:    16 | lm loss: 5.927257E+00 | grad norm: 0.918 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2752/  128728 | consumed samples:        44032 | consumed tokens:     90177536 | elapsed time per iteration (s): 15.23 | learning rate: 1.443E-05 | global batch size:    16 | lm loss: 5.992155E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2753/  128728 | consumed samples:        44048 | consumed tokens:     90210304 | elapsed time per iteration (s): 15.24 | learning rate: 1.443E-05 | global batch size:    16 | lm loss: 6.082148E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2754/  128728 | consumed samples:        44064 | consumed tokens:     90243072 | elapsed time per iteration (s): 15.21 | learning rate: 1.444E-05 | global batch size:    16 | lm loss: 5.980026E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2755/  128728 | consumed samples:        44080 | consumed tokens:     90275840 | elapsed time per iteration (s): 15.21 | learning rate: 1.444E-05 | global batch size:    16 | lm loss: 6.085819E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2756/  128728 | consumed samples:        44096 | consumed tokens:     90308608 | elapsed time per iteration (s): 15.22 | learning rate: 1.445E-05 | global batch size:    16 | lm loss: 6.038049E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2757/  128728 | consumed samples:        44112 | consumed tokens:     90341376 | elapsed time per iteration (s): 15.23 | learning rate: 1.445E-05 | global batch size:    16 | lm loss: 5.992010E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2758/  128728 | consumed samples:        44128 | consumed tokens:     90374144 | elapsed time per iteration (s): 15.21 | learning rate: 1.446E-05 | global batch size:    16 | lm loss: 5.833893E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2759/  128728 | consumed samples:        44144 | consumed tokens:     90406912 | elapsed time per iteration (s): 15.23 | learning rate: 1.447E-05 | global batch size:    16 | lm loss: 6.127007E+00 | grad norm: 0.658 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2760/  128728 | consumed samples:        44160 | consumed tokens:     90439680 | elapsed time per iteration (s): 15.21 | learning rate: 1.447E-05 | global batch size:    16 | lm loss: 6.115055E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2761/  128728 | consumed samples:        44176 | consumed tokens:     90472448 | elapsed time per iteration (s): 15.24 | learning rate: 1.448E-05 | global batch size:    16 | lm loss: 6.209776E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2762/  128728 | consumed samples:        44192 | consumed tokens:     90505216 | elapsed time per iteration (s): 15.20 | learning rate: 1.448E-05 | global batch size:    16 | lm loss: 6.111469E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2763/  128728 | consumed samples:        44208 | consumed tokens:     90537984 | elapsed time per iteration (s): 15.25 | learning rate: 1.449E-05 | global batch size:    16 | lm loss: 6.213965E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2764/  128728 | consumed samples:        44224 | consumed tokens:     90570752 | elapsed time per iteration (s): 15.25 | learning rate: 1.449E-05 | global batch size:    16 | lm loss: 5.969048E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2765/  128728 | consumed samples:        44240 | consumed tokens:     90603520 | elapsed time per iteration (s): 15.24 | learning rate: 1.450E-05 | global batch size:    16 | lm loss: 6.218442E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2766/  128728 | consumed samples:        44256 | consumed tokens:     90636288 | elapsed time per iteration (s): 15.21 | learning rate: 1.450E-05 | global batch size:    16 | lm loss: 6.126570E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2767/  128728 | consumed samples:        44272 | consumed tokens:     90669056 | elapsed time per iteration (s): 15.16 | learning rate: 1.451E-05 | global batch size:    16 | lm loss: 6.056056E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2768/  128728 | consumed samples:        44288 | consumed tokens:     90701824 | elapsed time per iteration (s): 15.24 | learning rate: 1.451E-05 | global batch size:    16 | lm loss: 6.007943E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2769/  128728 | consumed samples:        44304 | consumed tokens:     90734592 | elapsed time per iteration (s): 15.19 | learning rate: 1.452E-05 | global batch size:    16 | lm loss: 5.851771E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2770/  128728 | consumed samples:        44320 | consumed tokens:     90767360 | elapsed time per iteration (s): 15.24 | learning rate: 1.452E-05 | global batch size:    16 | lm loss: 6.106419E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2771/  128728 | consumed samples:        44336 | consumed tokens:     90800128 | elapsed time per iteration (s): 15.22 | learning rate: 1.453E-05 | global batch size:    16 | lm loss: 5.806401E+00 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2772/  128728 | consumed samples:        44352 | consumed tokens:     90832896 | elapsed time per iteration (s): 15.24 | learning rate: 1.453E-05 | global batch size:    16 | lm loss: 6.068120E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2773/  128728 | consumed samples:        44368 | consumed tokens:     90865664 | elapsed time per iteration (s): 15.20 | learning rate: 1.454E-05 | global batch size:    16 | lm loss: 5.843704E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2774/  128728 | consumed samples:        44384 | consumed tokens:     90898432 | elapsed time per iteration (s): 15.19 | learning rate: 1.454E-05 | global batch size:    16 | lm loss: 6.001309E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2775/  128728 | consumed samples:        44400 | consumed tokens:     90931200 | elapsed time per iteration (s): 15.18 | learning rate: 1.455E-05 | global batch size:    16 | lm loss: 6.218292E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2776/  128728 | consumed samples:        44416 | consumed tokens:     90963968 | elapsed time per iteration (s): 15.21 | learning rate: 1.455E-05 | global batch size:    16 | lm loss: 6.178038E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2777/  128728 | consumed samples:        44432 | consumed tokens:     90996736 | elapsed time per iteration (s): 15.15 | learning rate: 1.456E-05 | global batch size:    16 | lm loss: 6.058540E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2778/  128728 | consumed samples:        44448 | consumed tokens:     91029504 | elapsed time per iteration (s): 15.22 | learning rate: 1.456E-05 | global batch size:    16 | lm loss: 6.073587E+00 | grad norm: 0.865 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2779/  128728 | consumed samples:        44464 | consumed tokens:     91062272 | elapsed time per iteration (s): 15.21 | learning rate: 1.457E-05 | global batch size:    16 | lm loss: 6.025464E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2780/  128728 | consumed samples:        44480 | consumed tokens:     91095040 | elapsed time per iteration (s): 15.24 | learning rate: 1.458E-05 | global batch size:    16 | lm loss: 6.045417E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2781/  128728 | consumed samples:        44496 | consumed tokens:     91127808 | elapsed time per iteration (s): 15.22 | learning rate: 1.458E-05 | global batch size:    16 | lm loss: 5.972544E+00 | grad norm: 0.634 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2782/  128728 | consumed samples:        44512 | consumed tokens:     91160576 | elapsed time per iteration (s): 15.22 | learning rate: 1.459E-05 | global batch size:    16 | lm loss: 6.040277E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2783/  128728 | consumed samples:        44528 | consumed tokens:     91193344 | elapsed time per iteration (s): 15.23 | learning rate: 1.459E-05 | global batch size:    16 | lm loss: 6.183329E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2784/  128728 | consumed samples:        44544 | consumed tokens:     91226112 | elapsed time per iteration (s): 15.20 | learning rate: 1.460E-05 | global batch size:    16 | lm loss: 5.996538E+00 | grad norm: 0.598 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2785/  128728 | consumed samples:        44560 | consumed tokens:     91258880 | elapsed time per iteration (s): 15.22 | learning rate: 1.460E-05 | global batch size:    16 | lm loss: 6.052549E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2786/  128728 | consumed samples:        44576 | consumed tokens:     91291648 | elapsed time per iteration (s): 15.24 | learning rate: 1.461E-05 | global batch size:    16 | lm loss: 6.023203E+00 | grad norm: 0.650 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2787/  128728 | consumed samples:        44592 | consumed tokens:     91324416 | elapsed time per iteration (s): 15.23 | learning rate: 1.461E-05 | global batch size:    16 | lm loss: 5.934374E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2788/  128728 | consumed samples:        44608 | consumed tokens:     91357184 | elapsed time per iteration (s): 15.17 | learning rate: 1.462E-05 | global batch size:    16 | lm loss: 5.979886E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     2789/  128728 | consumed samples:        44624 | consumed tokens:     91389952 | elapsed time per iteration (s): 15.20 | learning rate: 1.462E-05 | global batch size:    16 | lm loss: 5.939224E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2790/  128728 | consumed samples:        44640 | consumed tokens:     91422720 | elapsed time per iteration (s): 15.21 | learning rate: 1.463E-05 | global batch size:    16 | lm loss: 6.009098E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2791/  128728 | consumed samples:        44656 | consumed tokens:     91455488 | elapsed time per iteration (s): 15.21 | learning rate: 1.463E-05 | global batch size:    16 | lm loss: 5.886978E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2792/  128728 | consumed samples:        44672 | consumed tokens:     91488256 | elapsed time per iteration (s): 15.18 | learning rate: 1.464E-05 | global batch size:    16 | lm loss: 5.919722E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2793/  128728 | consumed samples:        44688 | consumed tokens:     91521024 | elapsed time per iteration (s): 15.21 | learning rate: 1.464E-05 | global batch size:    16 | lm loss: 5.969708E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2794/  128728 | consumed samples:        44704 | consumed tokens:     91553792 | elapsed time per iteration (s): 15.23 | learning rate: 1.465E-05 | global batch size:    16 | lm loss: 6.022653E+00 | grad norm: 0.881 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2795/  128728 | consumed samples:        44720 | consumed tokens:     91586560 | elapsed time per iteration (s): 15.19 | learning rate: 1.465E-05 | global batch size:    16 | lm loss: 6.179086E+00 | grad norm: 1.187 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2796/  128728 | consumed samples:        44736 | consumed tokens:     91619328 | elapsed time per iteration (s): 15.19 | learning rate: 1.466E-05 | global batch size:    16 | lm loss: 5.982589E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2797/  128728 | consumed samples:        44752 | consumed tokens:     91652096 | elapsed time per iteration (s): 15.21 | learning rate: 1.466E-05 | global batch size:    16 | lm loss: 6.000892E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2798/  128728 | consumed samples:        44768 | consumed tokens:     91684864 | elapsed time per iteration (s): 15.22 | learning rate: 1.467E-05 | global batch size:    16 | lm loss: 6.116832E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2799/  128728 | consumed samples:        44784 | consumed tokens:     91717632 | elapsed time per iteration (s): 15.23 | learning rate: 1.467E-05 | global batch size:    16 | lm loss: 6.036739E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2800/  128728 | consumed samples:        44800 | consumed tokens:     91750400 | elapsed time per iteration (s): 15.29 | learning rate: 1.468E-05 | global batch size:    16 | lm loss: 6.083531E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     2801/  128728 | consumed samples:        44816 | consumed tokens:     91783168 | elapsed time per iteration (s): 15.20 | learning rate: 1.469E-05 | global batch size:    16 | lm loss: 5.965879E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2802/  128728 | consumed samples:        44832 | consumed tokens:     91815936 | elapsed time per iteration (s): 15.24 | learning rate: 1.469E-05 | global batch size:    16 | lm loss: 5.960813E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2803/  128728 | consumed samples:        44848 | consumed tokens:     91848704 | elapsed time per iteration (s): 15.29 | learning rate: 1.470E-05 | global batch size:    16 | lm loss: 6.129034E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration     2804/  128728 | consumed samples:        44864 | consumed tokens:     91881472 | elapsed time per iteration (s): 15.23 | learning rate: 1.470E-05 | global batch size:    16 | lm loss: 6.187738E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2805/  128728 | consumed samples:        44880 | consumed tokens:     91914240 | elapsed time per iteration (s): 15.19 | learning rate: 1.471E-05 | global batch size:    16 | lm loss: 5.718319E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2806/  128728 | consumed samples:        44896 | consumed tokens:     91947008 | elapsed time per iteration (s): 15.25 | learning rate: 1.471E-05 | global batch size:    16 | lm loss: 5.992695E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     2807/  128728 | consumed samples:        44912 | consumed tokens:     91979776 | elapsed time per iteration (s): 15.22 | learning rate: 1.472E-05 | global batch size:    16 | lm loss: 6.170292E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2808/  128728 | consumed samples:        44928 | consumed tokens:     92012544 | elapsed time per iteration (s): 15.26 | learning rate: 1.472E-05 | global batch size:    16 | lm loss: 5.820615E+00 | grad norm: 1.288 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2809/  128728 | consumed samples:        44944 | consumed tokens:     92045312 | elapsed time per iteration (s): 15.29 | learning rate: 1.473E-05 | global batch size:    16 | lm loss: 6.132642E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration     2810/  128728 | consumed samples:        44960 | consumed tokens:     92078080 | elapsed time per iteration (s): 15.23 | learning rate: 1.473E-05 | global batch size:    16 | lm loss: 5.860527E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2811/  128728 | consumed samples:        44976 | consumed tokens:     92110848 | elapsed time per iteration (s): 15.22 | learning rate: 1.474E-05 | global batch size:    16 | lm loss: 6.165506E+00 | grad norm: 1.233 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2812/  128728 | consumed samples:        44992 | consumed tokens:     92143616 | elapsed time per iteration (s): 15.21 | learning rate: 1.474E-05 | global batch size:    16 | lm loss: 6.085719E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2813/  128728 | consumed samples:        45008 | consumed tokens:     92176384 | elapsed time per iteration (s): 15.24 | learning rate: 1.475E-05 | global batch size:    16 | lm loss: 6.115023E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2814/  128728 | consumed samples:        45024 | consumed tokens:     92209152 | elapsed time per iteration (s): 15.21 | learning rate: 1.475E-05 | global batch size:    16 | lm loss: 5.843146E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2815/  128728 | consumed samples:        45040 | consumed tokens:     92241920 | elapsed time per iteration (s): 15.20 | learning rate: 1.476E-05 | global batch size:    16 | lm loss: 5.976727E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2816/  128728 | consumed samples:        45056 | consumed tokens:     92274688 | elapsed time per iteration (s): 15.26 | learning rate: 1.476E-05 | global batch size:    16 | lm loss: 6.070988E+00 | grad norm: 1.025 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2817/  128728 | consumed samples:        45072 | consumed tokens:     92307456 | elapsed time per iteration (s): 15.21 | learning rate: 1.477E-05 | global batch size:    16 | lm loss: 6.018933E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2818/  128728 | consumed samples:        45088 | consumed tokens:     92340224 | elapsed time per iteration (s): 15.24 | learning rate: 1.477E-05 | global batch size:    16 | lm loss: 6.134639E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2819/  128728 | consumed samples:        45104 | consumed tokens:     92372992 | elapsed time per iteration (s): 15.21 | learning rate: 1.478E-05 | global batch size:    16 | lm loss: 5.952758E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2820/  128728 | consumed samples:        45120 | consumed tokens:     92405760 | elapsed time per iteration (s): 15.24 | learning rate: 1.478E-05 | global batch size:    16 | lm loss: 6.103268E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2821/  128728 | consumed samples:        45136 | consumed tokens:     92438528 | elapsed time per iteration (s): 15.24 | learning rate: 1.479E-05 | global batch size:    16 | lm loss: 5.782512E+00 | grad norm: 1.492 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2822/  128728 | consumed samples:        45152 | consumed tokens:     92471296 | elapsed time per iteration (s): 15.23 | learning rate: 1.480E-05 | global batch size:    16 | lm loss: 6.080799E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2823/  128728 | consumed samples:        45168 | consumed tokens:     92504064 | elapsed time per iteration (s): 15.21 | learning rate: 1.480E-05 | global batch size:    16 | lm loss: 6.054215E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2824/  128728 | consumed samples:        45184 | consumed tokens:     92536832 | elapsed time per iteration (s): 15.19 | learning rate: 1.481E-05 | global batch size:    16 | lm loss: 6.130510E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2825/  128728 | consumed samples:        45200 | consumed tokens:     92569600 | elapsed time per iteration (s): 15.19 | learning rate: 1.481E-05 | global batch size:    16 | lm loss: 6.226121E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2826/  128728 | consumed samples:        45216 | consumed tokens:     92602368 | elapsed time per iteration (s): 15.16 | learning rate: 1.482E-05 | global batch size:    16 | lm loss: 5.877883E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2827/  128728 | consumed samples:        45232 | consumed tokens:     92635136 | elapsed time per iteration (s): 15.23 | learning rate: 1.482E-05 | global batch size:    16 | lm loss: 5.866010E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2828/  128728 | consumed samples:        45248 | consumed tokens:     92667904 | elapsed time per iteration (s): 15.21 | learning rate: 1.483E-05 | global batch size:    16 | lm loss: 6.033381E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2829/  128728 | consumed samples:        45264 | consumed tokens:     92700672 | elapsed time per iteration (s): 15.22 | learning rate: 1.483E-05 | global batch size:    16 | lm loss: 6.256545E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2830/  128728 | consumed samples:        45280 | consumed tokens:     92733440 | elapsed time per iteration (s): 15.21 | learning rate: 1.484E-05 | global batch size:    16 | lm loss: 5.981166E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2831/  128728 | consumed samples:        45296 | consumed tokens:     92766208 | elapsed time per iteration (s): 15.24 | learning rate: 1.484E-05 | global batch size:    16 | lm loss: 6.093549E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2832/  128728 | consumed samples:        45312 | consumed tokens:     92798976 | elapsed time per iteration (s): 15.21 | learning rate: 1.485E-05 | global batch size:    16 | lm loss: 5.899080E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2833/  128728 | consumed samples:        45328 | consumed tokens:     92831744 | elapsed time per iteration (s): 15.21 | learning rate: 1.485E-05 | global batch size:    16 | lm loss: 6.259049E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2834/  128728 | consumed samples:        45344 | consumed tokens:     92864512 | elapsed time per iteration (s): 15.23 | learning rate: 1.486E-05 | global batch size:    16 | lm loss: 5.930161E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2835/  128728 | consumed samples:        45360 | consumed tokens:     92897280 | elapsed time per iteration (s): 15.20 | learning rate: 1.486E-05 | global batch size:    16 | lm loss: 6.179988E+00 | grad norm: 0.994 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2836/  128728 | consumed samples:        45376 | consumed tokens:     92930048 | elapsed time per iteration (s): 15.24 | learning rate: 1.487E-05 | global batch size:    16 | lm loss: 5.902924E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2837/  128728 | consumed samples:        45392 | consumed tokens:     92962816 | elapsed time per iteration (s): 15.23 | learning rate: 1.487E-05 | global batch size:    16 | lm loss: 5.806733E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2838/  128728 | consumed samples:        45408 | consumed tokens:     92995584 | elapsed time per iteration (s): 15.22 | learning rate: 1.488E-05 | global batch size:    16 | lm loss: 5.926982E+00 | grad norm: 1.011 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2839/  128728 | consumed samples:        45424 | consumed tokens:     93028352 | elapsed time per iteration (s): 15.24 | learning rate: 1.488E-05 | global batch size:    16 | lm loss: 5.809728E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2840/  128728 | consumed samples:        45440 | consumed tokens:     93061120 | elapsed time per iteration (s): 15.24 | learning rate: 1.489E-05 | global batch size:    16 | lm loss: 5.952487E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2841/  128728 | consumed samples:        45456 | consumed tokens:     93093888 | elapsed time per iteration (s): 15.23 | learning rate: 1.490E-05 | global batch size:    16 | lm loss: 6.089927E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2842/  128728 | consumed samples:        45472 | consumed tokens:     93126656 | elapsed time per iteration (s): 15.22 | learning rate: 1.490E-05 | global batch size:    16 | lm loss: 5.907791E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2843/  128728 | consumed samples:        45488 | consumed tokens:     93159424 | elapsed time per iteration (s): 15.26 | learning rate: 1.491E-05 | global batch size:    16 | lm loss: 5.926930E+00 | grad norm: 0.647 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2844/  128728 | consumed samples:        45504 | consumed tokens:     93192192 | elapsed time per iteration (s): 15.23 | learning rate: 1.491E-05 | global batch size:    16 | lm loss: 5.910907E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2845/  128728 | consumed samples:        45520 | consumed tokens:     93224960 | elapsed time per iteration (s): 15.16 | learning rate: 1.492E-05 | global batch size:    16 | lm loss: 6.070807E+00 | grad norm: 0.936 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2846/  128728 | consumed samples:        45536 | consumed tokens:     93257728 | elapsed time per iteration (s): 15.22 | learning rate: 1.492E-05 | global batch size:    16 | lm loss: 5.915307E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2847/  128728 | consumed samples:        45552 | consumed tokens:     93290496 | elapsed time per iteration (s): 15.22 | learning rate: 1.493E-05 | global batch size:    16 | lm loss: 6.010011E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2848/  128728 | consumed samples:        45568 | consumed tokens:     93323264 | elapsed time per iteration (s): 15.23 | learning rate: 1.493E-05 | global batch size:    16 | lm loss: 5.922984E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2849/  128728 | consumed samples:        45584 | consumed tokens:     93356032 | elapsed time per iteration (s): 15.19 | learning rate: 1.494E-05 | global batch size:    16 | lm loss: 6.111296E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2850/  128728 | consumed samples:        45600 | consumed tokens:     93388800 | elapsed time per iteration (s): 15.22 | learning rate: 1.494E-05 | global batch size:    16 | lm loss: 6.022770E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2851/  128728 | consumed samples:        45616 | consumed tokens:     93421568 | elapsed time per iteration (s): 15.22 | learning rate: 1.495E-05 | global batch size:    16 | lm loss: 5.970350E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2852/  128728 | consumed samples:        45632 | consumed tokens:     93454336 | elapsed time per iteration (s): 15.20 | learning rate: 1.495E-05 | global batch size:    16 | lm loss: 6.093951E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2853/  128728 | consumed samples:        45648 | consumed tokens:     93487104 | elapsed time per iteration (s): 15.24 | learning rate: 1.496E-05 | global batch size:    16 | lm loss: 5.879686E+00 | grad norm: 1.039 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2854/  128728 | consumed samples:        45664 | consumed tokens:     93519872 | elapsed time per iteration (s): 15.19 | learning rate: 1.496E-05 | global batch size:    16 | lm loss: 5.586125E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     2855/  128728 | consumed samples:        45680 | consumed tokens:     93552640 | elapsed time per iteration (s): 15.22 | learning rate: 1.497E-05 | global batch size:    16 | lm loss: 5.921970E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2856/  128728 | consumed samples:        45696 | consumed tokens:     93585408 | elapsed time per iteration (s): 15.13 | learning rate: 1.497E-05 | global batch size:    16 | lm loss: 5.962622E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     2857/  128728 | consumed samples:        45712 | consumed tokens:     93618176 | elapsed time per iteration (s): 15.22 | learning rate: 1.498E-05 | global batch size:    16 | lm loss: 6.157983E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2858/  128728 | consumed samples:        45728 | consumed tokens:     93650944 | elapsed time per iteration (s): 15.21 | learning rate: 1.498E-05 | global batch size:    16 | lm loss: 5.974092E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2859/  128728 | consumed samples:        45744 | consumed tokens:     93683712 | elapsed time per iteration (s): 15.21 | learning rate: 1.499E-05 | global batch size:    16 | lm loss: 5.760711E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2860/  128728 | consumed samples:        45760 | consumed tokens:     93716480 | elapsed time per iteration (s): 15.20 | learning rate: 1.499E-05 | global batch size:    16 | lm loss: 6.026981E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2861/  128728 | consumed samples:        45776 | consumed tokens:     93749248 | elapsed time per iteration (s): 15.20 | learning rate: 1.500E-05 | global batch size:    16 | lm loss: 5.793530E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2862/  128728 | consumed samples:        45792 | consumed tokens:     93782016 | elapsed time per iteration (s): 15.22 | learning rate: 1.501E-05 | global batch size:    16 | lm loss: 5.890173E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2863/  128728 | consumed samples:        45808 | consumed tokens:     93814784 | elapsed time per iteration (s): 15.22 | learning rate: 1.501E-05 | global batch size:    16 | lm loss: 6.015519E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2864/  128728 | consumed samples:        45824 | consumed tokens:     93847552 | elapsed time per iteration (s): 15.24 | learning rate: 1.502E-05 | global batch size:    16 | lm loss: 6.149529E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2865/  128728 | consumed samples:        45840 | consumed tokens:     93880320 | elapsed time per iteration (s): 15.18 | learning rate: 1.502E-05 | global batch size:    16 | lm loss: 6.066201E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2866/  128728 | consumed samples:        45856 | consumed tokens:     93913088 | elapsed time per iteration (s): 15.25 | learning rate: 1.503E-05 | global batch size:    16 | lm loss: 6.205139E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2867/  128728 | consumed samples:        45872 | consumed tokens:     93945856 | elapsed time per iteration (s): 15.25 | learning rate: 1.503E-05 | global batch size:    16 | lm loss: 6.108381E+00 | grad norm: 0.631 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2868/  128728 | consumed samples:        45888 | consumed tokens:     93978624 | elapsed time per iteration (s): 15.17 | learning rate: 1.504E-05 | global batch size:    16 | lm loss: 5.996854E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2869/  128728 | consumed samples:        45904 | consumed tokens:     94011392 | elapsed time per iteration (s): 15.18 | learning rate: 1.504E-05 | global batch size:    16 | lm loss: 5.922822E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2870/  128728 | consumed samples:        45920 | consumed tokens:     94044160 | elapsed time per iteration (s): 15.22 | learning rate: 1.505E-05 | global batch size:    16 | lm loss: 6.114247E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2871/  128728 | consumed samples:        45936 | consumed tokens:     94076928 | elapsed time per iteration (s): 15.19 | learning rate: 1.505E-05 | global batch size:    16 | lm loss: 6.018162E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2872/  128728 | consumed samples:        45952 | consumed tokens:     94109696 | elapsed time per iteration (s): 15.21 | learning rate: 1.506E-05 | global batch size:    16 | lm loss: 5.803544E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2873/  128728 | consumed samples:        45968 | consumed tokens:     94142464 | elapsed time per iteration (s): 15.21 | learning rate: 1.506E-05 | global batch size:    16 | lm loss: 5.869973E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2874/  128728 | consumed samples:        45984 | consumed tokens:     94175232 | elapsed time per iteration (s): 15.22 | learning rate: 1.507E-05 | global batch size:    16 | lm loss: 6.040289E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2875/  128728 | consumed samples:        46000 | consumed tokens:     94208000 | elapsed time per iteration (s): 15.20 | learning rate: 1.507E-05 | global batch size:    16 | lm loss: 5.794731E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2876/  128728 | consumed samples:        46016 | consumed tokens:     94240768 | elapsed time per iteration (s): 15.19 | learning rate: 1.508E-05 | global batch size:    16 | lm loss: 6.144478E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2877/  128728 | consumed samples:        46032 | consumed tokens:     94273536 | elapsed time per iteration (s): 15.23 | learning rate: 1.508E-05 | global batch size:    16 | lm loss: 5.903439E+00 | grad norm: 0.629 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2878/  128728 | consumed samples:        46048 | consumed tokens:     94306304 | elapsed time per iteration (s): 15.23 | learning rate: 1.509E-05 | global batch size:    16 | lm loss: 5.949089E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2879/  128728 | consumed samples:        46064 | consumed tokens:     94339072 | elapsed time per iteration (s): 15.18 | learning rate: 1.509E-05 | global batch size:    16 | lm loss: 5.951438E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2880/  128728 | consumed samples:        46080 | consumed tokens:     94371840 | elapsed time per iteration (s): 15.18 | learning rate: 1.510E-05 | global batch size:    16 | lm loss: 5.964561E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2881/  128728 | consumed samples:        46096 | consumed tokens:     94404608 | elapsed time per iteration (s): 15.22 | learning rate: 1.510E-05 | global batch size:    16 | lm loss: 5.830142E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2882/  128728 | consumed samples:        46112 | consumed tokens:     94437376 | elapsed time per iteration (s): 15.22 | learning rate: 1.511E-05 | global batch size:    16 | lm loss: 6.060420E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2883/  128728 | consumed samples:        46128 | consumed tokens:     94470144 | elapsed time per iteration (s): 15.17 | learning rate: 1.512E-05 | global batch size:    16 | lm loss: 5.988078E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2884/  128728 | consumed samples:        46144 | consumed tokens:     94502912 | elapsed time per iteration (s): 15.24 | learning rate: 1.512E-05 | global batch size:    16 | lm loss: 6.100799E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2885/  128728 | consumed samples:        46160 | consumed tokens:     94535680 | elapsed time per iteration (s): 15.16 | learning rate: 1.513E-05 | global batch size:    16 | lm loss: 6.090507E+00 | grad norm: 1.120 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2886/  128728 | consumed samples:        46176 | consumed tokens:     94568448 | elapsed time per iteration (s): 15.22 | learning rate: 1.513E-05 | global batch size:    16 | lm loss: 5.873533E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2887/  128728 | consumed samples:        46192 | consumed tokens:     94601216 | elapsed time per iteration (s): 15.22 | learning rate: 1.514E-05 | global batch size:    16 | lm loss: 5.988422E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2888/  128728 | consumed samples:        46208 | consumed tokens:     94633984 | elapsed time per iteration (s): 15.26 | learning rate: 1.514E-05 | global batch size:    16 | lm loss: 5.750258E+00 | grad norm: 1.020 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2889/  128728 | consumed samples:        46224 | consumed tokens:     94666752 | elapsed time per iteration (s): 15.22 | learning rate: 1.515E-05 | global batch size:    16 | lm loss: 5.921540E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2890/  128728 | consumed samples:        46240 | consumed tokens:     94699520 | elapsed time per iteration (s): 15.21 | learning rate: 1.515E-05 | global batch size:    16 | lm loss: 6.116239E+00 | grad norm: 0.683 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2891/  128728 | consumed samples:        46256 | consumed tokens:     94732288 | elapsed time per iteration (s): 15.19 | learning rate: 1.516E-05 | global batch size:    16 | lm loss: 6.022903E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2892/  128728 | consumed samples:        46272 | consumed tokens:     94765056 | elapsed time per iteration (s): 15.25 | learning rate: 1.516E-05 | global batch size:    16 | lm loss: 6.116355E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2893/  128728 | consumed samples:        46288 | consumed tokens:     94797824 | elapsed time per iteration (s): 15.22 | learning rate: 1.517E-05 | global batch size:    16 | lm loss: 5.981586E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2894/  128728 | consumed samples:        46304 | consumed tokens:     94830592 | elapsed time per iteration (s): 15.24 | learning rate: 1.517E-05 | global batch size:    16 | lm loss: 6.004777E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2895/  128728 | consumed samples:        46320 | consumed tokens:     94863360 | elapsed time per iteration (s): 15.20 | learning rate: 1.518E-05 | global batch size:    16 | lm loss: 6.011148E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2896/  128728 | consumed samples:        46336 | consumed tokens:     94896128 | elapsed time per iteration (s): 15.20 | learning rate: 1.518E-05 | global batch size:    16 | lm loss: 5.884268E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2897/  128728 | consumed samples:        46352 | consumed tokens:     94928896 | elapsed time per iteration (s): 15.26 | learning rate: 1.519E-05 | global batch size:    16 | lm loss: 5.814329E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2898/  128728 | consumed samples:        46368 | consumed tokens:     94961664 | elapsed time per iteration (s): 15.23 | learning rate: 1.519E-05 | global batch size:    16 | lm loss: 6.280364E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2899/  128728 | consumed samples:        46384 | consumed tokens:     94994432 | elapsed time per iteration (s): 15.22 | learning rate: 1.520E-05 | global batch size:    16 | lm loss: 5.785411E+00 | grad norm: 0.650 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2900/  128728 | consumed samples:        46400 | consumed tokens:     95027200 | elapsed time per iteration (s): 15.20 | learning rate: 1.520E-05 | global batch size:    16 | lm loss: 6.041264E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2901/  128728 | consumed samples:        46416 | consumed tokens:     95059968 | elapsed time per iteration (s): 15.22 | learning rate: 1.521E-05 | global batch size:    16 | lm loss: 5.860376E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2902/  128728 | consumed samples:        46432 | consumed tokens:     95092736 | elapsed time per iteration (s): 15.26 | learning rate: 1.521E-05 | global batch size:    16 | lm loss: 5.820327E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2903/  128728 | consumed samples:        46448 | consumed tokens:     95125504 | elapsed time per iteration (s): 15.21 | learning rate: 1.522E-05 | global batch size:    16 | lm loss: 5.791872E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2904/  128728 | consumed samples:        46464 | consumed tokens:     95158272 | elapsed time per iteration (s): 15.22 | learning rate: 1.523E-05 | global batch size:    16 | lm loss: 5.807111E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2905/  128728 | consumed samples:        46480 | consumed tokens:     95191040 | elapsed time per iteration (s): 15.18 | learning rate: 1.523E-05 | global batch size:    16 | lm loss: 5.866320E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2906/  128728 | consumed samples:        46496 | consumed tokens:     95223808 | elapsed time per iteration (s): 15.25 | learning rate: 1.524E-05 | global batch size:    16 | lm loss: 6.055687E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2907/  128728 | consumed samples:        46512 | consumed tokens:     95256576 | elapsed time per iteration (s): 15.22 | learning rate: 1.524E-05 | global batch size:    16 | lm loss: 5.993578E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2908/  128728 | consumed samples:        46528 | consumed tokens:     95289344 | elapsed time per iteration (s): 15.20 | learning rate: 1.525E-05 | global batch size:    16 | lm loss: 6.036336E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2909/  128728 | consumed samples:        46544 | consumed tokens:     95322112 | elapsed time per iteration (s): 15.23 | learning rate: 1.525E-05 | global batch size:    16 | lm loss: 5.817921E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2910/  128728 | consumed samples:        46560 | consumed tokens:     95354880 | elapsed time per iteration (s): 15.24 | learning rate: 1.526E-05 | global batch size:    16 | lm loss: 6.041966E+00 | grad norm: 0.965 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2911/  128728 | consumed samples:        46576 | consumed tokens:     95387648 | elapsed time per iteration (s): 15.24 | learning rate: 1.526E-05 | global batch size:    16 | lm loss: 5.893199E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2912/  128728 | consumed samples:        46592 | consumed tokens:     95420416 | elapsed time per iteration (s): 15.22 | learning rate: 1.527E-05 | global batch size:    16 | lm loss: 5.920829E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2913/  128728 | consumed samples:        46608 | consumed tokens:     95453184 | elapsed time per iteration (s): 15.21 | learning rate: 1.527E-05 | global batch size:    16 | lm loss: 6.020864E+00 | grad norm: 0.999 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2914/  128728 | consumed samples:        46624 | consumed tokens:     95485952 | elapsed time per iteration (s): 15.23 | learning rate: 1.528E-05 | global batch size:    16 | lm loss: 5.852686E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2915/  128728 | consumed samples:        46640 | consumed tokens:     95518720 | elapsed time per iteration (s): 15.28 | learning rate: 1.528E-05 | global batch size:    16 | lm loss: 6.035823E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     2916/  128728 | consumed samples:        46656 | consumed tokens:     95551488 | elapsed time per iteration (s): 15.20 | learning rate: 1.529E-05 | global batch size:    16 | lm loss: 5.785281E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2917/  128728 | consumed samples:        46672 | consumed tokens:     95584256 | elapsed time per iteration (s): 15.23 | learning rate: 1.529E-05 | global batch size:    16 | lm loss: 5.900357E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2918/  128728 | consumed samples:        46688 | consumed tokens:     95617024 | elapsed time per iteration (s): 15.23 | learning rate: 1.530E-05 | global batch size:    16 | lm loss: 6.010538E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2919/  128728 | consumed samples:        46704 | consumed tokens:     95649792 | elapsed time per iteration (s): 15.24 | learning rate: 1.530E-05 | global batch size:    16 | lm loss: 5.867478E+00 | grad norm: 0.659 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2920/  128728 | consumed samples:        46720 | consumed tokens:     95682560 | elapsed time per iteration (s): 15.25 | learning rate: 1.531E-05 | global batch size:    16 | lm loss: 5.778384E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     2921/  128728 | consumed samples:        46736 | consumed tokens:     95715328 | elapsed time per iteration (s): 15.20 | learning rate: 1.531E-05 | global batch size:    16 | lm loss: 5.962376E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2922/  128728 | consumed samples:        46752 | consumed tokens:     95748096 | elapsed time per iteration (s): 15.21 | learning rate: 1.532E-05 | global batch size:    16 | lm loss: 5.962127E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2923/  128728 | consumed samples:        46768 | consumed tokens:     95780864 | elapsed time per iteration (s): 15.14 | learning rate: 1.532E-05 | global batch size:    16 | lm loss: 5.935369E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     2924/  128728 | consumed samples:        46784 | consumed tokens:     95813632 | elapsed time per iteration (s): 15.23 | learning rate: 1.533E-05 | global batch size:    16 | lm loss: 5.939228E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2925/  128728 | consumed samples:        46800 | consumed tokens:     95846400 | elapsed time per iteration (s): 15.19 | learning rate: 1.534E-05 | global batch size:    16 | lm loss: 5.899131E+00 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2926/  128728 | consumed samples:        46816 | consumed tokens:     95879168 | elapsed time per iteration (s): 15.20 | learning rate: 1.534E-05 | global batch size:    16 | lm loss: 5.991677E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2927/  128728 | consumed samples:        46832 | consumed tokens:     95911936 | elapsed time per iteration (s): 15.23 | learning rate: 1.535E-05 | global batch size:    16 | lm loss: 6.101864E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2928/  128728 | consumed samples:        46848 | consumed tokens:     95944704 | elapsed time per iteration (s): 15.23 | learning rate: 1.535E-05 | global batch size:    16 | lm loss: 5.901472E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     2929/  128728 | consumed samples:        46864 | consumed tokens:     95977472 | elapsed time per iteration (s): 15.22 | learning rate: 1.536E-05 | global batch size:    16 | lm loss: 6.057093E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2930/  128728 | consumed samples:        46880 | consumed tokens:     96010240 | elapsed time per iteration (s): 15.22 | learning rate: 1.536E-05 | global batch size:    16 | lm loss: 5.913117E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2931/  128728 | consumed samples:        46896 | consumed tokens:     96043008 | elapsed time per iteration (s): 15.20 | learning rate: 1.537E-05 | global batch size:    16 | lm loss: 5.945035E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2932/  128728 | consumed samples:        46912 | consumed tokens:     96075776 | elapsed time per iteration (s): 15.21 | learning rate: 1.537E-05 | global batch size:    16 | lm loss: 5.830423E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2933/  128728 | consumed samples:        46928 | consumed tokens:     96108544 | elapsed time per iteration (s): 15.22 | learning rate: 1.538E-05 | global batch size:    16 | lm loss: 6.088906E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2934/  128728 | consumed samples:        46944 | consumed tokens:     96141312 | elapsed time per iteration (s): 15.28 | learning rate: 1.538E-05 | global batch size:    16 | lm loss: 5.862062E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     2935/  128728 | consumed samples:        46960 | consumed tokens:     96174080 | elapsed time per iteration (s): 15.22 | learning rate: 1.539E-05 | global batch size:    16 | lm loss: 5.764572E+00 | grad norm: 0.651 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2936/  128728 | consumed samples:        46976 | consumed tokens:     96206848 | elapsed time per iteration (s): 15.23 | learning rate: 1.539E-05 | global batch size:    16 | lm loss: 5.989824E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2937/  128728 | consumed samples:        46992 | consumed tokens:     96239616 | elapsed time per iteration (s): 15.27 | learning rate: 1.540E-05 | global batch size:    16 | lm loss: 5.880247E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     2938/  128728 | consumed samples:        47008 | consumed tokens:     96272384 | elapsed time per iteration (s): 15.19 | learning rate: 1.540E-05 | global batch size:    16 | lm loss: 5.923770E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2939/  128728 | consumed samples:        47024 | consumed tokens:     96305152 | elapsed time per iteration (s): 15.22 | learning rate: 1.541E-05 | global batch size:    16 | lm loss: 5.879602E+00 | grad norm: 0.654 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2940/  128728 | consumed samples:        47040 | consumed tokens:     96337920 | elapsed time per iteration (s): 15.23 | learning rate: 1.541E-05 | global batch size:    16 | lm loss: 5.848747E+00 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2941/  128728 | consumed samples:        47056 | consumed tokens:     96370688 | elapsed time per iteration (s): 15.22 | learning rate: 1.542E-05 | global batch size:    16 | lm loss: 5.908345E+00 | grad norm: 0.626 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2942/  128728 | consumed samples:        47072 | consumed tokens:     96403456 | elapsed time per iteration (s): 15.25 | learning rate: 1.542E-05 | global batch size:    16 | lm loss: 5.707866E+00 | grad norm: 0.669 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2943/  128728 | consumed samples:        47088 | consumed tokens:     96436224 | elapsed time per iteration (s): 15.22 | learning rate: 1.543E-05 | global batch size:    16 | lm loss: 6.033948E+00 | grad norm: 0.640 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2944/  128728 | consumed samples:        47104 | consumed tokens:     96468992 | elapsed time per iteration (s): 15.20 | learning rate: 1.544E-05 | global batch size:    16 | lm loss: 5.967467E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2945/  128728 | consumed samples:        47120 | consumed tokens:     96501760 | elapsed time per iteration (s): 15.20 | learning rate: 1.544E-05 | global batch size:    16 | lm loss: 5.921725E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2946/  128728 | consumed samples:        47136 | consumed tokens:     96534528 | elapsed time per iteration (s): 15.22 | learning rate: 1.545E-05 | global batch size:    16 | lm loss: 5.984942E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2947/  128728 | consumed samples:        47152 | consumed tokens:     96567296 | elapsed time per iteration (s): 15.22 | learning rate: 1.545E-05 | global batch size:    16 | lm loss: 5.708416E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2948/  128728 | consumed samples:        47168 | consumed tokens:     96600064 | elapsed time per iteration (s): 15.26 | learning rate: 1.546E-05 | global batch size:    16 | lm loss: 5.940567E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2949/  128728 | consumed samples:        47184 | consumed tokens:     96632832 | elapsed time per iteration (s): 15.21 | learning rate: 1.546E-05 | global batch size:    16 | lm loss: 5.731608E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2950/  128728 | consumed samples:        47200 | consumed tokens:     96665600 | elapsed time per iteration (s): 15.19 | learning rate: 1.547E-05 | global batch size:    16 | lm loss: 5.956516E+00 | grad norm: 0.964 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2951/  128728 | consumed samples:        47216 | consumed tokens:     96698368 | elapsed time per iteration (s): 15.26 | learning rate: 1.547E-05 | global batch size:    16 | lm loss: 6.100035E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2952/  128728 | consumed samples:        47232 | consumed tokens:     96731136 | elapsed time per iteration (s): 15.23 | learning rate: 1.548E-05 | global batch size:    16 | lm loss: 5.803092E+00 | grad norm: 0.650 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2953/  128728 | consumed samples:        47248 | consumed tokens:     96763904 | elapsed time per iteration (s): 15.23 | learning rate: 1.548E-05 | global batch size:    16 | lm loss: 5.983268E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2954/  128728 | consumed samples:        47264 | consumed tokens:     96796672 | elapsed time per iteration (s): 15.20 | learning rate: 1.549E-05 | global batch size:    16 | lm loss: 5.938457E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2955/  128728 | consumed samples:        47280 | consumed tokens:     96829440 | elapsed time per iteration (s): 15.25 | learning rate: 1.549E-05 | global batch size:    16 | lm loss: 5.933385E+00 | grad norm: 1.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2956/  128728 | consumed samples:        47296 | consumed tokens:     96862208 | elapsed time per iteration (s): 15.26 | learning rate: 1.550E-05 | global batch size:    16 | lm loss: 5.850451E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2957/  128728 | consumed samples:        47312 | consumed tokens:     96894976 | elapsed time per iteration (s): 15.20 | learning rate: 1.550E-05 | global batch size:    16 | lm loss: 5.800276E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2958/  128728 | consumed samples:        47328 | consumed tokens:     96927744 | elapsed time per iteration (s): 15.17 | learning rate: 1.551E-05 | global batch size:    16 | lm loss: 6.125942E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2959/  128728 | consumed samples:        47344 | consumed tokens:     96960512 | elapsed time per iteration (s): 15.22 | learning rate: 1.551E-05 | global batch size:    16 | lm loss: 5.967272E+00 | grad norm: 1.568 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2960/  128728 | consumed samples:        47360 | consumed tokens:     96993280 | elapsed time per iteration (s): 15.23 | learning rate: 1.552E-05 | global batch size:    16 | lm loss: 6.135997E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2961/  128728 | consumed samples:        47376 | consumed tokens:     97026048 | elapsed time per iteration (s): 15.17 | learning rate: 1.552E-05 | global batch size:    16 | lm loss: 6.001085E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     2962/  128728 | consumed samples:        47392 | consumed tokens:     97058816 | elapsed time per iteration (s): 15.25 | learning rate: 1.553E-05 | global batch size:    16 | lm loss: 6.062928E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     2963/  128728 | consumed samples:        47408 | consumed tokens:     97091584 | elapsed time per iteration (s): 15.23 | learning rate: 1.553E-05 | global batch size:    16 | lm loss: 6.055041E+00 | grad norm: 1.352 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2964/  128728 | consumed samples:        47424 | consumed tokens:     97124352 | elapsed time per iteration (s): 15.23 | learning rate: 1.554E-05 | global batch size:    16 | lm loss: 5.878264E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2965/  128728 | consumed samples:        47440 | consumed tokens:     97157120 | elapsed time per iteration (s): 15.19 | learning rate: 1.555E-05 | global batch size:    16 | lm loss: 6.206885E+00 | grad norm: 1.143 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2966/  128728 | consumed samples:        47456 | consumed tokens:     97189888 | elapsed time per iteration (s): 15.21 | learning rate: 1.555E-05 | global batch size:    16 | lm loss: 6.068411E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2967/  128728 | consumed samples:        47472 | consumed tokens:     97222656 | elapsed time per iteration (s): 15.20 | learning rate: 1.556E-05 | global batch size:    16 | lm loss: 5.927691E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2968/  128728 | consumed samples:        47488 | consumed tokens:     97255424 | elapsed time per iteration (s): 15.26 | learning rate: 1.556E-05 | global batch size:    16 | lm loss: 6.127417E+00 | grad norm: 1.294 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2969/  128728 | consumed samples:        47504 | consumed tokens:     97288192 | elapsed time per iteration (s): 15.24 | learning rate: 1.557E-05 | global batch size:    16 | lm loss: 6.099837E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2970/  128728 | consumed samples:        47520 | consumed tokens:     97320960 | elapsed time per iteration (s): 15.19 | learning rate: 1.557E-05 | global batch size:    16 | lm loss: 5.764379E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2971/  128728 | consumed samples:        47536 | consumed tokens:     97353728 | elapsed time per iteration (s): 15.22 | learning rate: 1.558E-05 | global batch size:    16 | lm loss: 5.941983E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2972/  128728 | consumed samples:        47552 | consumed tokens:     97386496 | elapsed time per iteration (s): 15.23 | learning rate: 1.558E-05 | global batch size:    16 | lm loss: 5.736744E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2973/  128728 | consumed samples:        47568 | consumed tokens:     97419264 | elapsed time per iteration (s): 15.24 | learning rate: 1.559E-05 | global batch size:    16 | lm loss: 5.593853E+00 | grad norm: 0.634 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2974/  128728 | consumed samples:        47584 | consumed tokens:     97452032 | elapsed time per iteration (s): 15.24 | learning rate: 1.559E-05 | global batch size:    16 | lm loss: 5.908027E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2975/  128728 | consumed samples:        47600 | consumed tokens:     97484800 | elapsed time per iteration (s): 15.17 | learning rate: 1.560E-05 | global batch size:    16 | lm loss: 5.938254E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2976/  128728 | consumed samples:        47616 | consumed tokens:     97517568 | elapsed time per iteration (s): 15.22 | learning rate: 1.560E-05 | global batch size:    16 | lm loss: 5.775309E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2977/  128728 | consumed samples:        47632 | consumed tokens:     97550336 | elapsed time per iteration (s): 15.23 | learning rate: 1.561E-05 | global batch size:    16 | lm loss: 6.102681E+00 | grad norm: 1.160 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2978/  128728 | consumed samples:        47648 | consumed tokens:     97583104 | elapsed time per iteration (s): 15.26 | learning rate: 1.561E-05 | global batch size:    16 | lm loss: 5.797580E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     2979/  128728 | consumed samples:        47664 | consumed tokens:     97615872 | elapsed time per iteration (s): 15.21 | learning rate: 1.562E-05 | global batch size:    16 | lm loss: 5.752298E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2980/  128728 | consumed samples:        47680 | consumed tokens:     97648640 | elapsed time per iteration (s): 15.22 | learning rate: 1.562E-05 | global batch size:    16 | lm loss: 6.039430E+00 | grad norm: 1.642 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2981/  128728 | consumed samples:        47696 | consumed tokens:     97681408 | elapsed time per iteration (s): 15.22 | learning rate: 1.563E-05 | global batch size:    16 | lm loss: 6.008101E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2982/  128728 | consumed samples:        47712 | consumed tokens:     97714176 | elapsed time per iteration (s): 15.18 | learning rate: 1.563E-05 | global batch size:    16 | lm loss: 5.872960E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     2983/  128728 | consumed samples:        47728 | consumed tokens:     97746944 | elapsed time per iteration (s): 15.19 | learning rate: 1.564E-05 | global batch size:    16 | lm loss: 6.110078E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2984/  128728 | consumed samples:        47744 | consumed tokens:     97779712 | elapsed time per iteration (s): 15.24 | learning rate: 1.564E-05 | global batch size:    16 | lm loss: 6.011197E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2985/  128728 | consumed samples:        47760 | consumed tokens:     97812480 | elapsed time per iteration (s): 15.16 | learning rate: 1.565E-05 | global batch size:    16 | lm loss: 5.898206E+00 | grad norm: 0.940 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2986/  128728 | consumed samples:        47776 | consumed tokens:     97845248 | elapsed time per iteration (s): 15.22 | learning rate: 1.566E-05 | global batch size:    16 | lm loss: 5.987176E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     2987/  128728 | consumed samples:        47792 | consumed tokens:     97878016 | elapsed time per iteration (s): 15.16 | learning rate: 1.566E-05 | global batch size:    16 | lm loss: 5.976408E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     2988/  128728 | consumed samples:        47808 | consumed tokens:     97910784 | elapsed time per iteration (s): 15.22 | learning rate: 1.567E-05 | global batch size:    16 | lm loss: 5.972953E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2989/  128728 | consumed samples:        47824 | consumed tokens:     97943552 | elapsed time per iteration (s): 15.25 | learning rate: 1.567E-05 | global batch size:    16 | lm loss: 6.006942E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     2990/  128728 | consumed samples:        47840 | consumed tokens:     97976320 | elapsed time per iteration (s): 15.19 | learning rate: 1.568E-05 | global batch size:    16 | lm loss: 5.912127E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2991/  128728 | consumed samples:        47856 | consumed tokens:     98009088 | elapsed time per iteration (s): 15.20 | learning rate: 1.568E-05 | global batch size:    16 | lm loss: 5.904402E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2992/  128728 | consumed samples:        47872 | consumed tokens:     98041856 | elapsed time per iteration (s): 15.23 | learning rate: 1.569E-05 | global batch size:    16 | lm loss: 5.815178E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2993/  128728 | consumed samples:        47888 | consumed tokens:     98074624 | elapsed time per iteration (s): 15.16 | learning rate: 1.569E-05 | global batch size:    16 | lm loss: 5.658585E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     2994/  128728 | consumed samples:        47904 | consumed tokens:     98107392 | elapsed time per iteration (s): 15.21 | learning rate: 1.570E-05 | global batch size:    16 | lm loss: 5.849427E+00 | grad norm: 1.324 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     2995/  128728 | consumed samples:        47920 | consumed tokens:     98140160 | elapsed time per iteration (s): 15.23 | learning rate: 1.570E-05 | global batch size:    16 | lm loss: 5.904146E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2996/  128728 | consumed samples:        47936 | consumed tokens:     98172928 | elapsed time per iteration (s): 15.20 | learning rate: 1.571E-05 | global batch size:    16 | lm loss: 5.926609E+00 | grad norm: 0.650 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     2997/  128728 | consumed samples:        47952 | consumed tokens:     98205696 | elapsed time per iteration (s): 15.22 | learning rate: 1.571E-05 | global batch size:    16 | lm loss: 6.086730E+00 | grad norm: 1.033 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     2998/  128728 | consumed samples:        47968 | consumed tokens:     98238464 | elapsed time per iteration (s): 15.23 | learning rate: 1.572E-05 | global batch size:    16 | lm loss: 5.667955E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     2999/  128728 | consumed samples:        47984 | consumed tokens:     98271232 | elapsed time per iteration (s): 15.21 | learning rate: 1.572E-05 | global batch size:    16 | lm loss: 5.905001E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3000/  128728 | consumed samples:        48000 | consumed tokens:     98304000 | elapsed time per iteration (s): 15.21 | learning rate: 1.573E-05 | global batch size:    16 | lm loss: 6.000812E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default0]:saving checkpoint at iteration    3000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default7]:------------------------------------------------------------------------------------------
[default7]:valid loss at iteration 3000 | lm loss value: 6.276583E+00 | lm loss PPL: 5.319677E+02 | 
[default7]:------------------------------------------------------------------------------------------
[default1]:[2022-03-03 18:40:11,519] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/mp_rank_01_model_states.pt
[default0]:[2022-03-03 18:40:11,489] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/mp_rank_00_model_states.pt
[default1]:[2022-03-03 18:40:25,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 18:40:25,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt
[default4]:[2022-03-03 18:40:25,961] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt
[default0]:[2022-03-03 18:40:26,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 18:40:26,005] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default0]:[2022-03-03 18:40:26,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt
[default6]:[2022-03-03 18:40:26,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 18:40:26,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 18:40:26,445] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt
[default6]:[2022-03-03 18:40:26,859] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt
[default1]:[2022-03-03 18:40:26,799] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 18:40:26,902] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 18:40:26,868] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt
[default5]:[2022-03-03 18:40:27,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default2]:[2022-03-03 18:40:27,142] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt
[default2]:[2022-03-03 18:40:27,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 18:40:27,139] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default7]:[2022-03-03 18:40:27,480] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 18:40:27,665] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt
[default6]:[2022-03-03 18:40:27,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt
[default3]:[2022-03-03 18:40:27,666] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 18:40:27,791] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 18:40:27,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 18:40:27,933] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt
[default7]:[2022-03-03 18:40:28,213] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt
[default4]:[2022-03-03 18:40:28,223] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt
[default4]:[2022-03-03 18:40:28,222] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt
[default2]:[2022-03-03 18:40:28,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default4]:[2022-03-03 18:40:28,343] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 18:40:28,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt
[default5]:[2022-03-03 18:40:28,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default7]:[2022-03-03 18:40:28,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt
[default5]:[2022-03-03 18:40:28,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt
[default6]:[2022-03-03 18:40:28,556] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt
[default1]:[2022-03-03 18:40:28,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt
[default2]:[2022-03-03 18:40:28,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt
[default4]:[2022-03-03 18:40:28,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt
[default2]:[2022-03-03 18:40:28,880] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt
[default7]:[2022-03-03 18:40:28,874] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt
[default5]:[2022-03-03 18:40:28,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt
[default6]:[2022-03-03 18:40:28,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 18:40:28,957] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt
[default3]:[2022-03-03 18:40:29,134] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default0]:[2022-03-03 18:40:29,218] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt
[default1]:[2022-03-03 18:40:29,238] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt
[default7]:[2022-03-03 18:40:29,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt
[default5]:[2022-03-03 18:40:29,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt
[default6]:[2022-03-03 18:40:29,325] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt
[default1]:[2022-03-03 18:40:29,350] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt
[default1]:[2022-03-03 18:40:29,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 18:40:29,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt
[default0]:[2022-03-03 18:40:29,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default2]:[2022-03-03 18:40:29,488] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default0]:[2022-03-03 18:40:29,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 18:40:29,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt
[default3]:[2022-03-03 18:40:29,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt
[default3]:[2022-03-03 18:40:29,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 18:40:29,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt
[default2]:[2022-03-03 18:40:29,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default1]:[2022-03-03 18:40:29,771] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt
[default0]:[2022-03-03 18:40:29,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt
[default7]:[2022-03-03 18:40:29,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 18:40:29,798] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt
[default4]:[2022-03-03 18:40:29,771] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
[default5]:[2022-03-03 18:40:29,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt
[default2]:[2022-03-03 18:40:29,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt
[default0]:[2022-03-03 18:40:29,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 18:40:29,950] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 18:40:29,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default3]:[2022-03-03 18:40:30,005] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default0]:[2022-03-03 18:40:29,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt
[default3]:[2022-03-03 18:40:29,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 18:40:30,019] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt
[default1]:[2022-03-03 18:40:29,969] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt
[default5]:[2022-03-03 18:40:30,116] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt
[default6]:[2022-03-03 18:40:30,222] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 18:40:30,224] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt
[default3]:[2022-03-03 18:40:30,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt
[default5]:[2022-03-03 18:40:30,288] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 18:40:30,293] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default7]:[2022-03-03 18:40:30,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default5]:[2022-03-03 18:40:30,348] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 18:40:30,403] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 18:40:30,326] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt
[default0]:[2022-03-03 18:40:30,398] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default2]:[2022-03-03 18:40:30,354] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt
[default4]:[2022-03-03 18:40:30,430] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt
[default0]:[2022-03-03 18:40:30,489] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt
[default6]:[2022-03-03 18:40:30,574] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default4]:[2022-03-03 18:40:30,582] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 18:40:30,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt
[default1]:[2022-03-03 18:40:30,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default5]:[2022-03-03 18:40:30,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt
[default5]:[2022-03-03 18:40:30,713] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt
[default6]:[2022-03-03 18:40:30,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default1]:[2022-03-03 18:40:30,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt
[default7]:[2022-03-03 18:40:30,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 18:40:30,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt
[default0]:[2022-03-03 18:40:30,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default7]:[2022-03-03 18:40:30,840] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt
[default1]:[2022-03-03 18:40:30,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt
[default6]:[2022-03-03 18:40:30,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt
[default1]:[2022-03-03 18:40:30,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt
[default4]:[2022-03-03 18:40:30,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt
[default4]:[2022-03-03 18:40:30,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default6]:[2022-03-03 18:40:30,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default1]:[2022-03-03 18:40:31,023] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 18:40:31,011] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 18:40:30,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 18:40:31,056] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt
[default5]:[2022-03-03 18:40:30,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 18:40:31,072] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 18:40:30,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt
[default3]:[2022-03-03 18:40:31,109] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt
[default2]:[2022-03-03 18:40:31,036] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt
[default1]:[2022-03-03 18:40:31,076] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default0]:[2022-03-03 18:40:31,090] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default4]:[2022-03-03 18:40:31,086] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt
[default5]:[2022-03-03 18:40:31,054] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default7]:[2022-03-03 18:40:31,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt
[default5]:[2022-03-03 18:40:31,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt
[default0]:[2022-03-03 18:40:31,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
[default4]:[2022-03-03 18:40:31,092] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 18:40:31,223] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 18:40:31,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default7]:[2022-03-03 18:40:31,333] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 18:40:31,358] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt
[default6]:[2022-03-03 18:40:31,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt
[default6]:[2022-03-03 18:40:31,452] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default5]:[2022-03-03 18:40:31,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 18:40:31,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default0]:[2022-03-03 18:40:31,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default5]:[2022-03-03 18:40:31,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 18:40:31,438] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default1]:[2022-03-03 18:40:31,484] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 18:40:31,484] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt
[default6]:[2022-03-03 18:40:31,465] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt
[default2]:[2022-03-03 18:40:31,482] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 18:40:31,556] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 18:40:31,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default2]:[2022-03-03 18:40:31,507] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 18:40:31,607] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt
[default5]:[2022-03-03 18:40:31,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 18:40:31,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt
[default2]:[2022-03-03 18:40:31,669] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 18:40:31,721] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 18:40:31,825] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt
[default3]:[2022-03-03 18:40:31,878] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt
[default0]:[2022-03-03 18:40:31,843] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt
[default2]:[2022-03-03 18:40:31,926] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt
[default7]:[2022-03-03 18:40:31,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default4]:[2022-03-03 18:40:32,068] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 18:40:32,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt
[default3]:[2022-03-03 18:40:32,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt
[default5]:[2022-03-03 18:40:32,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt
[default1]:[2022-03-03 18:40:32,318] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt
[default4]:[2022-03-03 18:40:32,245] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt
[default3]:[2022-03-03 18:40:32,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt
[default4]:[2022-03-03 18:40:32,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt
[default7]:[2022-03-03 18:40:32,282] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt
[default7]:[2022-03-03 18:40:32,285] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt
[default1]:[2022-03-03 18:40:32,358] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 18:40:32,366] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt
[default1]:[2022-03-03 18:40:32,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 18:40:32,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default6]:[2022-03-03 18:40:32,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt
[default0]:[2022-03-03 18:40:32,366] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt
[default5]:[2022-03-03 18:40:32,411] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt
[default1]:[2022-03-03 18:40:32,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt
[default0]:[2022-03-03 18:40:32,489] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default5]:[2022-03-03 18:40:32,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default3]:[2022-03-03 18:40:32,557] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default4]:[2022-03-03 18:40:32,572] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default2]:[2022-03-03 18:40:32,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt
[default6]:[2022-03-03 18:40:32,798] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt
[default0]:[2022-03-03 18:40:32,992] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default5]:[2022-03-03 18:40:32,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default7]:[2022-03-03 18:40:33,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt
[default6]:[2022-03-03 18:40:33,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt
[default5]:[2022-03-03 18:40:33,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default3]:[2022-03-03 18:40:32,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 18:40:33,013] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default1]:[2022-03-03 18:40:33,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default1]:[2022-03-03 18:40:32,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 18:40:33,097] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default3]:[2022-03-03 18:40:33,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default2]:[2022-03-03 18:40:33,146] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 18:40:33,193] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt
[default6]:[2022-03-03 18:40:33,233] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt
[default2]:[2022-03-03 18:40:33,241] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt
[default4]:[2022-03-03 18:40:33,266] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default4]:[2022-03-03 18:40:33,217] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default5]:[2022-03-03 18:40:33,400] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default1]:[2022-03-03 18:40:33,455] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 18:40:33,499] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 18:40:33,543] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt
[default4]:[2022-03-03 18:40:33,590] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt
[default1]:[2022-03-03 18:40:33,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt
[default6]:[2022-03-03 18:40:33,612] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt
[default1]:[2022-03-03 18:40:33,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt
[default0]:[2022-03-03 18:40:33,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
[default1]:[2022-03-03 18:40:33,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 18:40:33,659] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default2]:[2022-03-03 18:40:33,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt
[default0]:[2022-03-03 18:40:33,736] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt
[default0]:[2022-03-03 18:40:33,780] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 18:40:33,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt
[default0]:[2022-03-03 18:40:33,759] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 18:40:33,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 18:40:33,706] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt
[default1]:[2022-03-03 18:40:33,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt
[default7]:[2022-03-03 18:40:33,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 18:40:33,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt
[default6]:[2022-03-03 18:40:33,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 18:40:33,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt
[default7]:[2022-03-03 18:40:33,894] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default3]:[2022-03-03 18:40:33,912] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt
[default2]:[2022-03-03 18:40:33,974] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt
[default4]:[2022-03-03 18:40:33,902] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 18:40:33,996] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default7]:[2022-03-03 18:40:34,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 18:40:34,129] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default2]:[2022-03-03 18:40:34,157] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default6]:[2022-03-03 18:40:34,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default0]:[2022-03-03 18:40:34,147] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt
[default6]:[2022-03-03 18:40:34,152] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default6]:[2022-03-03 18:40:34,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt
[default1]:[2022-03-03 18:40:34,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt
[default3]:[2022-03-03 18:40:34,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt
[default1]:[2022-03-03 18:40:34,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt
[default6]:[2022-03-03 18:40:34,278] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt
[default2]:[2022-03-03 18:40:34,230] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt
[default0]:[2022-03-03 18:40:34,223] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 18:40:34,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt
[default5]:[2022-03-03 18:40:34,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt
[default7]:[2022-03-03 18:40:34,307] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt
[default6]:[2022-03-03 18:40:34,441] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default3]:[2022-03-03 18:40:34,435] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default7]:[2022-03-03 18:40:34,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt
[default7]:[2022-03-03 18:40:34,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 18:40:34,387] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt
[default7]:[2022-03-03 18:40:34,476] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default2]:[2022-03-03 18:40:34,562] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default7]:[2022-03-03 18:40:34,627] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default0]:[2022-03-03 18:40:34,724] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default6]:[2022-03-03 18:40:34,761] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt
[default1]:[2022-03-03 18:40:34,786] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt
[default7]:[2022-03-03 18:40:34,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 18:40:34,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default2]:[2022-03-03 18:40:34,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default5]:[2022-03-03 18:40:34,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default4]:[2022-03-03 18:40:34,914] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default3]:[2022-03-03 18:40:34,950] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default0]:[2022-03-03 18:40:34,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt
[default0]:[2022-03-03 18:40:35,061] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default2]:[2022-03-03 18:40:35,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default2]:[2022-03-03 18:40:35,039] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 18:40:35,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 18:40:35,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt
[default0]:[2022-03-03 18:40:35,221] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 18:40:35,324] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 18:40:35,320] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 18:40:35,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt
[default4]:[2022-03-03 18:40:35,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
[default5]:[2022-03-03 18:40:35,379] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default1]:[2022-03-03 18:40:35,419] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 18:40:35,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 18:40:35,450] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt
[default6]:[2022-03-03 18:40:35,516] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default6]:[2022-03-03 18:40:35,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 18:40:35,465] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt
[default1]:[2022-03-03 18:40:35,587] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 18:40:35,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 18:40:35,699] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default3]:[2022-03-03 18:40:35,702] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt
[default3]:[2022-03-03 18:40:35,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 18:40:35,848] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 18:40:35,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt
[default3]:[2022-03-03 18:40:36,014] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt
[default2]:[2022-03-03 18:40:36,027] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 18:40:36,138] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default2]:[2022-03-03 18:40:36,102] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt
[default4]:[2022-03-03 18:40:36,126] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default3]:[2022-03-03 18:40:36,133] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt
[default0]:[2022-03-03 18:40:36,208] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt
[default7]:[2022-03-03 18:40:36,230] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt
[default2]:[2022-03-03 18:40:36,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt
[default1]:[2022-03-03 18:40:36,262] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt
[default7]:[2022-03-03 18:40:36,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default3]:[2022-03-03 18:40:36,286] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default7]:[2022-03-03 18:40:36,312] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default5]:[2022-03-03 18:40:36,281] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 18:40:36,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt
[default4]:[2022-03-03 18:40:36,415] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt
[default6]:[2022-03-03 18:40:36,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default0]:[2022-03-03 18:40:36,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default5]:[2022-03-03 18:40:36,442] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt
[default6]:[2022-03-03 18:40:36,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default3]:[2022-03-03 18:40:36,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt
[default7]:[2022-03-03 18:40:36,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default5]:[2022-03-03 18:40:36,644] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default3]:[2022-03-03 18:40:36,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default4]:[2022-03-03 18:40:36,677] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default4]:[2022-03-03 18:40:36,761] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 18:40:36,776] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default0]:[2022-03-03 18:40:36,821] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default3]:[2022-03-03 18:40:36,792] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 18:40:36,831] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default4]:[2022-03-03 18:40:36,850] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default1]:[2022-03-03 18:40:36,842] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default6]:[2022-03-03 18:40:36,885] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default5]:[2022-03-03 18:40:36,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt
[default6]:[2022-03-03 18:40:36,989] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt
[default6]:[2022-03-03 18:40:36,974] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default4]:[2022-03-03 18:40:36,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt
[default7]:[2022-03-03 18:40:36,980] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt
[default1]:[2022-03-03 18:40:36,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default1]:[2022-03-03 18:40:37,014] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 18:40:37,059] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default0]:[2022-03-03 18:40:37,121] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default1]:[2022-03-03 18:40:37,160] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 18:40:37,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default7]:[2022-03-03 18:40:37,300] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default6]:[2022-03-03 18:40:37,245] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 18:40:37,315] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default1]:[2022-03-03 18:40:37,278] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default3]:[2022-03-03 18:40:37,310] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 18:40:37,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default2]:[2022-03-03 18:40:37,322] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt
[default2]:[2022-03-03 18:40:37,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default5]:[2022-03-03 18:40:37,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default4]:[2022-03-03 18:40:37,482] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt
[default6]:[2022-03-03 18:40:37,609] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 18:40:37,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default7]:[2022-03-03 18:40:37,627] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 18:40:37,670] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 18:40:37,665] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 18:40:37,714] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 18:40:37,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default5]:[2022-03-03 18:40:37,964] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default7]:[2022-03-03 18:40:37,893] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt
[default3]:[2022-03-03 18:40:37,992] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 18:40:37,965] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default5]:[2022-03-03 18:40:37,969] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default3]:[2022-03-03 18:40:37,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default4]:[2022-03-03 18:40:38,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt
[default3]:[2022-03-03 18:40:38,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 18:40:38,123] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default4]:[2022-03-03 18:40:38,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default2]:[2022-03-03 18:40:38,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt
[default3]:[2022-03-03 18:40:38,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt
[default1]:[2022-03-03 18:40:38,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 18:40:38,631] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 18:40:38,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default4]:[2022-03-03 18:40:38,747] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default0]:[2022-03-03 18:40:38,717] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default4]:[2022-03-03 18:40:38,780] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 18:40:38,813] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default5]:[2022-03-03 18:40:38,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default5]:[2022-03-03 18:40:38,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 18:40:39,008] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default6]:[2022-03-03 18:40:39,019] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default6]:[2022-03-03 18:40:39,043] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 18:40:39,154] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default4]:[2022-03-03 18:40:39,318] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default5]:[2022-03-03 18:40:39,343] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 18:40:39,387] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt
[default7]:[2022-03-03 18:40:39,531] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default1]:[2022-03-03 18:40:39,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt
[default6]:[2022-03-03 18:40:39,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 18:40:39,910] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default4]:[2022-03-03 18:40:40,001] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default5]:[2022-03-03 18:40:40,022] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default7]:[2022-03-03 18:40:40,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default5]:[2022-03-03 18:40:40,476] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default5]:[2022-03-03 18:40:41,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default7]:[2022-03-03 18:40:41,123] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 18:40:41,400] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 18:40:41,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default6]:[2022-03-03 18:40:41,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default3]:[2022-03-03 18:40:42,500] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default2]:[2022-03-03 18:40:42,546] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default0]:[2022-03-03 18:40:44,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default7]:time (ms) | save-checkpoint: 42368.40
[default1]:[2022-03-03 18:40:44,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default0]:  successfully saved checkpoint at iteration    3000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default7]: iteration     3001/  128728 | consumed samples:        48016 | consumed tokens:     98336768 | elapsed time per iteration (s): 77.11 | learning rate: 1.573E-05 | global batch size:    16 | lm loss: 5.976147E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.208 | TFLOPs: 1.59 |
[default7]: iteration     3002/  128728 | consumed samples:        48032 | consumed tokens:     98369536 | elapsed time per iteration (s): 15.17 | learning rate: 1.574E-05 | global batch size:    16 | lm loss: 5.967981E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3003/  128728 | consumed samples:        48048 | consumed tokens:     98402304 | elapsed time per iteration (s): 15.19 | learning rate: 1.574E-05 | global batch size:    16 | lm loss: 5.914820E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3004/  128728 | consumed samples:        48064 | consumed tokens:     98435072 | elapsed time per iteration (s): 15.24 | learning rate: 1.575E-05 | global batch size:    16 | lm loss: 5.897120E+00 | grad norm: 0.624 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3005/  128728 | consumed samples:        48080 | consumed tokens:     98467840 | elapsed time per iteration (s): 15.25 | learning rate: 1.575E-05 | global batch size:    16 | lm loss: 5.955826E+00 | grad norm: 1.228 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     3006/  128728 | consumed samples:        48096 | consumed tokens:     98500608 | elapsed time per iteration (s): 15.23 | learning rate: 1.576E-05 | global batch size:    16 | lm loss: 5.987964E+00 | grad norm: 0.892 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3007/  128728 | consumed samples:        48112 | consumed tokens:     98533376 | elapsed time per iteration (s): 15.20 | learning rate: 1.577E-05 | global batch size:    16 | lm loss: 5.960895E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3008/  128728 | consumed samples:        48128 | consumed tokens:     98566144 | elapsed time per iteration (s): 15.21 | learning rate: 1.577E-05 | global batch size:    16 | lm loss: 5.917996E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3009/  128728 | consumed samples:        48144 | consumed tokens:     98598912 | elapsed time per iteration (s): 15.17 | learning rate: 1.578E-05 | global batch size:    16 | lm loss: 5.884965E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3010/  128728 | consumed samples:        48160 | consumed tokens:     98631680 | elapsed time per iteration (s): 15.23 | learning rate: 1.578E-05 | global batch size:    16 | lm loss: 5.855898E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3011/  128728 | consumed samples:        48176 | consumed tokens:     98664448 | elapsed time per iteration (s): 15.19 | learning rate: 1.579E-05 | global batch size:    16 | lm loss: 6.119870E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3012/  128728 | consumed samples:        48192 | consumed tokens:     98697216 | elapsed time per iteration (s): 15.17 | learning rate: 1.579E-05 | global batch size:    16 | lm loss: 6.020233E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3013/  128728 | consumed samples:        48208 | consumed tokens:     98729984 | elapsed time per iteration (s): 15.24 | learning rate: 1.580E-05 | global batch size:    16 | lm loss: 6.022349E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3014/  128728 | consumed samples:        48224 | consumed tokens:     98762752 | elapsed time per iteration (s): 15.20 | learning rate: 1.580E-05 | global batch size:    16 | lm loss: 5.822513E+00 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3015/  128728 | consumed samples:        48240 | consumed tokens:     98795520 | elapsed time per iteration (s): 15.22 | learning rate: 1.581E-05 | global batch size:    16 | lm loss: 5.816571E+00 | grad norm: 0.653 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3016/  128728 | consumed samples:        48256 | consumed tokens:     98828288 | elapsed time per iteration (s): 15.22 | learning rate: 1.581E-05 | global batch size:    16 | lm loss: 5.939666E+00 | grad norm: 0.624 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3017/  128728 | consumed samples:        48272 | consumed tokens:     98861056 | elapsed time per iteration (s): 15.23 | learning rate: 1.582E-05 | global batch size:    16 | lm loss: 5.874893E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3018/  128728 | consumed samples:        48288 | consumed tokens:     98893824 | elapsed time per iteration (s): 15.21 | learning rate: 1.582E-05 | global batch size:    16 | lm loss: 6.413375E+00 | grad norm: 0.891 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3019/  128728 | consumed samples:        48304 | consumed tokens:     98926592 | elapsed time per iteration (s): 15.23 | learning rate: 1.583E-05 | global batch size:    16 | lm loss: 5.910774E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3020/  128728 | consumed samples:        48320 | consumed tokens:     98959360 | elapsed time per iteration (s): 15.28 | learning rate: 1.583E-05 | global batch size:    16 | lm loss: 5.987436E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     3021/  128728 | consumed samples:        48336 | consumed tokens:     98992128 | elapsed time per iteration (s): 15.20 | learning rate: 1.584E-05 | global batch size:    16 | lm loss: 5.816168E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3022/  128728 | consumed samples:        48352 | consumed tokens:     99024896 | elapsed time per iteration (s): 15.21 | learning rate: 1.584E-05 | global batch size:    16 | lm loss: 6.000154E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3023/  128728 | consumed samples:        48368 | consumed tokens:     99057664 | elapsed time per iteration (s): 15.22 | learning rate: 1.585E-05 | global batch size:    16 | lm loss: 6.203218E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3024/  128728 | consumed samples:        48384 | consumed tokens:     99090432 | elapsed time per iteration (s): 15.23 | learning rate: 1.585E-05 | global batch size:    16 | lm loss: 5.741538E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3025/  128728 | consumed samples:        48400 | consumed tokens:     99123200 | elapsed time per iteration (s): 15.22 | learning rate: 1.586E-05 | global batch size:    16 | lm loss: 6.002611E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3026/  128728 | consumed samples:        48416 | consumed tokens:     99155968 | elapsed time per iteration (s): 15.23 | learning rate: 1.586E-05 | global batch size:    16 | lm loss: 5.864077E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3027/  128728 | consumed samples:        48432 | consumed tokens:     99188736 | elapsed time per iteration (s): 15.22 | learning rate: 1.587E-05 | global batch size:    16 | lm loss: 5.858949E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3028/  128728 | consumed samples:        48448 | consumed tokens:     99221504 | elapsed time per iteration (s): 15.23 | learning rate: 1.588E-05 | global batch size:    16 | lm loss: 5.833308E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3029/  128728 | consumed samples:        48464 | consumed tokens:     99254272 | elapsed time per iteration (s): 15.19 | learning rate: 1.588E-05 | global batch size:    16 | lm loss: 6.036957E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3030/  128728 | consumed samples:        48480 | consumed tokens:     99287040 | elapsed time per iteration (s): 15.22 | learning rate: 1.589E-05 | global batch size:    16 | lm loss: 5.693832E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3031/  128728 | consumed samples:        48496 | consumed tokens:     99319808 | elapsed time per iteration (s): 15.23 | learning rate: 1.589E-05 | global batch size:    16 | lm loss: 6.020626E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3032/  128728 | consumed samples:        48512 | consumed tokens:     99352576 | elapsed time per iteration (s): 15.23 | learning rate: 1.590E-05 | global batch size:    16 | lm loss: 5.864520E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3033/  128728 | consumed samples:        48528 | consumed tokens:     99385344 | elapsed time per iteration (s): 15.23 | learning rate: 1.590E-05 | global batch size:    16 | lm loss: 5.856801E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3034/  128728 | consumed samples:        48544 | consumed tokens:     99418112 | elapsed time per iteration (s): 15.22 | learning rate: 1.591E-05 | global batch size:    16 | lm loss: 5.953742E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3035/  128728 | consumed samples:        48560 | consumed tokens:     99450880 | elapsed time per iteration (s): 15.24 | learning rate: 1.591E-05 | global batch size:    16 | lm loss: 5.934213E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3036/  128728 | consumed samples:        48576 | consumed tokens:     99483648 | elapsed time per iteration (s): 15.26 | learning rate: 1.592E-05 | global batch size:    16 | lm loss: 5.850968E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3037/  128728 | consumed samples:        48592 | consumed tokens:     99516416 | elapsed time per iteration (s): 15.22 | learning rate: 1.592E-05 | global batch size:    16 | lm loss: 6.049872E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3038/  128728 | consumed samples:        48608 | consumed tokens:     99549184 | elapsed time per iteration (s): 15.21 | learning rate: 1.593E-05 | global batch size:    16 | lm loss: 5.903430E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3039/  128728 | consumed samples:        48624 | consumed tokens:     99581952 | elapsed time per iteration (s): 15.24 | learning rate: 1.593E-05 | global batch size:    16 | lm loss: 6.003817E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3040/  128728 | consumed samples:        48640 | consumed tokens:     99614720 | elapsed time per iteration (s): 15.25 | learning rate: 1.594E-05 | global batch size:    16 | lm loss: 5.985853E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3041/  128728 | consumed samples:        48656 | consumed tokens:     99647488 | elapsed time per iteration (s): 15.24 | learning rate: 1.594E-05 | global batch size:    16 | lm loss: 5.714824E+00 | grad norm: 0.649 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3042/  128728 | consumed samples:        48672 | consumed tokens:     99680256 | elapsed time per iteration (s): 15.24 | learning rate: 1.595E-05 | global batch size:    16 | lm loss: 6.073945E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3043/  128728 | consumed samples:        48688 | consumed tokens:     99713024 | elapsed time per iteration (s): 15.23 | learning rate: 1.595E-05 | global batch size:    16 | lm loss: 5.912009E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3044/  128728 | consumed samples:        48704 | consumed tokens:     99745792 | elapsed time per iteration (s): 15.22 | learning rate: 1.596E-05 | global batch size:    16 | lm loss: 5.936331E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3045/  128728 | consumed samples:        48720 | consumed tokens:     99778560 | elapsed time per iteration (s): 15.29 | learning rate: 1.596E-05 | global batch size:    16 | lm loss: 5.901987E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     3046/  128728 | consumed samples:        48736 | consumed tokens:     99811328 | elapsed time per iteration (s): 15.23 | learning rate: 1.597E-05 | global batch size:    16 | lm loss: 5.832729E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3047/  128728 | consumed samples:        48752 | consumed tokens:     99844096 | elapsed time per iteration (s): 15.20 | learning rate: 1.598E-05 | global batch size:    16 | lm loss: 6.031357E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3048/  128728 | consumed samples:        48768 | consumed tokens:     99876864 | elapsed time per iteration (s): 15.25 | learning rate: 1.598E-05 | global batch size:    16 | lm loss: 5.672740E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3049/  128728 | consumed samples:        48784 | consumed tokens:     99909632 | elapsed time per iteration (s): 15.28 | learning rate: 1.599E-05 | global batch size:    16 | lm loss: 6.076912E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     3050/  128728 | consumed samples:        48800 | consumed tokens:     99942400 | elapsed time per iteration (s): 15.20 | learning rate: 1.599E-05 | global batch size:    16 | lm loss: 5.738910E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3051/  128728 | consumed samples:        48816 | consumed tokens:     99975168 | elapsed time per iteration (s): 15.28 | learning rate: 1.600E-05 | global batch size:    16 | lm loss: 5.781271E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     3052/  128728 | consumed samples:        48832 | consumed tokens:    100007936 | elapsed time per iteration (s): 15.20 | learning rate: 1.600E-05 | global batch size:    16 | lm loss: 5.867689E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3053/  128728 | consumed samples:        48848 | consumed tokens:    100040704 | elapsed time per iteration (s): 15.24 | learning rate: 1.601E-05 | global batch size:    16 | lm loss: 5.961505E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3054/  128728 | consumed samples:        48864 | consumed tokens:    100073472 | elapsed time per iteration (s): 15.23 | learning rate: 1.601E-05 | global batch size:    16 | lm loss: 6.001435E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3055/  128728 | consumed samples:        48880 | consumed tokens:    100106240 | elapsed time per iteration (s): 15.25 | learning rate: 1.602E-05 | global batch size:    16 | lm loss: 5.903691E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3056/  128728 | consumed samples:        48896 | consumed tokens:    100139008 | elapsed time per iteration (s): 15.22 | learning rate: 1.602E-05 | global batch size:    16 | lm loss: 5.782066E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3057/  128728 | consumed samples:        48912 | consumed tokens:    100171776 | elapsed time per iteration (s): 15.26 | learning rate: 1.603E-05 | global batch size:    16 | lm loss: 5.891513E+00 | grad norm: 1.014 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     3058/  128728 | consumed samples:        48928 | consumed tokens:    100204544 | elapsed time per iteration (s): 15.22 | learning rate: 1.603E-05 | global batch size:    16 | lm loss: 5.959929E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3059/  128728 | consumed samples:        48944 | consumed tokens:    100237312 | elapsed time per iteration (s): 15.19 | learning rate: 1.604E-05 | global batch size:    16 | lm loss: 5.808131E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3060/  128728 | consumed samples:        48960 | consumed tokens:    100270080 | elapsed time per iteration (s): 15.23 | learning rate: 1.604E-05 | global batch size:    16 | lm loss: 5.985348E+00 | grad norm: 0.663 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3061/  128728 | consumed samples:        48976 | consumed tokens:    100302848 | elapsed time per iteration (s): 15.23 | learning rate: 1.605E-05 | global batch size:    16 | lm loss: 5.834366E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3062/  128728 | consumed samples:        48992 | consumed tokens:    100335616 | elapsed time per iteration (s): 15.23 | learning rate: 1.605E-05 | global batch size:    16 | lm loss: 5.852916E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3063/  128728 | consumed samples:        49008 | consumed tokens:    100368384 | elapsed time per iteration (s): 15.24 | learning rate: 1.606E-05 | global batch size:    16 | lm loss: 6.065343E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3064/  128728 | consumed samples:        49024 | consumed tokens:    100401152 | elapsed time per iteration (s): 15.21 | learning rate: 1.606E-05 | global batch size:    16 | lm loss: 5.798189E+00 | grad norm: 1.131 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3065/  128728 | consumed samples:        49040 | consumed tokens:    100433920 | elapsed time per iteration (s): 15.20 | learning rate: 1.607E-05 | global batch size:    16 | lm loss: 5.934473E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3066/  128728 | consumed samples:        49056 | consumed tokens:    100466688 | elapsed time per iteration (s): 15.19 | learning rate: 1.607E-05 | global batch size:    16 | lm loss: 5.927220E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3067/  128728 | consumed samples:        49072 | consumed tokens:    100499456 | elapsed time per iteration (s): 15.25 | learning rate: 1.608E-05 | global batch size:    16 | lm loss: 5.879972E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3068/  128728 | consumed samples:        49088 | consumed tokens:    100532224 | elapsed time per iteration (s): 15.23 | learning rate: 1.609E-05 | global batch size:    16 | lm loss: 5.720819E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3069/  128728 | consumed samples:        49104 | consumed tokens:    100564992 | elapsed time per iteration (s): 15.22 | learning rate: 1.609E-05 | global batch size:    16 | lm loss: 6.000784E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3070/  128728 | consumed samples:        49120 | consumed tokens:    100597760 | elapsed time per iteration (s): 15.23 | learning rate: 1.610E-05 | global batch size:    16 | lm loss: 5.922574E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3071/  128728 | consumed samples:        49136 | consumed tokens:    100630528 | elapsed time per iteration (s): 15.21 | learning rate: 1.610E-05 | global batch size:    16 | lm loss: 5.932800E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3072/  128728 | consumed samples:        49152 | consumed tokens:    100663296 | elapsed time per iteration (s): 15.24 | learning rate: 1.611E-05 | global batch size:    16 | lm loss: 5.778855E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3073/  128728 | consumed samples:        49168 | consumed tokens:    100696064 | elapsed time per iteration (s): 15.20 | learning rate: 1.611E-05 | global batch size:    16 | lm loss: 5.972422E+00 | grad norm: 0.669 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3074/  128728 | consumed samples:        49184 | consumed tokens:    100728832 | elapsed time per iteration (s): 15.22 | learning rate: 1.612E-05 | global batch size:    16 | lm loss: 5.960331E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3075/  128728 | consumed samples:        49200 | consumed tokens:    100761600 | elapsed time per iteration (s): 15.22 | learning rate: 1.612E-05 | global batch size:    16 | lm loss: 5.690085E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3076/  128728 | consumed samples:        49216 | consumed tokens:    100794368 | elapsed time per iteration (s): 15.23 | learning rate: 1.613E-05 | global batch size:    16 | lm loss: 5.920603E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3077/  128728 | consumed samples:        49232 | consumed tokens:    100827136 | elapsed time per iteration (s): 15.24 | learning rate: 1.613E-05 | global batch size:    16 | lm loss: 6.182066E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3078/  128728 | consumed samples:        49248 | consumed tokens:    100859904 | elapsed time per iteration (s): 15.25 | learning rate: 1.614E-05 | global batch size:    16 | lm loss: 5.818954E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3079/  128728 | consumed samples:        49264 | consumed tokens:    100892672 | elapsed time per iteration (s): 15.20 | learning rate: 1.614E-05 | global batch size:    16 | lm loss: 5.869929E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3080/  128728 | consumed samples:        49280 | consumed tokens:    100925440 | elapsed time per iteration (s): 15.23 | learning rate: 1.615E-05 | global batch size:    16 | lm loss: 5.978646E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3081/  128728 | consumed samples:        49296 | consumed tokens:    100958208 | elapsed time per iteration (s): 15.21 | learning rate: 1.615E-05 | global batch size:    16 | lm loss: 5.753775E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3082/  128728 | consumed samples:        49312 | consumed tokens:    100990976 | elapsed time per iteration (s): 15.23 | learning rate: 1.616E-05 | global batch size:    16 | lm loss: 5.812270E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3083/  128728 | consumed samples:        49328 | consumed tokens:    101023744 | elapsed time per iteration (s): 15.20 | learning rate: 1.616E-05 | global batch size:    16 | lm loss: 5.786464E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3084/  128728 | consumed samples:        49344 | consumed tokens:    101056512 | elapsed time per iteration (s): 15.24 | learning rate: 1.617E-05 | global batch size:    16 | lm loss: 5.646963E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3085/  128728 | consumed samples:        49360 | consumed tokens:    101089280 | elapsed time per iteration (s): 15.22 | learning rate: 1.617E-05 | global batch size:    16 | lm loss: 6.141891E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3086/  128728 | consumed samples:        49376 | consumed tokens:    101122048 | elapsed time per iteration (s): 15.23 | learning rate: 1.618E-05 | global batch size:    16 | lm loss: 5.876431E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3087/  128728 | consumed samples:        49392 | consumed tokens:    101154816 | elapsed time per iteration (s): 15.22 | learning rate: 1.618E-05 | global batch size:    16 | lm loss: 5.696089E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3088/  128728 | consumed samples:        49408 | consumed tokens:    101187584 | elapsed time per iteration (s): 15.23 | learning rate: 1.619E-05 | global batch size:    16 | lm loss: 5.823549E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3089/  128728 | consumed samples:        49424 | consumed tokens:    101220352 | elapsed time per iteration (s): 15.22 | learning rate: 1.620E-05 | global batch size:    16 | lm loss: 5.682597E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3090/  128728 | consumed samples:        49440 | consumed tokens:    101253120 | elapsed time per iteration (s): 15.22 | learning rate: 1.620E-05 | global batch size:    16 | lm loss: 5.883008E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3091/  128728 | consumed samples:        49456 | consumed tokens:    101285888 | elapsed time per iteration (s): 15.21 | learning rate: 1.621E-05 | global batch size:    16 | lm loss: 5.790089E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3092/  128728 | consumed samples:        49472 | consumed tokens:    101318656 | elapsed time per iteration (s): 15.22 | learning rate: 1.621E-05 | global batch size:    16 | lm loss: 6.044188E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3093/  128728 | consumed samples:        49488 | consumed tokens:    101351424 | elapsed time per iteration (s): 15.24 | learning rate: 1.622E-05 | global batch size:    16 | lm loss: 5.811264E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3094/  128728 | consumed samples:        49504 | consumed tokens:    101384192 | elapsed time per iteration (s): 15.23 | learning rate: 1.622E-05 | global batch size:    16 | lm loss: 5.842374E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3095/  128728 | consumed samples:        49520 | consumed tokens:    101416960 | elapsed time per iteration (s): 15.19 | learning rate: 1.623E-05 | global batch size:    16 | lm loss: 5.868669E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3096/  128728 | consumed samples:        49536 | consumed tokens:    101449728 | elapsed time per iteration (s): 15.18 | learning rate: 1.623E-05 | global batch size:    16 | lm loss: 5.716575E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3097/  128728 | consumed samples:        49552 | consumed tokens:    101482496 | elapsed time per iteration (s): 15.17 | learning rate: 1.624E-05 | global batch size:    16 | lm loss: 5.883733E+00 | grad norm: 0.672 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3098/  128728 | consumed samples:        49568 | consumed tokens:    101515264 | elapsed time per iteration (s): 15.22 | learning rate: 1.624E-05 | global batch size:    16 | lm loss: 5.890719E+00 | grad norm: 1.230 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3099/  128728 | consumed samples:        49584 | consumed tokens:    101548032 | elapsed time per iteration (s): 15.23 | learning rate: 1.625E-05 | global batch size:    16 | lm loss: 5.930487E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3100/  128728 | consumed samples:        49600 | consumed tokens:    101580800 | elapsed time per iteration (s): 15.23 | learning rate: 1.625E-05 | global batch size:    16 | lm loss: 5.982317E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3101/  128728 | consumed samples:        49616 | consumed tokens:    101613568 | elapsed time per iteration (s): 15.24 | learning rate: 1.626E-05 | global batch size:    16 | lm loss: 5.826386E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3102/  128728 | consumed samples:        49632 | consumed tokens:    101646336 | elapsed time per iteration (s): 15.23 | learning rate: 1.626E-05 | global batch size:    16 | lm loss: 5.526955E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3103/  128728 | consumed samples:        49648 | consumed tokens:    101679104 | elapsed time per iteration (s): 15.24 | learning rate: 1.627E-05 | global batch size:    16 | lm loss: 5.959418E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3104/  128728 | consumed samples:        49664 | consumed tokens:    101711872 | elapsed time per iteration (s): 15.22 | learning rate: 1.627E-05 | global batch size:    16 | lm loss: 5.816753E+00 | grad norm: 2.192 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3105/  128728 | consumed samples:        49680 | consumed tokens:    101744640 | elapsed time per iteration (s): 15.24 | learning rate: 1.628E-05 | global batch size:    16 | lm loss: 5.825230E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3106/  128728 | consumed samples:        49696 | consumed tokens:    101777408 | elapsed time per iteration (s): 15.23 | learning rate: 1.628E-05 | global batch size:    16 | lm loss: 6.096361E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3107/  128728 | consumed samples:        49712 | consumed tokens:    101810176 | elapsed time per iteration (s): 15.23 | learning rate: 1.629E-05 | global batch size:    16 | lm loss: 5.705378E+00 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3108/  128728 | consumed samples:        49728 | consumed tokens:    101842944 | elapsed time per iteration (s): 15.20 | learning rate: 1.629E-05 | global batch size:    16 | lm loss: 5.947734E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3109/  128728 | consumed samples:        49744 | consumed tokens:    101875712 | elapsed time per iteration (s): 15.24 | learning rate: 1.630E-05 | global batch size:    16 | lm loss: 5.886482E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3110/  128728 | consumed samples:        49760 | consumed tokens:    101908480 | elapsed time per iteration (s): 15.17 | learning rate: 1.631E-05 | global batch size:    16 | lm loss: 5.945197E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3111/  128728 | consumed samples:        49776 | consumed tokens:    101941248 | elapsed time per iteration (s): 15.15 | learning rate: 1.631E-05 | global batch size:    16 | lm loss: 5.768273E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3112/  128728 | consumed samples:        49792 | consumed tokens:    101974016 | elapsed time per iteration (s): 15.16 | learning rate: 1.632E-05 | global batch size:    16 | lm loss: 5.848940E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3113/  128728 | consumed samples:        49808 | consumed tokens:    102006784 | elapsed time per iteration (s): 15.18 | learning rate: 1.632E-05 | global batch size:    16 | lm loss: 5.794857E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3114/  128728 | consumed samples:        49824 | consumed tokens:    102039552 | elapsed time per iteration (s): 15.20 | learning rate: 1.633E-05 | global batch size:    16 | lm loss: 5.761194E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3115/  128728 | consumed samples:        49840 | consumed tokens:    102072320 | elapsed time per iteration (s): 15.23 | learning rate: 1.633E-05 | global batch size:    16 | lm loss: 5.966802E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3116/  128728 | consumed samples:        49856 | consumed tokens:    102105088 | elapsed time per iteration (s): 15.23 | learning rate: 1.634E-05 | global batch size:    16 | lm loss: 5.814324E+00 | grad norm: 2.284 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3117/  128728 | consumed samples:        49872 | consumed tokens:    102137856 | elapsed time per iteration (s): 15.23 | learning rate: 1.634E-05 | global batch size:    16 | lm loss: 5.953111E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3118/  128728 | consumed samples:        49888 | consumed tokens:    102170624 | elapsed time per iteration (s): 15.19 | learning rate: 1.635E-05 | global batch size:    16 | lm loss: 5.790831E+00 | grad norm: 0.653 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3119/  128728 | consumed samples:        49904 | consumed tokens:    102203392 | elapsed time per iteration (s): 15.16 | learning rate: 1.635E-05 | global batch size:    16 | lm loss: 5.866699E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3120/  128728 | consumed samples:        49920 | consumed tokens:    102236160 | elapsed time per iteration (s): 15.23 | learning rate: 1.636E-05 | global batch size:    16 | lm loss: 5.997011E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3121/  128728 | consumed samples:        49936 | consumed tokens:    102268928 | elapsed time per iteration (s): 15.19 | learning rate: 1.636E-05 | global batch size:    16 | lm loss: 5.930976E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3122/  128728 | consumed samples:        49952 | consumed tokens:    102301696 | elapsed time per iteration (s): 15.17 | learning rate: 1.637E-05 | global batch size:    16 | lm loss: 5.875608E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3123/  128728 | consumed samples:        49968 | consumed tokens:    102334464 | elapsed time per iteration (s): 15.16 | learning rate: 1.637E-05 | global batch size:    16 | lm loss: 5.796740E+00 | grad norm: 1.342 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3124/  128728 | consumed samples:        49984 | consumed tokens:    102367232 | elapsed time per iteration (s): 15.16 | learning rate: 1.638E-05 | global batch size:    16 | lm loss: 5.692341E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3125/  128728 | consumed samples:        50000 | consumed tokens:    102400000 | elapsed time per iteration (s): 15.20 | learning rate: 1.638E-05 | global batch size:    16 | lm loss: 5.906222E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3126/  128728 | consumed samples:        50016 | consumed tokens:    102432768 | elapsed time per iteration (s): 15.15 | learning rate: 1.639E-05 | global batch size:    16 | lm loss: 5.771677E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3127/  128728 | consumed samples:        50032 | consumed tokens:    102465536 | elapsed time per iteration (s): 15.21 | learning rate: 1.639E-05 | global batch size:    16 | lm loss: 5.853363E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3128/  128728 | consumed samples:        50048 | consumed tokens:    102498304 | elapsed time per iteration (s): 15.13 | learning rate: 1.640E-05 | global batch size:    16 | lm loss: 5.964828E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.10 |
[default7]: iteration     3129/  128728 | consumed samples:        50064 | consumed tokens:    102531072 | elapsed time per iteration (s): 15.21 | learning rate: 1.641E-05 | global batch size:    16 | lm loss: 5.986765E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3130/  128728 | consumed samples:        50080 | consumed tokens:    102563840 | elapsed time per iteration (s): 15.13 | learning rate: 1.641E-05 | global batch size:    16 | lm loss: 5.758943E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.058 | TFLOPs: 8.10 |
[default7]: iteration     3131/  128728 | consumed samples:        50096 | consumed tokens:    102596608 | elapsed time per iteration (s): 15.21 | learning rate: 1.642E-05 | global batch size:    16 | lm loss: 5.953258E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3132/  128728 | consumed samples:        50112 | consumed tokens:    102629376 | elapsed time per iteration (s): 15.20 | learning rate: 1.642E-05 | global batch size:    16 | lm loss: 5.834059E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3133/  128728 | consumed samples:        50128 | consumed tokens:    102662144 | elapsed time per iteration (s): 15.23 | learning rate: 1.643E-05 | global batch size:    16 | lm loss: 5.778453E+00 | grad norm: 0.640 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3134/  128728 | consumed samples:        50144 | consumed tokens:    102694912 | elapsed time per iteration (s): 15.19 | learning rate: 1.643E-05 | global batch size:    16 | lm loss: 5.798711E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3135/  128728 | consumed samples:        50160 | consumed tokens:    102727680 | elapsed time per iteration (s): 15.21 | learning rate: 1.644E-05 | global batch size:    16 | lm loss: 5.807882E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3136/  128728 | consumed samples:        50176 | consumed tokens:    102760448 | elapsed time per iteration (s): 15.14 | learning rate: 1.644E-05 | global batch size:    16 | lm loss: 5.784853E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3137/  128728 | consumed samples:        50192 | consumed tokens:    102793216 | elapsed time per iteration (s): 15.15 | learning rate: 1.645E-05 | global batch size:    16 | lm loss: 5.705042E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3138/  128728 | consumed samples:        50208 | consumed tokens:    102825984 | elapsed time per iteration (s): 15.20 | learning rate: 1.645E-05 | global batch size:    16 | lm loss: 5.907452E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3139/  128728 | consumed samples:        50224 | consumed tokens:    102858752 | elapsed time per iteration (s): 15.13 | learning rate: 1.646E-05 | global batch size:    16 | lm loss: 6.042287E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.10 |
[default7]: iteration     3140/  128728 | consumed samples:        50240 | consumed tokens:    102891520 | elapsed time per iteration (s): 15.23 | learning rate: 1.646E-05 | global batch size:    16 | lm loss: 5.736620E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3141/  128728 | consumed samples:        50256 | consumed tokens:    102924288 | elapsed time per iteration (s): 15.22 | learning rate: 1.647E-05 | global batch size:    16 | lm loss: 6.033116E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3142/  128728 | consumed samples:        50272 | consumed tokens:    102957056 | elapsed time per iteration (s): 15.17 | learning rate: 1.647E-05 | global batch size:    16 | lm loss: 5.729618E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3143/  128728 | consumed samples:        50288 | consumed tokens:    102989824 | elapsed time per iteration (s): 15.19 | learning rate: 1.648E-05 | global batch size:    16 | lm loss: 5.883410E+00 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3144/  128728 | consumed samples:        50304 | consumed tokens:    103022592 | elapsed time per iteration (s): 15.21 | learning rate: 1.648E-05 | global batch size:    16 | lm loss: 5.754305E+00 | grad norm: 0.914 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3145/  128728 | consumed samples:        50320 | consumed tokens:    103055360 | elapsed time per iteration (s): 15.21 | learning rate: 1.649E-05 | global batch size:    16 | lm loss: 5.893435E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3146/  128728 | consumed samples:        50336 | consumed tokens:    103088128 | elapsed time per iteration (s): 15.20 | learning rate: 1.649E-05 | global batch size:    16 | lm loss: 5.840903E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3147/  128728 | consumed samples:        50352 | consumed tokens:    103120896 | elapsed time per iteration (s): 15.25 | learning rate: 1.650E-05 | global batch size:    16 | lm loss: 5.732727E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3148/  128728 | consumed samples:        50368 | consumed tokens:    103153664 | elapsed time per iteration (s): 15.22 | learning rate: 1.650E-05 | global batch size:    16 | lm loss: 6.073945E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3149/  128728 | consumed samples:        50384 | consumed tokens:    103186432 | elapsed time per iteration (s): 15.14 | learning rate: 1.651E-05 | global batch size:    16 | lm loss: 5.885465E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3150/  128728 | consumed samples:        50400 | consumed tokens:    103219200 | elapsed time per iteration (s): 15.18 | learning rate: 1.652E-05 | global batch size:    16 | lm loss: 5.783937E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3151/  128728 | consumed samples:        50416 | consumed tokens:    103251968 | elapsed time per iteration (s): 15.15 | learning rate: 1.652E-05 | global batch size:    16 | lm loss: 5.913184E+00 | grad norm: 0.962 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3152/  128728 | consumed samples:        50432 | consumed tokens:    103284736 | elapsed time per iteration (s): 15.22 | learning rate: 1.653E-05 | global batch size:    16 | lm loss: 5.823668E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3153/  128728 | consumed samples:        50448 | consumed tokens:    103317504 | elapsed time per iteration (s): 15.14 | learning rate: 1.653E-05 | global batch size:    16 | lm loss: 5.867479E+00 | grad norm: 0.936 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3154/  128728 | consumed samples:        50464 | consumed tokens:    103350272 | elapsed time per iteration (s): 15.16 | learning rate: 1.654E-05 | global batch size:    16 | lm loss: 5.772203E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3155/  128728 | consumed samples:        50480 | consumed tokens:    103383040 | elapsed time per iteration (s): 15.16 | learning rate: 1.654E-05 | global batch size:    16 | lm loss: 5.679266E+00 | grad norm: 1.483 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3156/  128728 | consumed samples:        50496 | consumed tokens:    103415808 | elapsed time per iteration (s): 15.22 | learning rate: 1.655E-05 | global batch size:    16 | lm loss: 5.798676E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3157/  128728 | consumed samples:        50512 | consumed tokens:    103448576 | elapsed time per iteration (s): 15.16 | learning rate: 1.655E-05 | global batch size:    16 | lm loss: 5.913177E+00 | grad norm: 1.212 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3158/  128728 | consumed samples:        50528 | consumed tokens:    103481344 | elapsed time per iteration (s): 15.17 | learning rate: 1.656E-05 | global batch size:    16 | lm loss: 5.806971E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3159/  128728 | consumed samples:        50544 | consumed tokens:    103514112 | elapsed time per iteration (s): 15.24 | learning rate: 1.656E-05 | global batch size:    16 | lm loss: 5.890893E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3160/  128728 | consumed samples:        50560 | consumed tokens:    103546880 | elapsed time per iteration (s): 15.22 | learning rate: 1.657E-05 | global batch size:    16 | lm loss: 5.810333E+00 | grad norm: 1.109 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3161/  128728 | consumed samples:        50576 | consumed tokens:    103579648 | elapsed time per iteration (s): 15.14 | learning rate: 1.657E-05 | global batch size:    16 | lm loss: 5.901513E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3162/  128728 | consumed samples:        50592 | consumed tokens:    103612416 | elapsed time per iteration (s): 15.22 | learning rate: 1.658E-05 | global batch size:    16 | lm loss: 5.824885E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3163/  128728 | consumed samples:        50608 | consumed tokens:    103645184 | elapsed time per iteration (s): 15.20 | learning rate: 1.658E-05 | global batch size:    16 | lm loss: 5.806005E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3164/  128728 | consumed samples:        50624 | consumed tokens:    103677952 | elapsed time per iteration (s): 15.21 | learning rate: 1.659E-05 | global batch size:    16 | lm loss: 5.998919E+00 | grad norm: 0.893 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3165/  128728 | consumed samples:        50640 | consumed tokens:    103710720 | elapsed time per iteration (s): 15.20 | learning rate: 1.659E-05 | global batch size:    16 | lm loss: 5.667655E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3166/  128728 | consumed samples:        50656 | consumed tokens:    103743488 | elapsed time per iteration (s): 15.20 | learning rate: 1.660E-05 | global batch size:    16 | lm loss: 5.927030E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3167/  128728 | consumed samples:        50672 | consumed tokens:    103776256 | elapsed time per iteration (s): 15.22 | learning rate: 1.660E-05 | global batch size:    16 | lm loss: 5.922341E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3168/  128728 | consumed samples:        50688 | consumed tokens:    103809024 | elapsed time per iteration (s): 15.20 | learning rate: 1.661E-05 | global batch size:    16 | lm loss: 5.802799E+00 | grad norm: 0.894 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3169/  128728 | consumed samples:        50704 | consumed tokens:    103841792 | elapsed time per iteration (s): 15.20 | learning rate: 1.661E-05 | global batch size:    16 | lm loss: 5.817975E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3170/  128728 | consumed samples:        50720 | consumed tokens:    103874560 | elapsed time per iteration (s): 15.18 | learning rate: 1.662E-05 | global batch size:    16 | lm loss: 6.009351E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3171/  128728 | consumed samples:        50736 | consumed tokens:    103907328 | elapsed time per iteration (s): 15.19 | learning rate: 1.663E-05 | global batch size:    16 | lm loss: 5.650498E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3172/  128728 | consumed samples:        50752 | consumed tokens:    103940096 | elapsed time per iteration (s): 15.17 | learning rate: 1.663E-05 | global batch size:    16 | lm loss: 5.935712E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3173/  128728 | consumed samples:        50768 | consumed tokens:    103972864 | elapsed time per iteration (s): 15.22 | learning rate: 1.664E-05 | global batch size:    16 | lm loss: 5.931666E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3174/  128728 | consumed samples:        50784 | consumed tokens:    104005632 | elapsed time per iteration (s): 15.19 | learning rate: 1.664E-05 | global batch size:    16 | lm loss: 5.748640E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3175/  128728 | consumed samples:        50800 | consumed tokens:    104038400 | elapsed time per iteration (s): 15.19 | learning rate: 1.665E-05 | global batch size:    16 | lm loss: 5.910668E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3176/  128728 | consumed samples:        50816 | consumed tokens:    104071168 | elapsed time per iteration (s): 15.22 | learning rate: 1.665E-05 | global batch size:    16 | lm loss: 5.654323E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3177/  128728 | consumed samples:        50832 | consumed tokens:    104103936 | elapsed time per iteration (s): 15.14 | learning rate: 1.666E-05 | global batch size:    16 | lm loss: 5.842155E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3178/  128728 | consumed samples:        50848 | consumed tokens:    104136704 | elapsed time per iteration (s): 15.21 | learning rate: 1.666E-05 | global batch size:    16 | lm loss: 5.938166E+00 | grad norm: 0.997 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3179/  128728 | consumed samples:        50864 | consumed tokens:    104169472 | elapsed time per iteration (s): 15.21 | learning rate: 1.667E-05 | global batch size:    16 | lm loss: 5.896249E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3180/  128728 | consumed samples:        50880 | consumed tokens:    104202240 | elapsed time per iteration (s): 15.19 | learning rate: 1.667E-05 | global batch size:    16 | lm loss: 5.735763E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3181/  128728 | consumed samples:        50896 | consumed tokens:    104235008 | elapsed time per iteration (s): 15.21 | learning rate: 1.668E-05 | global batch size:    16 | lm loss: 6.049779E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3182/  128728 | consumed samples:        50912 | consumed tokens:    104267776 | elapsed time per iteration (s): 15.21 | learning rate: 1.668E-05 | global batch size:    16 | lm loss: 5.771222E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3183/  128728 | consumed samples:        50928 | consumed tokens:    104300544 | elapsed time per iteration (s): 15.21 | learning rate: 1.669E-05 | global batch size:    16 | lm loss: 6.001236E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3184/  128728 | consumed samples:        50944 | consumed tokens:    104333312 | elapsed time per iteration (s): 15.20 | learning rate: 1.669E-05 | global batch size:    16 | lm loss: 5.754385E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3185/  128728 | consumed samples:        50960 | consumed tokens:    104366080 | elapsed time per iteration (s): 15.21 | learning rate: 1.670E-05 | global batch size:    16 | lm loss: 5.952290E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3186/  128728 | consumed samples:        50976 | consumed tokens:    104398848 | elapsed time per iteration (s): 15.22 | learning rate: 1.670E-05 | global batch size:    16 | lm loss: 5.944228E+00 | grad norm: 1.274 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3187/  128728 | consumed samples:        50992 | consumed tokens:    104431616 | elapsed time per iteration (s): 15.18 | learning rate: 1.671E-05 | global batch size:    16 | lm loss: 5.856114E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3188/  128728 | consumed samples:        51008 | consumed tokens:    104464384 | elapsed time per iteration (s): 15.21 | learning rate: 1.671E-05 | global batch size:    16 | lm loss: 5.799392E+00 | grad norm: 0.639 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3189/  128728 | consumed samples:        51024 | consumed tokens:    104497152 | elapsed time per iteration (s): 15.18 | learning rate: 1.672E-05 | global batch size:    16 | lm loss: 5.693764E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3190/  128728 | consumed samples:        51040 | consumed tokens:    104529920 | elapsed time per iteration (s): 15.22 | learning rate: 1.672E-05 | global batch size:    16 | lm loss: 5.993411E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3191/  128728 | consumed samples:        51056 | consumed tokens:    104562688 | elapsed time per iteration (s): 15.20 | learning rate: 1.673E-05 | global batch size:    16 | lm loss: 5.842443E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3192/  128728 | consumed samples:        51072 | consumed tokens:    104595456 | elapsed time per iteration (s): 15.22 | learning rate: 1.674E-05 | global batch size:    16 | lm loss: 5.879288E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3193/  128728 | consumed samples:        51088 | consumed tokens:    104628224 | elapsed time per iteration (s): 15.21 | learning rate: 1.674E-05 | global batch size:    16 | lm loss: 5.917938E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3194/  128728 | consumed samples:        51104 | consumed tokens:    104660992 | elapsed time per iteration (s): 15.20 | learning rate: 1.675E-05 | global batch size:    16 | lm loss: 5.804705E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3195/  128728 | consumed samples:        51120 | consumed tokens:    104693760 | elapsed time per iteration (s): 15.18 | learning rate: 1.675E-05 | global batch size:    16 | lm loss: 5.770677E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3196/  128728 | consumed samples:        51136 | consumed tokens:    104726528 | elapsed time per iteration (s): 15.22 | learning rate: 1.676E-05 | global batch size:    16 | lm loss: 5.813903E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3197/  128728 | consumed samples:        51152 | consumed tokens:    104759296 | elapsed time per iteration (s): 15.20 | learning rate: 1.676E-05 | global batch size:    16 | lm loss: 5.794953E+00 | grad norm: 1.219 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3198/  128728 | consumed samples:        51168 | consumed tokens:    104792064 | elapsed time per iteration (s): 15.21 | learning rate: 1.677E-05 | global batch size:    16 | lm loss: 5.620133E+00 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3199/  128728 | consumed samples:        51184 | consumed tokens:    104824832 | elapsed time per iteration (s): 15.20 | learning rate: 1.677E-05 | global batch size:    16 | lm loss: 5.942338E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3200/  128728 | consumed samples:        51200 | consumed tokens:    104857600 | elapsed time per iteration (s): 15.23 | learning rate: 1.678E-05 | global batch size:    16 | lm loss: 5.729494E+00 | grad norm: 0.644 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3201/  128728 | consumed samples:        51216 | consumed tokens:    104890368 | elapsed time per iteration (s): 15.19 | learning rate: 1.678E-05 | global batch size:    16 | lm loss: 5.862929E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3202/  128728 | consumed samples:        51232 | consumed tokens:    104923136 | elapsed time per iteration (s): 15.20 | learning rate: 1.679E-05 | global batch size:    16 | lm loss: 5.847036E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3203/  128728 | consumed samples:        51248 | consumed tokens:    104955904 | elapsed time per iteration (s): 15.21 | learning rate: 1.679E-05 | global batch size:    16 | lm loss: 5.800924E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3204/  128728 | consumed samples:        51264 | consumed tokens:    104988672 | elapsed time per iteration (s): 15.16 | learning rate: 1.680E-05 | global batch size:    16 | lm loss: 5.901340E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3205/  128728 | consumed samples:        51280 | consumed tokens:    105021440 | elapsed time per iteration (s): 15.21 | learning rate: 1.680E-05 | global batch size:    16 | lm loss: 5.704348E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3206/  128728 | consumed samples:        51296 | consumed tokens:    105054208 | elapsed time per iteration (s): 15.18 | learning rate: 1.681E-05 | global batch size:    16 | lm loss: 5.754029E+00 | grad norm: 0.658 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3207/  128728 | consumed samples:        51312 | consumed tokens:    105086976 | elapsed time per iteration (s): 15.21 | learning rate: 1.681E-05 | global batch size:    16 | lm loss: 5.820123E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3208/  128728 | consumed samples:        51328 | consumed tokens:    105119744 | elapsed time per iteration (s): 15.22 | learning rate: 1.682E-05 | global batch size:    16 | lm loss: 5.841055E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3209/  128728 | consumed samples:        51344 | consumed tokens:    105152512 | elapsed time per iteration (s): 15.26 | learning rate: 1.682E-05 | global batch size:    16 | lm loss: 5.840108E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3210/  128728 | consumed samples:        51360 | consumed tokens:    105185280 | elapsed time per iteration (s): 15.22 | learning rate: 1.683E-05 | global batch size:    16 | lm loss: 5.684037E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3211/  128728 | consumed samples:        51376 | consumed tokens:    105218048 | elapsed time per iteration (s): 15.22 | learning rate: 1.683E-05 | global batch size:    16 | lm loss: 5.864146E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3212/  128728 | consumed samples:        51392 | consumed tokens:    105250816 | elapsed time per iteration (s): 15.23 | learning rate: 1.684E-05 | global batch size:    16 | lm loss: 5.662052E+00 | grad norm: 1.047 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3213/  128728 | consumed samples:        51408 | consumed tokens:    105283584 | elapsed time per iteration (s): 15.22 | learning rate: 1.685E-05 | global batch size:    16 | lm loss: 5.930824E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3214/  128728 | consumed samples:        51424 | consumed tokens:    105316352 | elapsed time per iteration (s): 15.23 | learning rate: 1.685E-05 | global batch size:    16 | lm loss: 5.820041E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3215/  128728 | consumed samples:        51440 | consumed tokens:    105349120 | elapsed time per iteration (s): 15.24 | learning rate: 1.686E-05 | global batch size:    16 | lm loss: 5.921219E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3216/  128728 | consumed samples:        51456 | consumed tokens:    105381888 | elapsed time per iteration (s): 15.23 | learning rate: 1.686E-05 | global batch size:    16 | lm loss: 5.814280E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3217/  128728 | consumed samples:        51472 | consumed tokens:    105414656 | elapsed time per iteration (s): 15.21 | learning rate: 1.687E-05 | global batch size:    16 | lm loss: 5.856856E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3218/  128728 | consumed samples:        51488 | consumed tokens:    105447424 | elapsed time per iteration (s): 15.23 | learning rate: 1.687E-05 | global batch size:    16 | lm loss: 5.942042E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3219/  128728 | consumed samples:        51504 | consumed tokens:    105480192 | elapsed time per iteration (s): 15.22 | learning rate: 1.688E-05 | global batch size:    16 | lm loss: 5.818819E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3220/  128728 | consumed samples:        51520 | consumed tokens:    105512960 | elapsed time per iteration (s): 15.22 | learning rate: 1.688E-05 | global batch size:    16 | lm loss: 5.934455E+00 | grad norm: 1.517 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3221/  128728 | consumed samples:        51536 | consumed tokens:    105545728 | elapsed time per iteration (s): 15.24 | learning rate: 1.689E-05 | global batch size:    16 | lm loss: 5.632852E+00 | grad norm: 1.036 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3222/  128728 | consumed samples:        51552 | consumed tokens:    105578496 | elapsed time per iteration (s): 15.17 | learning rate: 1.689E-05 | global batch size:    16 | lm loss: 5.690525E+00 | grad norm: 0.658 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3223/  128728 | consumed samples:        51568 | consumed tokens:    105611264 | elapsed time per iteration (s): 15.26 | learning rate: 1.690E-05 | global batch size:    16 | lm loss: 5.435367E+00 | grad norm: 1.042 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     3224/  128728 | consumed samples:        51584 | consumed tokens:    105644032 | elapsed time per iteration (s): 15.21 | learning rate: 1.690E-05 | global batch size:    16 | lm loss: 5.834442E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3225/  128728 | consumed samples:        51600 | consumed tokens:    105676800 | elapsed time per iteration (s): 15.19 | learning rate: 1.691E-05 | global batch size:    16 | lm loss: 5.838341E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3226/  128728 | consumed samples:        51616 | consumed tokens:    105709568 | elapsed time per iteration (s): 15.21 | learning rate: 1.691E-05 | global batch size:    16 | lm loss: 5.809447E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3227/  128728 | consumed samples:        51632 | consumed tokens:    105742336 | elapsed time per iteration (s): 15.22 | learning rate: 1.692E-05 | global batch size:    16 | lm loss: 5.792805E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3228/  128728 | consumed samples:        51648 | consumed tokens:    105775104 | elapsed time per iteration (s): 15.16 | learning rate: 1.692E-05 | global batch size:    16 | lm loss: 5.630265E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3229/  128728 | consumed samples:        51664 | consumed tokens:    105807872 | elapsed time per iteration (s): 15.27 | learning rate: 1.693E-05 | global batch size:    16 | lm loss: 5.785818E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     3230/  128728 | consumed samples:        51680 | consumed tokens:    105840640 | elapsed time per iteration (s): 15.21 | learning rate: 1.693E-05 | global batch size:    16 | lm loss: 5.710336E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3231/  128728 | consumed samples:        51696 | consumed tokens:    105873408 | elapsed time per iteration (s): 15.22 | learning rate: 1.694E-05 | global batch size:    16 | lm loss: 5.774018E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3232/  128728 | consumed samples:        51712 | consumed tokens:    105906176 | elapsed time per iteration (s): 15.21 | learning rate: 1.695E-05 | global batch size:    16 | lm loss: 5.810544E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3233/  128728 | consumed samples:        51728 | consumed tokens:    105938944 | elapsed time per iteration (s): 15.21 | learning rate: 1.695E-05 | global batch size:    16 | lm loss: 5.686558E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3234/  128728 | consumed samples:        51744 | consumed tokens:    105971712 | elapsed time per iteration (s): 15.22 | learning rate: 1.696E-05 | global batch size:    16 | lm loss: 5.808766E+00 | grad norm: 0.935 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3235/  128728 | consumed samples:        51760 | consumed tokens:    106004480 | elapsed time per iteration (s): 15.23 | learning rate: 1.696E-05 | global batch size:    16 | lm loss: 5.933078E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3236/  128728 | consumed samples:        51776 | consumed tokens:    106037248 | elapsed time per iteration (s): 15.20 | learning rate: 1.697E-05 | global batch size:    16 | lm loss: 5.929778E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3237/  128728 | consumed samples:        51792 | consumed tokens:    106070016 | elapsed time per iteration (s): 15.19 | learning rate: 1.697E-05 | global batch size:    16 | lm loss: 5.637609E+00 | grad norm: 0.659 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3238/  128728 | consumed samples:        51808 | consumed tokens:    106102784 | elapsed time per iteration (s): 15.21 | learning rate: 1.698E-05 | global batch size:    16 | lm loss: 5.857882E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3239/  128728 | consumed samples:        51824 | consumed tokens:    106135552 | elapsed time per iteration (s): 15.24 | learning rate: 1.698E-05 | global batch size:    16 | lm loss: 5.865059E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3240/  128728 | consumed samples:        51840 | consumed tokens:    106168320 | elapsed time per iteration (s): 15.18 | learning rate: 1.699E-05 | global batch size:    16 | lm loss: 5.716511E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3241/  128728 | consumed samples:        51856 | consumed tokens:    106201088 | elapsed time per iteration (s): 15.21 | learning rate: 1.699E-05 | global batch size:    16 | lm loss: 5.803041E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3242/  128728 | consumed samples:        51872 | consumed tokens:    106233856 | elapsed time per iteration (s): 15.18 | learning rate: 1.700E-05 | global batch size:    16 | lm loss: 5.904123E+00 | grad norm: 0.651 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3243/  128728 | consumed samples:        51888 | consumed tokens:    106266624 | elapsed time per iteration (s): 15.20 | learning rate: 1.700E-05 | global batch size:    16 | lm loss: 5.810658E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3244/  128728 | consumed samples:        51904 | consumed tokens:    106299392 | elapsed time per iteration (s): 15.23 | learning rate: 1.701E-05 | global batch size:    16 | lm loss: 5.841102E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3245/  128728 | consumed samples:        51920 | consumed tokens:    106332160 | elapsed time per iteration (s): 15.22 | learning rate: 1.701E-05 | global batch size:    16 | lm loss: 5.736031E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3246/  128728 | consumed samples:        51936 | consumed tokens:    106364928 | elapsed time per iteration (s): 15.19 | learning rate: 1.702E-05 | global batch size:    16 | lm loss: 5.761059E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3247/  128728 | consumed samples:        51952 | consumed tokens:    106397696 | elapsed time per iteration (s): 15.24 | learning rate: 1.702E-05 | global batch size:    16 | lm loss: 5.894554E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3248/  128728 | consumed samples:        51968 | consumed tokens:    106430464 | elapsed time per iteration (s): 15.21 | learning rate: 1.703E-05 | global batch size:    16 | lm loss: 5.798692E+00 | grad norm: 0.648 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3249/  128728 | consumed samples:        51984 | consumed tokens:    106463232 | elapsed time per iteration (s): 15.21 | learning rate: 1.703E-05 | global batch size:    16 | lm loss: 5.678707E+00 | grad norm: 1.109 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3250/  128728 | consumed samples:        52000 | consumed tokens:    106496000 | elapsed time per iteration (s): 15.22 | learning rate: 1.704E-05 | global batch size:    16 | lm loss: 5.730203E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3251/  128728 | consumed samples:        52016 | consumed tokens:    106528768 | elapsed time per iteration (s): 15.17 | learning rate: 1.704E-05 | global batch size:    16 | lm loss: 5.578306E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3252/  128728 | consumed samples:        52032 | consumed tokens:    106561536 | elapsed time per iteration (s): 15.17 | learning rate: 1.705E-05 | global batch size:    16 | lm loss: 5.799627E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3253/  128728 | consumed samples:        52048 | consumed tokens:    106594304 | elapsed time per iteration (s): 15.23 | learning rate: 1.706E-05 | global batch size:    16 | lm loss: 5.785791E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3254/  128728 | consumed samples:        52064 | consumed tokens:    106627072 | elapsed time per iteration (s): 15.21 | learning rate: 1.706E-05 | global batch size:    16 | lm loss: 5.783490E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3255/  128728 | consumed samples:        52080 | consumed tokens:    106659840 | elapsed time per iteration (s): 15.16 | learning rate: 1.707E-05 | global batch size:    16 | lm loss: 5.852077E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3256/  128728 | consumed samples:        52096 | consumed tokens:    106692608 | elapsed time per iteration (s): 15.27 | learning rate: 1.707E-05 | global batch size:    16 | lm loss: 6.013089E+00 | grad norm: 0.940 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     3257/  128728 | consumed samples:        52112 | consumed tokens:    106725376 | elapsed time per iteration (s): 15.19 | learning rate: 1.708E-05 | global batch size:    16 | lm loss: 5.874432E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3258/  128728 | consumed samples:        52128 | consumed tokens:    106758144 | elapsed time per iteration (s): 15.23 | learning rate: 1.708E-05 | global batch size:    16 | lm loss: 5.740964E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3259/  128728 | consumed samples:        52144 | consumed tokens:    106790912 | elapsed time per iteration (s): 15.24 | learning rate: 1.709E-05 | global batch size:    16 | lm loss: 5.593841E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3260/  128728 | consumed samples:        52160 | consumed tokens:    106823680 | elapsed time per iteration (s): 15.20 | learning rate: 1.709E-05 | global batch size:    16 | lm loss: 5.618110E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3261/  128728 | consumed samples:        52176 | consumed tokens:    106856448 | elapsed time per iteration (s): 15.20 | learning rate: 1.710E-05 | global batch size:    16 | lm loss: 5.856056E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3262/  128728 | consumed samples:        52192 | consumed tokens:    106889216 | elapsed time per iteration (s): 15.22 | learning rate: 1.710E-05 | global batch size:    16 | lm loss: 5.848229E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3263/  128728 | consumed samples:        52208 | consumed tokens:    106921984 | elapsed time per iteration (s): 15.23 | learning rate: 1.711E-05 | global batch size:    16 | lm loss: 5.979393E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3264/  128728 | consumed samples:        52224 | consumed tokens:    106954752 | elapsed time per iteration (s): 15.22 | learning rate: 1.711E-05 | global batch size:    16 | lm loss: 5.691633E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3265/  128728 | consumed samples:        52240 | consumed tokens:    106987520 | elapsed time per iteration (s): 15.23 | learning rate: 1.712E-05 | global batch size:    16 | lm loss: 5.807378E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3266/  128728 | consumed samples:        52256 | consumed tokens:    107020288 | elapsed time per iteration (s): 15.23 | learning rate: 1.712E-05 | global batch size:    16 | lm loss: 5.705748E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3267/  128728 | consumed samples:        52272 | consumed tokens:    107053056 | elapsed time per iteration (s): 15.25 | learning rate: 1.713E-05 | global batch size:    16 | lm loss: 5.653453E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3268/  128728 | consumed samples:        52288 | consumed tokens:    107085824 | elapsed time per iteration (s): 15.23 | learning rate: 1.713E-05 | global batch size:    16 | lm loss: 5.936657E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3269/  128728 | consumed samples:        52304 | consumed tokens:    107118592 | elapsed time per iteration (s): 15.20 | learning rate: 1.714E-05 | global batch size:    16 | lm loss: 5.710529E+00 | grad norm: 1.321 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3270/  128728 | consumed samples:        52320 | consumed tokens:    107151360 | elapsed time per iteration (s): 15.17 | learning rate: 1.714E-05 | global batch size:    16 | lm loss: 5.695917E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3271/  128728 | consumed samples:        52336 | consumed tokens:    107184128 | elapsed time per iteration (s): 15.24 | learning rate: 1.715E-05 | global batch size:    16 | lm loss: 5.680094E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3272/  128728 | consumed samples:        52352 | consumed tokens:    107216896 | elapsed time per iteration (s): 15.20 | learning rate: 1.715E-05 | global batch size:    16 | lm loss: 5.884488E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3273/  128728 | consumed samples:        52368 | consumed tokens:    107249664 | elapsed time per iteration (s): 15.24 | learning rate: 1.716E-05 | global batch size:    16 | lm loss: 5.788309E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3274/  128728 | consumed samples:        52384 | consumed tokens:    107282432 | elapsed time per iteration (s): 15.24 | learning rate: 1.717E-05 | global batch size:    16 | lm loss: 5.728816E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3275/  128728 | consumed samples:        52400 | consumed tokens:    107315200 | elapsed time per iteration (s): 15.23 | learning rate: 1.717E-05 | global batch size:    16 | lm loss: 6.045995E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3276/  128728 | consumed samples:        52416 | consumed tokens:    107347968 | elapsed time per iteration (s): 15.21 | learning rate: 1.718E-05 | global batch size:    16 | lm loss: 5.863292E+00 | grad norm: 0.659 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3277/  128728 | consumed samples:        52432 | consumed tokens:    107380736 | elapsed time per iteration (s): 15.21 | learning rate: 1.718E-05 | global batch size:    16 | lm loss: 5.678338E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3278/  128728 | consumed samples:        52448 | consumed tokens:    107413504 | elapsed time per iteration (s): 15.16 | learning rate: 1.719E-05 | global batch size:    16 | lm loss: 5.855639E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3279/  128728 | consumed samples:        52464 | consumed tokens:    107446272 | elapsed time per iteration (s): 15.22 | learning rate: 1.719E-05 | global batch size:    16 | lm loss: 5.804471E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3280/  128728 | consumed samples:        52480 | consumed tokens:    107479040 | elapsed time per iteration (s): 15.22 | learning rate: 1.720E-05 | global batch size:    16 | lm loss: 5.617855E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3281/  128728 | consumed samples:        52496 | consumed tokens:    107511808 | elapsed time per iteration (s): 15.19 | learning rate: 1.720E-05 | global batch size:    16 | lm loss: 5.743747E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3282/  128728 | consumed samples:        52512 | consumed tokens:    107544576 | elapsed time per iteration (s): 15.19 | learning rate: 1.721E-05 | global batch size:    16 | lm loss: 5.869383E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3283/  128728 | consumed samples:        52528 | consumed tokens:    107577344 | elapsed time per iteration (s): 15.18 | learning rate: 1.721E-05 | global batch size:    16 | lm loss: 5.538039E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3284/  128728 | consumed samples:        52544 | consumed tokens:    107610112 | elapsed time per iteration (s): 15.23 | learning rate: 1.722E-05 | global batch size:    16 | lm loss: 5.996184E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3285/  128728 | consumed samples:        52560 | consumed tokens:    107642880 | elapsed time per iteration (s): 15.17 | learning rate: 1.722E-05 | global batch size:    16 | lm loss: 5.756711E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3286/  128728 | consumed samples:        52576 | consumed tokens:    107675648 | elapsed time per iteration (s): 15.18 | learning rate: 1.723E-05 | global batch size:    16 | lm loss: 5.927887E+00 | grad norm: 1.021 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3287/  128728 | consumed samples:        52592 | consumed tokens:    107708416 | elapsed time per iteration (s): 15.21 | learning rate: 1.723E-05 | global batch size:    16 | lm loss: 5.704397E+00 | grad norm: 0.635 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3288/  128728 | consumed samples:        52608 | consumed tokens:    107741184 | elapsed time per iteration (s): 15.19 | learning rate: 1.724E-05 | global batch size:    16 | lm loss: 5.545193E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3289/  128728 | consumed samples:        52624 | consumed tokens:    107773952 | elapsed time per iteration (s): 15.20 | learning rate: 1.724E-05 | global batch size:    16 | lm loss: 5.826765E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3290/  128728 | consumed samples:        52640 | consumed tokens:    107806720 | elapsed time per iteration (s): 15.20 | learning rate: 1.725E-05 | global batch size:    16 | lm loss: 5.701634E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3291/  128728 | consumed samples:        52656 | consumed tokens:    107839488 | elapsed time per iteration (s): 15.19 | learning rate: 1.725E-05 | global batch size:    16 | lm loss: 5.741204E+00 | grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3292/  128728 | consumed samples:        52672 | consumed tokens:    107872256 | elapsed time per iteration (s): 15.21 | learning rate: 1.726E-05 | global batch size:    16 | lm loss: 5.751829E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3293/  128728 | consumed samples:        52688 | consumed tokens:    107905024 | elapsed time per iteration (s): 15.16 | learning rate: 1.726E-05 | global batch size:    16 | lm loss: 5.917647E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3294/  128728 | consumed samples:        52704 | consumed tokens:    107937792 | elapsed time per iteration (s): 15.25 | learning rate: 1.727E-05 | global batch size:    16 | lm loss: 5.593085E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3295/  128728 | consumed samples:        52720 | consumed tokens:    107970560 | elapsed time per iteration (s): 15.19 | learning rate: 1.728E-05 | global batch size:    16 | lm loss: 5.778680E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3296/  128728 | consumed samples:        52736 | consumed tokens:    108003328 | elapsed time per iteration (s): 15.23 | learning rate: 1.728E-05 | global batch size:    16 | lm loss: 5.817701E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3297/  128728 | consumed samples:        52752 | consumed tokens:    108036096 | elapsed time per iteration (s): 15.23 | learning rate: 1.729E-05 | global batch size:    16 | lm loss: 5.779237E+00 | grad norm: 1.102 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3298/  128728 | consumed samples:        52768 | consumed tokens:    108068864 | elapsed time per iteration (s): 15.23 | learning rate: 1.729E-05 | global batch size:    16 | lm loss: 5.603976E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3299/  128728 | consumed samples:        52784 | consumed tokens:    108101632 | elapsed time per iteration (s): 15.23 | learning rate: 1.730E-05 | global batch size:    16 | lm loss: 5.524374E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3300/  128728 | consumed samples:        52800 | consumed tokens:    108134400 | elapsed time per iteration (s): 15.19 | learning rate: 1.730E-05 | global batch size:    16 | lm loss: 5.887682E+00 | grad norm: 0.849 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3301/  128728 | consumed samples:        52816 | consumed tokens:    108167168 | elapsed time per iteration (s): 15.21 | learning rate: 1.731E-05 | global batch size:    16 | lm loss: 5.713980E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3302/  128728 | consumed samples:        52832 | consumed tokens:    108199936 | elapsed time per iteration (s): 15.22 | learning rate: 1.731E-05 | global batch size:    16 | lm loss: 5.805495E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3303/  128728 | consumed samples:        52848 | consumed tokens:    108232704 | elapsed time per iteration (s): 15.22 | learning rate: 1.732E-05 | global batch size:    16 | lm loss: 5.778564E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3304/  128728 | consumed samples:        52864 | consumed tokens:    108265472 | elapsed time per iteration (s): 15.19 | learning rate: 1.732E-05 | global batch size:    16 | lm loss: 5.578158E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3305/  128728 | consumed samples:        52880 | consumed tokens:    108298240 | elapsed time per iteration (s): 15.23 | learning rate: 1.733E-05 | global batch size:    16 | lm loss: 5.771214E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3306/  128728 | consumed samples:        52896 | consumed tokens:    108331008 | elapsed time per iteration (s): 15.20 | learning rate: 1.733E-05 | global batch size:    16 | lm loss: 5.839641E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3307/  128728 | consumed samples:        52912 | consumed tokens:    108363776 | elapsed time per iteration (s): 15.21 | learning rate: 1.734E-05 | global batch size:    16 | lm loss: 5.654119E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3308/  128728 | consumed samples:        52928 | consumed tokens:    108396544 | elapsed time per iteration (s): 15.18 | learning rate: 1.734E-05 | global batch size:    16 | lm loss: 5.570100E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3309/  128728 | consumed samples:        52944 | consumed tokens:    108429312 | elapsed time per iteration (s): 15.18 | learning rate: 1.735E-05 | global batch size:    16 | lm loss: 5.901294E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3310/  128728 | consumed samples:        52960 | consumed tokens:    108462080 | elapsed time per iteration (s): 15.22 | learning rate: 1.735E-05 | global batch size:    16 | lm loss: 5.962369E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3311/  128728 | consumed samples:        52976 | consumed tokens:    108494848 | elapsed time per iteration (s): 15.20 | learning rate: 1.736E-05 | global batch size:    16 | lm loss: 5.723049E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3312/  128728 | consumed samples:        52992 | consumed tokens:    108527616 | elapsed time per iteration (s): 15.27 | learning rate: 1.736E-05 | global batch size:    16 | lm loss: 5.923474E+00 | grad norm: 2.187 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     3313/  128728 | consumed samples:        53008 | consumed tokens:    108560384 | elapsed time per iteration (s): 15.22 | learning rate: 1.737E-05 | global batch size:    16 | lm loss: 5.845633E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3314/  128728 | consumed samples:        53024 | consumed tokens:    108593152 | elapsed time per iteration (s): 15.23 | learning rate: 1.737E-05 | global batch size:    16 | lm loss: 5.823073E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3315/  128728 | consumed samples:        53040 | consumed tokens:    108625920 | elapsed time per iteration (s): 15.22 | learning rate: 1.738E-05 | global batch size:    16 | lm loss: 5.727423E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3316/  128728 | consumed samples:        53056 | consumed tokens:    108658688 | elapsed time per iteration (s): 15.22 | learning rate: 1.739E-05 | global batch size:    16 | lm loss: 5.628917E+00 | grad norm: 0.672 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3317/  128728 | consumed samples:        53072 | consumed tokens:    108691456 | elapsed time per iteration (s): 15.21 | learning rate: 1.739E-05 | global batch size:    16 | lm loss: 5.650199E+00 | grad norm: 0.646 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3318/  128728 | consumed samples:        53088 | consumed tokens:    108724224 | elapsed time per iteration (s): 15.21 | learning rate: 1.740E-05 | global batch size:    16 | lm loss: 5.799413E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3319/  128728 | consumed samples:        53104 | consumed tokens:    108756992 | elapsed time per iteration (s): 15.23 | learning rate: 1.740E-05 | global batch size:    16 | lm loss: 5.819259E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3320/  128728 | consumed samples:        53120 | consumed tokens:    108789760 | elapsed time per iteration (s): 15.16 | learning rate: 1.741E-05 | global batch size:    16 | lm loss: 5.719065E+00 | grad norm: 0.635 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3321/  128728 | consumed samples:        53136 | consumed tokens:    108822528 | elapsed time per iteration (s): 15.21 | learning rate: 1.741E-05 | global batch size:    16 | lm loss: 5.814806E+00 | grad norm: 0.672 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3322/  128728 | consumed samples:        53152 | consumed tokens:    108855296 | elapsed time per iteration (s): 15.23 | learning rate: 1.742E-05 | global batch size:    16 | lm loss: 5.729675E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3323/  128728 | consumed samples:        53168 | consumed tokens:    108888064 | elapsed time per iteration (s): 15.21 | learning rate: 1.742E-05 | global batch size:    16 | lm loss: 5.674429E+00 | grad norm: 1.007 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3324/  128728 | consumed samples:        53184 | consumed tokens:    108920832 | elapsed time per iteration (s): 15.20 | learning rate: 1.743E-05 | global batch size:    16 | lm loss: 5.645885E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3325/  128728 | consumed samples:        53200 | consumed tokens:    108953600 | elapsed time per iteration (s): 15.23 | learning rate: 1.743E-05 | global batch size:    16 | lm loss: 5.516932E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3326/  128728 | consumed samples:        53216 | consumed tokens:    108986368 | elapsed time per iteration (s): 15.15 | learning rate: 1.744E-05 | global batch size:    16 | lm loss: 5.534013E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3327/  128728 | consumed samples:        53232 | consumed tokens:    109019136 | elapsed time per iteration (s): 15.17 | learning rate: 1.744E-05 | global batch size:    16 | lm loss: 5.667064E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3328/  128728 | consumed samples:        53248 | consumed tokens:    109051904 | elapsed time per iteration (s): 15.21 | learning rate: 1.745E-05 | global batch size:    16 | lm loss: 5.748591E+00 | grad norm: 1.036 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3329/  128728 | consumed samples:        53264 | consumed tokens:    109084672 | elapsed time per iteration (s): 15.24 | learning rate: 1.745E-05 | global batch size:    16 | lm loss: 5.727609E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3330/  128728 | consumed samples:        53280 | consumed tokens:    109117440 | elapsed time per iteration (s): 15.22 | learning rate: 1.746E-05 | global batch size:    16 | lm loss: 5.723650E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3331/  128728 | consumed samples:        53296 | consumed tokens:    109150208 | elapsed time per iteration (s): 15.25 | learning rate: 1.746E-05 | global batch size:    16 | lm loss: 5.739835E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3332/  128728 | consumed samples:        53312 | consumed tokens:    109182976 | elapsed time per iteration (s): 15.15 | learning rate: 1.747E-05 | global batch size:    16 | lm loss: 5.628811E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3333/  128728 | consumed samples:        53328 | consumed tokens:    109215744 | elapsed time per iteration (s): 15.25 | learning rate: 1.747E-05 | global batch size:    16 | lm loss: 5.761261E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3334/  128728 | consumed samples:        53344 | consumed tokens:    109248512 | elapsed time per iteration (s): 15.15 | learning rate: 1.748E-05 | global batch size:    16 | lm loss: 5.464535E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3335/  128728 | consumed samples:        53360 | consumed tokens:    109281280 | elapsed time per iteration (s): 15.21 | learning rate: 1.749E-05 | global batch size:    16 | lm loss: 5.644732E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3336/  128728 | consumed samples:        53376 | consumed tokens:    109314048 | elapsed time per iteration (s): 15.20 | learning rate: 1.749E-05 | global batch size:    16 | lm loss: 5.744635E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3337/  128728 | consumed samples:        53392 | consumed tokens:    109346816 | elapsed time per iteration (s): 15.20 | learning rate: 1.750E-05 | global batch size:    16 | lm loss: 5.764827E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3338/  128728 | consumed samples:        53408 | consumed tokens:    109379584 | elapsed time per iteration (s): 15.18 | learning rate: 1.750E-05 | global batch size:    16 | lm loss: 5.594451E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3339/  128728 | consumed samples:        53424 | consumed tokens:    109412352 | elapsed time per iteration (s): 15.22 | learning rate: 1.751E-05 | global batch size:    16 | lm loss: 5.622928E+00 | grad norm: 1.031 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3340/  128728 | consumed samples:        53440 | consumed tokens:    109445120 | elapsed time per iteration (s): 15.24 | learning rate: 1.751E-05 | global batch size:    16 | lm loss: 5.824643E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3341/  128728 | consumed samples:        53456 | consumed tokens:    109477888 | elapsed time per iteration (s): 15.22 | learning rate: 1.752E-05 | global batch size:    16 | lm loss: 5.793392E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3342/  128728 | consumed samples:        53472 | consumed tokens:    109510656 | elapsed time per iteration (s): 15.25 | learning rate: 1.752E-05 | global batch size:    16 | lm loss: 5.710301E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3343/  128728 | consumed samples:        53488 | consumed tokens:    109543424 | elapsed time per iteration (s): 15.23 | learning rate: 1.753E-05 | global batch size:    16 | lm loss: 5.582598E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3344/  128728 | consumed samples:        53504 | consumed tokens:    109576192 | elapsed time per iteration (s): 15.20 | learning rate: 1.753E-05 | global batch size:    16 | lm loss: 5.832360E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3345/  128728 | consumed samples:        53520 | consumed tokens:    109608960 | elapsed time per iteration (s): 15.22 | learning rate: 1.754E-05 | global batch size:    16 | lm loss: 5.602098E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3346/  128728 | consumed samples:        53536 | consumed tokens:    109641728 | elapsed time per iteration (s): 15.23 | learning rate: 1.754E-05 | global batch size:    16 | lm loss: 5.705314E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3347/  128728 | consumed samples:        53552 | consumed tokens:    109674496 | elapsed time per iteration (s): 15.18 | learning rate: 1.755E-05 | global batch size:    16 | lm loss: 5.765421E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3348/  128728 | consumed samples:        53568 | consumed tokens:    109707264 | elapsed time per iteration (s): 15.18 | learning rate: 1.755E-05 | global batch size:    16 | lm loss: 5.589844E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3349/  128728 | consumed samples:        53584 | consumed tokens:    109740032 | elapsed time per iteration (s): 15.21 | learning rate: 1.756E-05 | global batch size:    16 | lm loss: 5.752171E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3350/  128728 | consumed samples:        53600 | consumed tokens:    109772800 | elapsed time per iteration (s): 15.21 | learning rate: 1.756E-05 | global batch size:    16 | lm loss: 5.713757E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3351/  128728 | consumed samples:        53616 | consumed tokens:    109805568 | elapsed time per iteration (s): 15.17 | learning rate: 1.757E-05 | global batch size:    16 | lm loss: 5.712284E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3352/  128728 | consumed samples:        53632 | consumed tokens:    109838336 | elapsed time per iteration (s): 15.20 | learning rate: 1.757E-05 | global batch size:    16 | lm loss: 5.660229E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3353/  128728 | consumed samples:        53648 | consumed tokens:    109871104 | elapsed time per iteration (s): 15.19 | learning rate: 1.758E-05 | global batch size:    16 | lm loss: 5.759288E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3354/  128728 | consumed samples:        53664 | consumed tokens:    109903872 | elapsed time per iteration (s): 15.18 | learning rate: 1.758E-05 | global batch size:    16 | lm loss: 5.624930E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3355/  128728 | consumed samples:        53680 | consumed tokens:    109936640 | elapsed time per iteration (s): 15.22 | learning rate: 1.759E-05 | global batch size:    16 | lm loss: 5.804910E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3356/  128728 | consumed samples:        53696 | consumed tokens:    109969408 | elapsed time per iteration (s): 15.20 | learning rate: 1.760E-05 | global batch size:    16 | lm loss: 5.792589E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3357/  128728 | consumed samples:        53712 | consumed tokens:    110002176 | elapsed time per iteration (s): 15.20 | learning rate: 1.760E-05 | global batch size:    16 | lm loss: 5.710659E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3358/  128728 | consumed samples:        53728 | consumed tokens:    110034944 | elapsed time per iteration (s): 15.23 | learning rate: 1.761E-05 | global batch size:    16 | lm loss: 5.681277E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3359/  128728 | consumed samples:        53744 | consumed tokens:    110067712 | elapsed time per iteration (s): 15.22 | learning rate: 1.761E-05 | global batch size:    16 | lm loss: 5.616888E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3360/  128728 | consumed samples:        53760 | consumed tokens:    110100480 | elapsed time per iteration (s): 15.21 | learning rate: 1.762E-05 | global batch size:    16 | lm loss: 5.545935E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3361/  128728 | consumed samples:        53776 | consumed tokens:    110133248 | elapsed time per iteration (s): 15.22 | learning rate: 1.762E-05 | global batch size:    16 | lm loss: 5.594195E+00 | grad norm: 1.097 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3362/  128728 | consumed samples:        53792 | consumed tokens:    110166016 | elapsed time per iteration (s): 15.20 | learning rate: 1.763E-05 | global batch size:    16 | lm loss: 5.793941E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3363/  128728 | consumed samples:        53808 | consumed tokens:    110198784 | elapsed time per iteration (s): 15.19 | learning rate: 1.763E-05 | global batch size:    16 | lm loss: 5.692922E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3364/  128728 | consumed samples:        53824 | consumed tokens:    110231552 | elapsed time per iteration (s): 15.23 | learning rate: 1.764E-05 | global batch size:    16 | lm loss: 5.684273E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3365/  128728 | consumed samples:        53840 | consumed tokens:    110264320 | elapsed time per iteration (s): 15.23 | learning rate: 1.764E-05 | global batch size:    16 | lm loss: 5.695712E+00 | grad norm: 0.877 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3366/  128728 | consumed samples:        53856 | consumed tokens:    110297088 | elapsed time per iteration (s): 15.17 | learning rate: 1.765E-05 | global batch size:    16 | lm loss: 5.798710E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3367/  128728 | consumed samples:        53872 | consumed tokens:    110329856 | elapsed time per iteration (s): 15.21 | learning rate: 1.765E-05 | global batch size:    16 | lm loss: 5.708490E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3368/  128728 | consumed samples:        53888 | consumed tokens:    110362624 | elapsed time per iteration (s): 15.21 | learning rate: 1.766E-05 | global batch size:    16 | lm loss: 5.760231E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3369/  128728 | consumed samples:        53904 | consumed tokens:    110395392 | elapsed time per iteration (s): 15.22 | learning rate: 1.766E-05 | global batch size:    16 | lm loss: 5.631289E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3370/  128728 | consumed samples:        53920 | consumed tokens:    110428160 | elapsed time per iteration (s): 15.22 | learning rate: 1.767E-05 | global batch size:    16 | lm loss: 5.564578E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3371/  128728 | consumed samples:        53936 | consumed tokens:    110460928 | elapsed time per iteration (s): 15.23 | learning rate: 1.767E-05 | global batch size:    16 | lm loss: 5.699044E+00 | grad norm: 0.646 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3372/  128728 | consumed samples:        53952 | consumed tokens:    110493696 | elapsed time per iteration (s): 15.17 | learning rate: 1.768E-05 | global batch size:    16 | lm loss: 5.595973E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3373/  128728 | consumed samples:        53968 | consumed tokens:    110526464 | elapsed time per iteration (s): 15.26 | learning rate: 1.768E-05 | global batch size:    16 | lm loss: 5.924860E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3374/  128728 | consumed samples:        53984 | consumed tokens:    110559232 | elapsed time per iteration (s): 15.18 | learning rate: 1.769E-05 | global batch size:    16 | lm loss: 5.703710E+00 | grad norm: 0.867 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3375/  128728 | consumed samples:        54000 | consumed tokens:    110592000 | elapsed time per iteration (s): 15.25 | learning rate: 1.769E-05 | global batch size:    16 | lm loss: 5.843274E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3376/  128728 | consumed samples:        54016 | consumed tokens:    110624768 | elapsed time per iteration (s): 15.20 | learning rate: 1.770E-05 | global batch size:    16 | lm loss: 5.493551E+00 | grad norm: 0.862 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3377/  128728 | consumed samples:        54032 | consumed tokens:    110657536 | elapsed time per iteration (s): 15.24 | learning rate: 1.771E-05 | global batch size:    16 | lm loss: 5.871907E+00 | grad norm: 1.331 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3378/  128728 | consumed samples:        54048 | consumed tokens:    110690304 | elapsed time per iteration (s): 15.22 | learning rate: 1.771E-05 | global batch size:    16 | lm loss: 5.754053E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3379/  128728 | consumed samples:        54064 | consumed tokens:    110723072 | elapsed time per iteration (s): 15.18 | learning rate: 1.772E-05 | global batch size:    16 | lm loss: 5.626816E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3380/  128728 | consumed samples:        54080 | consumed tokens:    110755840 | elapsed time per iteration (s): 15.19 | learning rate: 1.772E-05 | global batch size:    16 | lm loss: 5.704596E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3381/  128728 | consumed samples:        54096 | consumed tokens:    110788608 | elapsed time per iteration (s): 15.23 | learning rate: 1.773E-05 | global batch size:    16 | lm loss: 5.738787E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3382/  128728 | consumed samples:        54112 | consumed tokens:    110821376 | elapsed time per iteration (s): 15.22 | learning rate: 1.773E-05 | global batch size:    16 | lm loss: 5.767883E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3383/  128728 | consumed samples:        54128 | consumed tokens:    110854144 | elapsed time per iteration (s): 15.23 | learning rate: 1.774E-05 | global batch size:    16 | lm loss: 5.859027E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3384/  128728 | consumed samples:        54144 | consumed tokens:    110886912 | elapsed time per iteration (s): 15.20 | learning rate: 1.774E-05 | global batch size:    16 | lm loss: 5.796133E+00 | grad norm: 0.639 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3385/  128728 | consumed samples:        54160 | consumed tokens:    110919680 | elapsed time per iteration (s): 15.23 | learning rate: 1.775E-05 | global batch size:    16 | lm loss: 5.630734E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3386/  128728 | consumed samples:        54176 | consumed tokens:    110952448 | elapsed time per iteration (s): 15.21 | learning rate: 1.775E-05 | global batch size:    16 | lm loss: 5.751515E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3387/  128728 | consumed samples:        54192 | consumed tokens:    110985216 | elapsed time per iteration (s): 15.21 | learning rate: 1.776E-05 | global batch size:    16 | lm loss: 5.599256E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3388/  128728 | consumed samples:        54208 | consumed tokens:    111017984 | elapsed time per iteration (s): 15.15 | learning rate: 1.776E-05 | global batch size:    16 | lm loss: 5.792551E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3389/  128728 | consumed samples:        54224 | consumed tokens:    111050752 | elapsed time per iteration (s): 15.20 | learning rate: 1.777E-05 | global batch size:    16 | lm loss: 5.626520E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3390/  128728 | consumed samples:        54240 | consumed tokens:    111083520 | elapsed time per iteration (s): 15.23 | learning rate: 1.777E-05 | global batch size:    16 | lm loss: 5.774959E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3391/  128728 | consumed samples:        54256 | consumed tokens:    111116288 | elapsed time per iteration (s): 15.21 | learning rate: 1.778E-05 | global batch size:    16 | lm loss: 5.683985E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3392/  128728 | consumed samples:        54272 | consumed tokens:    111149056 | elapsed time per iteration (s): 15.17 | learning rate: 1.778E-05 | global batch size:    16 | lm loss: 5.707817E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3393/  128728 | consumed samples:        54288 | consumed tokens:    111181824 | elapsed time per iteration (s): 15.21 | learning rate: 1.779E-05 | global batch size:    16 | lm loss: 5.814923E+00 | grad norm: 1.365 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3394/  128728 | consumed samples:        54304 | consumed tokens:    111214592 | elapsed time per iteration (s): 15.21 | learning rate: 1.779E-05 | global batch size:    16 | lm loss: 5.835570E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3395/  128728 | consumed samples:        54320 | consumed tokens:    111247360 | elapsed time per iteration (s): 15.22 | learning rate: 1.780E-05 | global batch size:    16 | lm loss: 5.720476E+00 | grad norm: 1.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3396/  128728 | consumed samples:        54336 | consumed tokens:    111280128 | elapsed time per iteration (s): 15.17 | learning rate: 1.780E-05 | global batch size:    16 | lm loss: 5.840722E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3397/  128728 | consumed samples:        54352 | consumed tokens:    111312896 | elapsed time per iteration (s): 15.20 | learning rate: 1.781E-05 | global batch size:    16 | lm loss: 5.656087E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3398/  128728 | consumed samples:        54368 | consumed tokens:    111345664 | elapsed time per iteration (s): 15.16 | learning rate: 1.782E-05 | global batch size:    16 | lm loss: 5.573381E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3399/  128728 | consumed samples:        54384 | consumed tokens:    111378432 | elapsed time per iteration (s): 15.17 | learning rate: 1.782E-05 | global batch size:    16 | lm loss: 5.773726E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3400/  128728 | consumed samples:        54400 | consumed tokens:    111411200 | elapsed time per iteration (s): 15.18 | learning rate: 1.783E-05 | global batch size:    16 | lm loss: 5.521105E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3401/  128728 | consumed samples:        54416 | consumed tokens:    111443968 | elapsed time per iteration (s): 15.22 | learning rate: 1.783E-05 | global batch size:    16 | lm loss: 5.594294E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3402/  128728 | consumed samples:        54432 | consumed tokens:    111476736 | elapsed time per iteration (s): 15.19 | learning rate: 1.784E-05 | global batch size:    16 | lm loss: 5.854078E+00 | grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3403/  128728 | consumed samples:        54448 | consumed tokens:    111509504 | elapsed time per iteration (s): 15.17 | learning rate: 1.784E-05 | global batch size:    16 | lm loss: 5.709444E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3404/  128728 | consumed samples:        54464 | consumed tokens:    111542272 | elapsed time per iteration (s): 15.18 | learning rate: 1.785E-05 | global batch size:    16 | lm loss: 5.785772E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3405/  128728 | consumed samples:        54480 | consumed tokens:    111575040 | elapsed time per iteration (s): 15.22 | learning rate: 1.785E-05 | global batch size:    16 | lm loss: 5.675919E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3406/  128728 | consumed samples:        54496 | consumed tokens:    111607808 | elapsed time per iteration (s): 15.20 | learning rate: 1.786E-05 | global batch size:    16 | lm loss: 5.934880E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3407/  128728 | consumed samples:        54512 | consumed tokens:    111640576 | elapsed time per iteration (s): 15.22 | learning rate: 1.786E-05 | global batch size:    16 | lm loss: 5.878328E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3408/  128728 | consumed samples:        54528 | consumed tokens:    111673344 | elapsed time per iteration (s): 15.21 | learning rate: 1.787E-05 | global batch size:    16 | lm loss: 5.828094E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3409/  128728 | consumed samples:        54544 | consumed tokens:    111706112 | elapsed time per iteration (s): 15.19 | learning rate: 1.787E-05 | global batch size:    16 | lm loss: 5.730283E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3410/  128728 | consumed samples:        54560 | consumed tokens:    111738880 | elapsed time per iteration (s): 15.21 | learning rate: 1.788E-05 | global batch size:    16 | lm loss: 5.648894E+00 | grad norm: 1.208 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3411/  128728 | consumed samples:        54576 | consumed tokens:    111771648 | elapsed time per iteration (s): 15.20 | learning rate: 1.788E-05 | global batch size:    16 | lm loss: 6.132384E+00 | grad norm: 0.874 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3412/  128728 | consumed samples:        54592 | consumed tokens:    111804416 | elapsed time per iteration (s): 15.18 | learning rate: 1.789E-05 | global batch size:    16 | lm loss: 5.648220E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3413/  128728 | consumed samples:        54608 | consumed tokens:    111837184 | elapsed time per iteration (s): 15.19 | learning rate: 1.789E-05 | global batch size:    16 | lm loss: 5.778464E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3414/  128728 | consumed samples:        54624 | consumed tokens:    111869952 | elapsed time per iteration (s): 15.20 | learning rate: 1.790E-05 | global batch size:    16 | lm loss: 5.724689E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3415/  128728 | consumed samples:        54640 | consumed tokens:    111902720 | elapsed time per iteration (s): 15.20 | learning rate: 1.790E-05 | global batch size:    16 | lm loss: 5.589879E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3416/  128728 | consumed samples:        54656 | consumed tokens:    111935488 | elapsed time per iteration (s): 15.24 | learning rate: 1.791E-05 | global batch size:    16 | lm loss: 5.682995E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3417/  128728 | consumed samples:        54672 | consumed tokens:    111968256 | elapsed time per iteration (s): 15.16 | learning rate: 1.791E-05 | global batch size:    16 | lm loss: 5.687815E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3418/  128728 | consumed samples:        54688 | consumed tokens:    112001024 | elapsed time per iteration (s): 15.22 | learning rate: 1.792E-05 | global batch size:    16 | lm loss: 5.820484E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3419/  128728 | consumed samples:        54704 | consumed tokens:    112033792 | elapsed time per iteration (s): 15.23 | learning rate: 1.793E-05 | global batch size:    16 | lm loss: 5.659999E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3420/  128728 | consumed samples:        54720 | consumed tokens:    112066560 | elapsed time per iteration (s): 15.21 | learning rate: 1.793E-05 | global batch size:    16 | lm loss: 5.798374E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3421/  128728 | consumed samples:        54736 | consumed tokens:    112099328 | elapsed time per iteration (s): 15.22 | learning rate: 1.794E-05 | global batch size:    16 | lm loss: 5.579554E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3422/  128728 | consumed samples:        54752 | consumed tokens:    112132096 | elapsed time per iteration (s): 15.21 | learning rate: 1.794E-05 | global batch size:    16 | lm loss: 5.739928E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3423/  128728 | consumed samples:        54768 | consumed tokens:    112164864 | elapsed time per iteration (s): 15.17 | learning rate: 1.795E-05 | global batch size:    16 | lm loss: 5.720255E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3424/  128728 | consumed samples:        54784 | consumed tokens:    112197632 | elapsed time per iteration (s): 15.25 | learning rate: 1.795E-05 | global batch size:    16 | lm loss: 5.507630E+00 | grad norm: 1.070 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3425/  128728 | consumed samples:        54800 | consumed tokens:    112230400 | elapsed time per iteration (s): 15.19 | learning rate: 1.796E-05 | global batch size:    16 | lm loss: 5.621741E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3426/  128728 | consumed samples:        54816 | consumed tokens:    112263168 | elapsed time per iteration (s): 15.20 | learning rate: 1.796E-05 | global batch size:    16 | lm loss: 5.538146E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3427/  128728 | consumed samples:        54832 | consumed tokens:    112295936 | elapsed time per iteration (s): 15.21 | learning rate: 1.797E-05 | global batch size:    16 | lm loss: 5.712105E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3428/  128728 | consumed samples:        54848 | consumed tokens:    112328704 | elapsed time per iteration (s): 15.23 | learning rate: 1.797E-05 | global batch size:    16 | lm loss: 5.487374E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3429/  128728 | consumed samples:        54864 | consumed tokens:    112361472 | elapsed time per iteration (s): 15.19 | learning rate: 1.798E-05 | global batch size:    16 | lm loss: 5.644139E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3430/  128728 | consumed samples:        54880 | consumed tokens:    112394240 | elapsed time per iteration (s): 15.17 | learning rate: 1.798E-05 | global batch size:    16 | lm loss: 5.514249E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3431/  128728 | consumed samples:        54896 | consumed tokens:    112427008 | elapsed time per iteration (s): 15.18 | learning rate: 1.799E-05 | global batch size:    16 | lm loss: 5.665630E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3432/  128728 | consumed samples:        54912 | consumed tokens:    112459776 | elapsed time per iteration (s): 15.15 | learning rate: 1.799E-05 | global batch size:    16 | lm loss: 5.801665E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3433/  128728 | consumed samples:        54928 | consumed tokens:    112492544 | elapsed time per iteration (s): 15.15 | learning rate: 1.800E-05 | global batch size:    16 | lm loss: 5.669302E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3434/  128728 | consumed samples:        54944 | consumed tokens:    112525312 | elapsed time per iteration (s): 15.18 | learning rate: 1.800E-05 | global batch size:    16 | lm loss: 5.777668E+00 | grad norm: 0.964 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3435/  128728 | consumed samples:        54960 | consumed tokens:    112558080 | elapsed time per iteration (s): 15.13 | learning rate: 1.801E-05 | global batch size:    16 | lm loss: 5.705936E+00 | grad norm: 0.943 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.058 | TFLOPs: 8.10 |
[default7]: iteration     3436/  128728 | consumed samples:        54976 | consumed tokens:    112590848 | elapsed time per iteration (s): 15.20 | learning rate: 1.801E-05 | global batch size:    16 | lm loss: 5.854589E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3437/  128728 | consumed samples:        54992 | consumed tokens:    112623616 | elapsed time per iteration (s): 15.22 | learning rate: 1.802E-05 | global batch size:    16 | lm loss: 5.623005E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3438/  128728 | consumed samples:        55008 | consumed tokens:    112656384 | elapsed time per iteration (s): 15.21 | learning rate: 1.803E-05 | global batch size:    16 | lm loss: 5.733920E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3439/  128728 | consumed samples:        55024 | consumed tokens:    112689152 | elapsed time per iteration (s): 15.14 | learning rate: 1.803E-05 | global batch size:    16 | lm loss: 5.607145E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3440/  128728 | consumed samples:        55040 | consumed tokens:    112721920 | elapsed time per iteration (s): 15.18 | learning rate: 1.804E-05 | global batch size:    16 | lm loss: 5.568397E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3441/  128728 | consumed samples:        55056 | consumed tokens:    112754688 | elapsed time per iteration (s): 15.22 | learning rate: 1.804E-05 | global batch size:    16 | lm loss: 5.497924E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3442/  128728 | consumed samples:        55072 | consumed tokens:    112787456 | elapsed time per iteration (s): 15.20 | learning rate: 1.805E-05 | global batch size:    16 | lm loss: 5.711787E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3443/  128728 | consumed samples:        55088 | consumed tokens:    112820224 | elapsed time per iteration (s): 15.19 | learning rate: 1.805E-05 | global batch size:    16 | lm loss: 5.645088E+00 | grad norm: 0.989 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3444/  128728 | consumed samples:        55104 | consumed tokens:    112852992 | elapsed time per iteration (s): 15.18 | learning rate: 1.806E-05 | global batch size:    16 | lm loss: 5.776569E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3445/  128728 | consumed samples:        55120 | consumed tokens:    112885760 | elapsed time per iteration (s): 15.16 | learning rate: 1.806E-05 | global batch size:    16 | lm loss: 5.663031E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3446/  128728 | consumed samples:        55136 | consumed tokens:    112918528 | elapsed time per iteration (s): 15.22 | learning rate: 1.807E-05 | global batch size:    16 | lm loss: 5.596757E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3447/  128728 | consumed samples:        55152 | consumed tokens:    112951296 | elapsed time per iteration (s): 15.15 | learning rate: 1.807E-05 | global batch size:    16 | lm loss: 5.633924E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3448/  128728 | consumed samples:        55168 | consumed tokens:    112984064 | elapsed time per iteration (s): 15.22 | learning rate: 1.808E-05 | global batch size:    16 | lm loss: 5.418813E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3449/  128728 | consumed samples:        55184 | consumed tokens:    113016832 | elapsed time per iteration (s): 15.20 | learning rate: 1.808E-05 | global batch size:    16 | lm loss: 5.588249E+00 | grad norm: 1.016 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3450/  128728 | consumed samples:        55200 | consumed tokens:    113049600 | elapsed time per iteration (s): 15.19 | learning rate: 1.809E-05 | global batch size:    16 | lm loss: 5.400003E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3451/  128728 | consumed samples:        55216 | consumed tokens:    113082368 | elapsed time per iteration (s): 15.22 | learning rate: 1.809E-05 | global batch size:    16 | lm loss: 5.908926E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3452/  128728 | consumed samples:        55232 | consumed tokens:    113115136 | elapsed time per iteration (s): 15.25 | learning rate: 1.810E-05 | global batch size:    16 | lm loss: 5.507290E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3453/  128728 | consumed samples:        55248 | consumed tokens:    113147904 | elapsed time per iteration (s): 15.19 | learning rate: 1.810E-05 | global batch size:    16 | lm loss: 5.697307E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3454/  128728 | consumed samples:        55264 | consumed tokens:    113180672 | elapsed time per iteration (s): 15.22 | learning rate: 1.811E-05 | global batch size:    16 | lm loss: 5.761248E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3455/  128728 | consumed samples:        55280 | consumed tokens:    113213440 | elapsed time per iteration (s): 15.22 | learning rate: 1.811E-05 | global batch size:    16 | lm loss: 5.412930E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3456/  128728 | consumed samples:        55296 | consumed tokens:    113246208 | elapsed time per iteration (s): 15.21 | learning rate: 1.812E-05 | global batch size:    16 | lm loss: 5.534837E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3457/  128728 | consumed samples:        55312 | consumed tokens:    113278976 | elapsed time per iteration (s): 15.23 | learning rate: 1.812E-05 | global batch size:    16 | lm loss: 5.676351E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3458/  128728 | consumed samples:        55328 | consumed tokens:    113311744 | elapsed time per iteration (s): 15.21 | learning rate: 1.813E-05 | global batch size:    16 | lm loss: 5.914691E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3459/  128728 | consumed samples:        55344 | consumed tokens:    113344512 | elapsed time per iteration (s): 15.23 | learning rate: 1.814E-05 | global batch size:    16 | lm loss: 5.779829E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3460/  128728 | consumed samples:        55360 | consumed tokens:    113377280 | elapsed time per iteration (s): 15.21 | learning rate: 1.814E-05 | global batch size:    16 | lm loss: 5.488255E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3461/  128728 | consumed samples:        55376 | consumed tokens:    113410048 | elapsed time per iteration (s): 15.23 | learning rate: 1.815E-05 | global batch size:    16 | lm loss: 5.597379E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3462/  128728 | consumed samples:        55392 | consumed tokens:    113442816 | elapsed time per iteration (s): 15.26 | learning rate: 1.815E-05 | global batch size:    16 | lm loss: 5.796825E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     3463/  128728 | consumed samples:        55408 | consumed tokens:    113475584 | elapsed time per iteration (s): 15.15 | learning rate: 1.816E-05 | global batch size:    16 | lm loss: 5.453174E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3464/  128728 | consumed samples:        55424 | consumed tokens:    113508352 | elapsed time per iteration (s): 15.20 | learning rate: 1.816E-05 | global batch size:    16 | lm loss: 5.592092E+00 | grad norm: 0.862 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3465/  128728 | consumed samples:        55440 | consumed tokens:    113541120 | elapsed time per iteration (s): 15.20 | learning rate: 1.817E-05 | global batch size:    16 | lm loss: 5.629677E+00 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3466/  128728 | consumed samples:        55456 | consumed tokens:    113573888 | elapsed time per iteration (s): 15.24 | learning rate: 1.817E-05 | global batch size:    16 | lm loss: 5.776768E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3467/  128728 | consumed samples:        55472 | consumed tokens:    113606656 | elapsed time per iteration (s): 15.17 | learning rate: 1.818E-05 | global batch size:    16 | lm loss: 5.656150E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3468/  128728 | consumed samples:        55488 | consumed tokens:    113639424 | elapsed time per iteration (s): 15.15 | learning rate: 1.818E-05 | global batch size:    16 | lm loss: 5.554830E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3469/  128728 | consumed samples:        55504 | consumed tokens:    113672192 | elapsed time per iteration (s): 15.22 | learning rate: 1.819E-05 | global batch size:    16 | lm loss: 5.850750E+00 | grad norm: 0.862 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3470/  128728 | consumed samples:        55520 | consumed tokens:    113704960 | elapsed time per iteration (s): 15.21 | learning rate: 1.819E-05 | global batch size:    16 | lm loss: 5.848739E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3471/  128728 | consumed samples:        55536 | consumed tokens:    113737728 | elapsed time per iteration (s): 15.14 | learning rate: 1.820E-05 | global batch size:    16 | lm loss: 5.411209E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3472/  128728 | consumed samples:        55552 | consumed tokens:    113770496 | elapsed time per iteration (s): 15.24 | learning rate: 1.820E-05 | global batch size:    16 | lm loss: 5.765627E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3473/  128728 | consumed samples:        55568 | consumed tokens:    113803264 | elapsed time per iteration (s): 15.21 | learning rate: 1.821E-05 | global batch size:    16 | lm loss: 5.575092E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3474/  128728 | consumed samples:        55584 | consumed tokens:    113836032 | elapsed time per iteration (s): 15.21 | learning rate: 1.821E-05 | global batch size:    16 | lm loss: 5.591868E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3475/  128728 | consumed samples:        55600 | consumed tokens:    113868800 | elapsed time per iteration (s): 15.21 | learning rate: 1.822E-05 | global batch size:    16 | lm loss: 5.551509E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3476/  128728 | consumed samples:        55616 | consumed tokens:    113901568 | elapsed time per iteration (s): 15.22 | learning rate: 1.822E-05 | global batch size:    16 | lm loss: 5.394422E+00 | grad norm: 0.999 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3477/  128728 | consumed samples:        55632 | consumed tokens:    113934336 | elapsed time per iteration (s): 15.22 | learning rate: 1.823E-05 | global batch size:    16 | lm loss: 5.498854E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3478/  128728 | consumed samples:        55648 | consumed tokens:    113967104 | elapsed time per iteration (s): 15.20 | learning rate: 1.823E-05 | global batch size:    16 | lm loss: 5.861041E+00 | grad norm: 1.413 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3479/  128728 | consumed samples:        55664 | consumed tokens:    113999872 | elapsed time per iteration (s): 15.17 | learning rate: 1.824E-05 | global batch size:    16 | lm loss: 5.653027E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3480/  128728 | consumed samples:        55680 | consumed tokens:    114032640 | elapsed time per iteration (s): 15.22 | learning rate: 1.825E-05 | global batch size:    16 | lm loss: 5.562919E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3481/  128728 | consumed samples:        55696 | consumed tokens:    114065408 | elapsed time per iteration (s): 15.21 | learning rate: 1.825E-05 | global batch size:    16 | lm loss: 5.663836E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3482/  128728 | consumed samples:        55712 | consumed tokens:    114098176 | elapsed time per iteration (s): 15.23 | learning rate: 1.826E-05 | global batch size:    16 | lm loss: 5.682405E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3483/  128728 | consumed samples:        55728 | consumed tokens:    114130944 | elapsed time per iteration (s): 15.22 | learning rate: 1.826E-05 | global batch size:    16 | lm loss: 5.507264E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3484/  128728 | consumed samples:        55744 | consumed tokens:    114163712 | elapsed time per iteration (s): 15.18 | learning rate: 1.827E-05 | global batch size:    16 | lm loss: 5.668527E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3485/  128728 | consumed samples:        55760 | consumed tokens:    114196480 | elapsed time per iteration (s): 15.22 | learning rate: 1.827E-05 | global batch size:    16 | lm loss: 5.564321E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3486/  128728 | consumed samples:        55776 | consumed tokens:    114229248 | elapsed time per iteration (s): 15.24 | learning rate: 1.828E-05 | global batch size:    16 | lm loss: 5.737549E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3487/  128728 | consumed samples:        55792 | consumed tokens:    114262016 | elapsed time per iteration (s): 15.24 | learning rate: 1.828E-05 | global batch size:    16 | lm loss: 5.537987E+00 | grad norm: 1.367 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3488/  128728 | consumed samples:        55808 | consumed tokens:    114294784 | elapsed time per iteration (s): 15.22 | learning rate: 1.829E-05 | global batch size:    16 | lm loss: 5.651535E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3489/  128728 | consumed samples:        55824 | consumed tokens:    114327552 | elapsed time per iteration (s): 15.21 | learning rate: 1.829E-05 | global batch size:    16 | lm loss: 5.642838E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3490/  128728 | consumed samples:        55840 | consumed tokens:    114360320 | elapsed time per iteration (s): 15.19 | learning rate: 1.830E-05 | global batch size:    16 | lm loss: 5.894348E+00 | grad norm: 1.402 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3491/  128728 | consumed samples:        55856 | consumed tokens:    114393088 | elapsed time per iteration (s): 15.24 | learning rate: 1.830E-05 | global batch size:    16 | lm loss: 5.590985E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3492/  128728 | consumed samples:        55872 | consumed tokens:    114425856 | elapsed time per iteration (s): 15.21 | learning rate: 1.831E-05 | global batch size:    16 | lm loss: 5.752702E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3493/  128728 | consumed samples:        55888 | consumed tokens:    114458624 | elapsed time per iteration (s): 15.18 | learning rate: 1.831E-05 | global batch size:    16 | lm loss: 5.723320E+00 | grad norm: 0.849 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3494/  128728 | consumed samples:        55904 | consumed tokens:    114491392 | elapsed time per iteration (s): 15.21 | learning rate: 1.832E-05 | global batch size:    16 | lm loss: 5.537277E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3495/  128728 | consumed samples:        55920 | consumed tokens:    114524160 | elapsed time per iteration (s): 15.23 | learning rate: 1.832E-05 | global batch size:    16 | lm loss: 5.881509E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3496/  128728 | consumed samples:        55936 | consumed tokens:    114556928 | elapsed time per iteration (s): 15.22 | learning rate: 1.833E-05 | global batch size:    16 | lm loss: 5.464675E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3497/  128728 | consumed samples:        55952 | consumed tokens:    114589696 | elapsed time per iteration (s): 15.16 | learning rate: 1.833E-05 | global batch size:    16 | lm loss: 5.424148E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3498/  128728 | consumed samples:        55968 | consumed tokens:    114622464 | elapsed time per iteration (s): 15.20 | learning rate: 1.834E-05 | global batch size:    16 | lm loss: 5.588657E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3499/  128728 | consumed samples:        55984 | consumed tokens:    114655232 | elapsed time per iteration (s): 15.17 | learning rate: 1.834E-05 | global batch size:    16 | lm loss: 5.764312E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3500/  128728 | consumed samples:        56000 | consumed tokens:    114688000 | elapsed time per iteration (s): 15.19 | learning rate: 1.835E-05 | global batch size:    16 | lm loss: 5.390745E+00 | grad norm: 0.626 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default0]:saving checkpoint at iteration    3500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:[2022-03-03 20:47:38,235] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/mp_rank_00_model_states.pt
[default1]:[2022-03-03 20:47:38,515] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/mp_rank_01_model_states.pt
[default0]:[2022-03-03 20:47:52,222] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt
[default5]:[2022-03-03 20:47:52,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt
[default4]:[2022-03-03 20:47:52,450] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt
[default4]:[2022-03-03 20:47:52,448] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 20:47:52,517] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt
[default6]:[2022-03-03 20:47:52,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 20:47:52,805] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt
[default2]:[2022-03-03 20:47:52,889] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt
[default0]:[2022-03-03 20:47:52,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 20:47:53,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt
[default7]:[2022-03-03 20:47:53,093] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt
[default7]:[2022-03-03 20:47:53,089] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt
[default5]:[2022-03-03 20:47:53,226] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 20:47:53,259] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt
[default7]:[2022-03-03 20:47:53,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt
[default0]:[2022-03-03 20:47:53,288] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt
[default0]:[2022-03-03 20:47:53,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default6]:[2022-03-03 20:47:53,347] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 20:47:53,474] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt
[default5]:[2022-03-03 20:47:53,514] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 20:47:53,504] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default3]:[2022-03-03 20:47:53,473] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt
[default6]:[2022-03-03 20:47:53,513] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt
[default2]:[2022-03-03 20:47:53,555] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt
[default1]:[2022-03-03 20:47:53,606] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt
[default1]:[2022-03-03 20:47:53,617] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 20:47:53,779] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default1]:[2022-03-03 20:47:53,875] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt
[default3]:[2022-03-03 20:47:53,946] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt
[default5]:[2022-03-03 20:47:54,075] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 20:47:54,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default4]:[2022-03-03 20:47:54,112] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt
[default2]:[2022-03-03 20:47:54,201] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt
[default4]:[2022-03-03 20:47:54,198] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default0]:[2022-03-03 20:47:54,233] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
[default1]:[2022-03-03 20:47:54,280] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 20:47:54,211] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default6]:[2022-03-03 20:47:54,294] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt
[default7]:[2022-03-03 20:47:54,296] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt
[default3]:[2022-03-03 20:47:54,376] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 20:47:54,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt
[default4]:[2022-03-03 20:47:54,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt
[default0]:[2022-03-03 20:47:54,378] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 20:47:54,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt
[default1]:[2022-03-03 20:47:54,387] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 20:47:54,372] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt
[default4]:[2022-03-03 20:47:54,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
[default7]:[2022-03-03 20:47:54,534] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt
[default5]:[2022-03-03 20:47:54,630] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 20:47:54,590] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt
[default4]:[2022-03-03 20:47:54,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 20:47:54,691] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt
[default5]:[2022-03-03 20:47:54,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt
[default6]:[2022-03-03 20:47:54,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 20:47:54,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 20:47:54,752] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 20:47:54,810] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default2]:[2022-03-03 20:47:54,906] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default2]:[2022-03-03 20:47:54,949] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 20:47:54,899] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt
[default0]:[2022-03-03 20:47:55,048] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt
[default6]:[2022-03-03 20:47:55,105] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt
[default6]:[2022-03-03 20:47:55,094] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 20:47:55,185] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt
[default1]:[2022-03-03 20:47:55,214] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt
[default2]:[2022-03-03 20:47:55,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default4]:[2022-03-03 20:47:55,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt
[default5]:[2022-03-03 20:47:55,304] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default7]:[2022-03-03 20:47:55,305] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 20:47:55,317] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt
[default2]:[2022-03-03 20:47:55,472] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt
[default1]:[2022-03-03 20:47:55,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt
[default6]:[2022-03-03 20:47:55,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt
[default4]:[2022-03-03 20:47:55,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt
[default7]:[2022-03-03 20:47:55,711] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 20:47:55,771] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 20:47:55,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt
[default6]:[2022-03-03 20:47:55,832] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt
[default0]:[2022-03-03 20:47:55,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt
[default3]:[2022-03-03 20:47:56,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 20:47:56,082] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt
[default1]:[2022-03-03 20:47:56,097] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt
[default0]:[2022-03-03 20:47:56,117] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt
[default4]:[2022-03-03 20:47:56,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt
[default3]:[2022-03-03 20:47:56,198] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt
[default1]:[2022-03-03 20:47:56,201] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt
[default6]:[2022-03-03 20:47:56,259] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt
[default0]:[2022-03-03 20:47:56,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt
[default5]:[2022-03-03 20:47:56,385] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt
[default2]:[2022-03-03 20:47:56,512] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt
[default7]:[2022-03-03 20:47:56,556] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 20:47:56,578] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt
[default5]:[2022-03-03 20:47:56,791] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
[default6]:[2022-03-03 20:47:56,855] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default4]:[2022-03-03 20:47:56,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default1]:[2022-03-03 20:47:57,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt
[default2]:[2022-03-03 20:47:57,129] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 20:47:57,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt
[default3]:[2022-03-03 20:47:57,159] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 20:47:57,190] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default1]:[2022-03-03 20:47:57,219] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt
[default7]:[2022-03-03 20:47:57,312] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt
[default4]:[2022-03-03 20:47:57,300] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt
[default1]:[2022-03-03 20:47:57,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt
[default0]:[2022-03-03 20:47:57,288] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt
[default3]:[2022-03-03 20:47:57,365] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt
[default5]:[2022-03-03 20:47:57,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt
[default2]:[2022-03-03 20:47:57,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 20:47:57,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
[default7]:[2022-03-03 20:47:57,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 20:47:57,510] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 20:47:57,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 20:47:57,491] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt
[default1]:[2022-03-03 20:47:57,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt
[default2]:[2022-03-03 20:47:57,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt
[default3]:[2022-03-03 20:47:57,768] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default4]:[2022-03-03 20:47:57,774] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt
[default7]:[2022-03-03 20:47:57,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default6]:[2022-03-03 20:47:57,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 20:47:57,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 20:47:57,955] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt
[default2]:[2022-03-03 20:47:58,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default0]:[2022-03-03 20:47:58,056] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt
[default0]:[2022-03-03 20:47:58,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 20:47:58,162] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt
[default1]:[2022-03-03 20:47:58,125] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt
[default3]:[2022-03-03 20:47:58,198] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 20:47:58,219] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default3]:[2022-03-03 20:47:58,181] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default2]:[2022-03-03 20:47:58,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt
[default0]:[2022-03-03 20:47:58,226] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default3]:[2022-03-03 20:47:58,285] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt
[default7]:[2022-03-03 20:47:58,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default1]:[2022-03-03 20:47:58,324] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt
[default0]:[2022-03-03 20:47:58,311] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 20:47:58,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt
[default4]:[2022-03-03 20:47:58,415] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt
[default0]:[2022-03-03 20:47:58,429] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default2]:[2022-03-03 20:47:58,495] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt
[default1]:[2022-03-03 20:47:58,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 20:47:58,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 20:47:58,539] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default4]:[2022-03-03 20:47:58,407] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default5]:[2022-03-03 20:47:58,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt
[default5]:[2022-03-03 20:47:58,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt
[default5]:[2022-03-03 20:47:58,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default7]:[2022-03-03 20:47:58,512] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt
[default7]:[2022-03-03 20:47:58,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 20:47:58,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
[default6]:[2022-03-03 20:47:58,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 20:47:58,707] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 20:47:58,711] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt
[default7]:[2022-03-03 20:47:58,821] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt
[default6]:[2022-03-03 20:47:58,818] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt
[default5]:[2022-03-03 20:47:58,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 20:47:58,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default7]:[2022-03-03 20:47:59,001] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt
[default0]:[2022-03-03 20:47:59,008] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 20:47:58,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 20:47:58,986] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt
[default7]:[2022-03-03 20:47:59,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 20:47:59,047] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 20:47:59,098] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 20:47:59,073] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt
[default0]:[2022-03-03 20:47:59,063] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt
[default4]:[2022-03-03 20:47:59,158] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default4]:[2022-03-03 20:47:59,194] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default6]:[2022-03-03 20:47:59,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default5]:[2022-03-03 20:47:59,164] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default3]:[2022-03-03 20:47:59,156] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default3]:[2022-03-03 20:47:59,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt
[default5]:[2022-03-03 20:47:59,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default6]:[2022-03-03 20:47:59,405] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default7]:[2022-03-03 20:47:59,337] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt
[default1]:[2022-03-03 20:47:59,386] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 20:47:59,362] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt
[default3]:[2022-03-03 20:47:59,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 20:47:59,413] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt
[default4]:[2022-03-03 20:47:59,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt
[default0]:[2022-03-03 20:47:59,504] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default2]:[2022-03-03 20:47:59,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt
[default6]:[2022-03-03 20:47:59,560] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt
[default4]:[2022-03-03 20:47:59,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt
[default5]:[2022-03-03 20:47:59,625] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 20:47:59,646] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt
[default4]:[2022-03-03 20:47:59,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt
[default1]:[2022-03-03 20:47:59,613] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 20:47:59,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt
[default4]:[2022-03-03 20:47:59,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default1]:[2022-03-03 20:47:59,611] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 20:47:59,786] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 20:47:59,778] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt
[default3]:[2022-03-03 20:47:59,802] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 20:47:59,768] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default2]:[2022-03-03 20:47:59,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 20:47:59,795] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt
[default2]:[2022-03-03 20:47:59,832] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt
[default6]:[2022-03-03 20:47:59,782] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default5]:[2022-03-03 20:47:59,895] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default3]:[2022-03-03 20:47:59,920] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 20:47:59,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt
[default0]:[2022-03-03 20:47:59,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 20:48:00,023] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 20:48:00,019] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 20:48:00,016] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt
[default3]:[2022-03-03 20:48:00,047] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt
[default0]:[2022-03-03 20:48:00,124] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 20:48:00,103] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt
[default0]:[2022-03-03 20:48:00,110] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 20:48:00,090] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt
[default0]:[2022-03-03 20:48:00,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 20:48:00,234] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt
[default3]:[2022-03-03 20:48:00,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 20:48:00,309] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt
[default4]:[2022-03-03 20:48:00,297] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt
[default5]:[2022-03-03 20:48:00,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt
[default2]:[2022-03-03 20:48:00,360] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 20:48:00,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 20:48:00,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default7]:[2022-03-03 20:48:00,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default3]:[2022-03-03 20:48:00,552] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt
[default6]:[2022-03-03 20:48:00,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt
[default1]:[2022-03-03 20:48:00,579] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default2]:[2022-03-03 20:48:00,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default5]:[2022-03-03 20:48:00,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 20:48:00,663] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt
[default5]:[2022-03-03 20:48:00,651] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt
[default3]:[2022-03-03 20:48:00,705] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 20:48:00,634] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt
[default7]:[2022-03-03 20:48:00,647] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt
[default4]:[2022-03-03 20:48:00,744] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default1]:[2022-03-03 20:48:00,759] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt
[default0]:[2022-03-03 20:48:00,808] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 20:48:00,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default4]:[2022-03-03 20:48:00,859] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt
[default5]:[2022-03-03 20:48:00,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 20:48:00,935] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt
[default5]:[2022-03-03 20:48:00,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default1]:[2022-03-03 20:48:00,974] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default5]:[2022-03-03 20:48:01,037] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default0]:[2022-03-03 20:48:00,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default4]:[2022-03-03 20:48:00,988] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt
[default4]:[2022-03-03 20:48:01,093] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default2]:[2022-03-03 20:48:01,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt
[default4]:[2022-03-03 20:48:01,177] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt
[default7]:[2022-03-03 20:48:01,105] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 20:48:01,182] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default5]:[2022-03-03 20:48:01,163] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 20:48:01,199] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt
[default3]:[2022-03-03 20:48:01,265] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt
[default4]:[2022-03-03 20:48:01,328] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default2]:[2022-03-03 20:48:01,380] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default0]:[2022-03-03 20:48:01,358] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt
[default3]:[2022-03-03 20:48:01,467] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 20:48:01,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default6]:[2022-03-03 20:48:01,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 20:48:01,500] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 20:48:01,514] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt
[default5]:[2022-03-03 20:48:01,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt
[default6]:[2022-03-03 20:48:01,576] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default2]:[2022-03-03 20:48:01,641] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt
[default3]:[2022-03-03 20:48:01,579] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt
[default1]:[2022-03-03 20:48:01,624] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt
[default4]:[2022-03-03 20:48:01,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 20:48:01,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt
[default6]:[2022-03-03 20:48:01,661] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt
[default6]:[2022-03-03 20:48:01,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default5]:[2022-03-03 20:48:01,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default4]:[2022-03-03 20:48:01,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 20:48:01,713] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt
[default7]:[2022-03-03 20:48:01,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default7]:[2022-03-03 20:48:01,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt
[default3]:[2022-03-03 20:48:01,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default1]:[2022-03-03 20:48:01,966] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 20:48:02,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default7]:[2022-03-03 20:48:02,037] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default6]:[2022-03-03 20:48:01,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 20:48:02,139] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt
[default5]:[2022-03-03 20:48:02,150] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt
[default2]:[2022-03-03 20:48:02,155] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 20:48:02,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default0]:[2022-03-03 20:48:02,270] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default1]:[2022-03-03 20:48:02,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt
[default1]:[2022-03-03 20:48:02,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default3]:[2022-03-03 20:48:02,403] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt
[default6]:[2022-03-03 20:48:02,498] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default2]:[2022-03-03 20:48:02,614] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt
[default7]:[2022-03-03 20:48:02,657] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt
[default7]:[2022-03-03 20:48:02,625] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default3]:[2022-03-03 20:48:02,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt
[default1]:[2022-03-03 20:48:02,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt
[default1]:[2022-03-03 20:48:02,870] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 20:48:02,873] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt
[default2]:[2022-03-03 20:48:02,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default5]:[2022-03-03 20:48:03,024] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt
[default2]:[2022-03-03 20:48:03,081] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default1]:[2022-03-03 20:48:03,131] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 20:48:03,122] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt
[default3]:[2022-03-03 20:48:03,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 20:48:03,091] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default7]:[2022-03-03 20:48:03,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default3]:[2022-03-03 20:48:03,400] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default6]:[2022-03-03 20:48:03,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt
[default4]:[2022-03-03 20:48:03,486] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt
[default6]:[2022-03-03 20:48:03,492] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 20:48:03,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default1]:[2022-03-03 20:48:03,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 20:48:03,512] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt
[default2]:[2022-03-03 20:48:03,587] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default6]:[2022-03-03 20:48:03,681] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default4]:[2022-03-03 20:48:03,699] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default6]:[2022-03-03 20:48:03,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default7]:[2022-03-03 20:48:03,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 20:48:03,764] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 20:48:03,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default1]:[2022-03-03 20:48:03,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default3]:[2022-03-03 20:48:03,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default7]:[2022-03-03 20:48:03,717] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default7]:[2022-03-03 20:48:03,833] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default2]:[2022-03-03 20:48:03,789] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default0]:[2022-03-03 20:48:03,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default2]:[2022-03-03 20:48:03,915] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default6]:[2022-03-03 20:48:03,897] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default5]:[2022-03-03 20:48:03,891] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 20:48:03,923] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 20:48:03,960] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 20:48:03,983] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 20:48:04,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default3]:[2022-03-03 20:48:04,172] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 20:48:04,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt
[default0]:[2022-03-03 20:48:04,216] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 20:48:04,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 20:48:04,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default3]:[2022-03-03 20:48:04,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt
[default2]:[2022-03-03 20:48:04,374] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 20:48:04,453] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 20:48:04,515] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default2]:[2022-03-03 20:48:04,547] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt
[default3]:[2022-03-03 20:48:04,725] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default2]:[2022-03-03 20:48:04,838] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default6]:[2022-03-03 20:48:04,947] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default7]:[2022-03-03 20:48:04,889] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt
[default3]:[2022-03-03 20:48:04,955] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 20:48:04,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt
[default6]:[2022-03-03 20:48:04,918] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt
[default0]:[2022-03-03 20:48:04,947] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default1]:[2022-03-03 20:48:05,108] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default7]:[2022-03-03 20:48:05,031] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 20:48:05,133] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 20:48:05,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 20:48:05,284] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 20:48:05,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default7]:[2022-03-03 20:48:05,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default0]:[2022-03-03 20:48:05,702] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default0]:[2022-03-03 20:48:05,687] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 20:48:05,887] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default1]:[2022-03-03 20:48:06,170] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default5]:[2022-03-03 20:48:06,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 20:48:06,298] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default6]:[2022-03-03 20:48:06,371] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default7]:[2022-03-03 20:48:06,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default5]:[2022-03-03 20:48:06,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 20:48:06,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default4]:[2022-03-03 20:48:06,812] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default0]:[2022-03-03 20:48:06,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 20:48:06,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 20:48:06,963] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default2]:[2022-03-03 20:48:06,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default1]:[2022-03-03 20:48:07,008] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt
[default0]:[2022-03-03 20:48:07,017] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt
[default7]:[2022-03-03 20:48:07,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 20:48:07,306] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default5]:[2022-03-03 20:48:07,452] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default4]:[2022-03-03 20:48:07,456] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default5]:[2022-03-03 20:48:07,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 20:48:07,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 20:48:08,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 20:48:08,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default4]:[2022-03-03 20:48:09,368] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 20:48:09,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default7]:[2022-03-03 20:48:09,833] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default6]:[2022-03-03 20:48:09,912] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 20:48:10,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default5]:[2022-03-03 20:48:10,996] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default7]:time (ms) | save-checkpoint: 42646.90
[default0]:  successfully saved checkpoint at iteration    3500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default7]: iteration     3501/  128728 | consumed samples:        56016 | consumed tokens:    114720768 | elapsed time per iteration (s): 57.87 | learning rate: 1.836E-05 | global batch size:    16 | lm loss: 5.723885E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.276 | TFLOPs: 2.12 |
[default7]: iteration     3502/  128728 | consumed samples:        56032 | consumed tokens:    114753536 | elapsed time per iteration (s): 15.19 | learning rate: 1.836E-05 | global batch size:    16 | lm loss: 5.632663E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3503/  128728 | consumed samples:        56048 | consumed tokens:    114786304 | elapsed time per iteration (s): 15.18 | learning rate: 1.837E-05 | global batch size:    16 | lm loss: 5.605666E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3504/  128728 | consumed samples:        56064 | consumed tokens:    114819072 | elapsed time per iteration (s): 15.21 | learning rate: 1.837E-05 | global batch size:    16 | lm loss: 5.401419E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3505/  128728 | consumed samples:        56080 | consumed tokens:    114851840 | elapsed time per iteration (s): 15.18 | learning rate: 1.838E-05 | global batch size:    16 | lm loss: 5.347599E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3506/  128728 | consumed samples:        56096 | consumed tokens:    114884608 | elapsed time per iteration (s): 15.22 | learning rate: 1.838E-05 | global batch size:    16 | lm loss: 5.724100E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3507/  128728 | consumed samples:        56112 | consumed tokens:    114917376 | elapsed time per iteration (s): 15.20 | learning rate: 1.839E-05 | global batch size:    16 | lm loss: 5.753415E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3508/  128728 | consumed samples:        56128 | consumed tokens:    114950144 | elapsed time per iteration (s): 15.20 | learning rate: 1.839E-05 | global batch size:    16 | lm loss: 5.617841E+00 | grad norm: 0.639 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3509/  128728 | consumed samples:        56144 | consumed tokens:    114982912 | elapsed time per iteration (s): 15.22 | learning rate: 1.840E-05 | global batch size:    16 | lm loss: 5.481423E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3510/  128728 | consumed samples:        56160 | consumed tokens:    115015680 | elapsed time per iteration (s): 15.21 | learning rate: 1.840E-05 | global batch size:    16 | lm loss: 5.751305E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3511/  128728 | consumed samples:        56176 | consumed tokens:    115048448 | elapsed time per iteration (s): 15.19 | learning rate: 1.841E-05 | global batch size:    16 | lm loss: 5.546777E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3512/  128728 | consumed samples:        56192 | consumed tokens:    115081216 | elapsed time per iteration (s): 15.21 | learning rate: 1.841E-05 | global batch size:    16 | lm loss: 5.549484E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3513/  128728 | consumed samples:        56208 | consumed tokens:    115113984 | elapsed time per iteration (s): 15.19 | learning rate: 1.842E-05 | global batch size:    16 | lm loss: 5.823066E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3514/  128728 | consumed samples:        56224 | consumed tokens:    115146752 | elapsed time per iteration (s): 15.21 | learning rate: 1.842E-05 | global batch size:    16 | lm loss: 5.627325E+00 | grad norm: 0.633 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3515/  128728 | consumed samples:        56240 | consumed tokens:    115179520 | elapsed time per iteration (s): 15.20 | learning rate: 1.843E-05 | global batch size:    16 | lm loss: 5.750383E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3516/  128728 | consumed samples:        56256 | consumed tokens:    115212288 | elapsed time per iteration (s): 15.20 | learning rate: 1.843E-05 | global batch size:    16 | lm loss: 5.526446E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3517/  128728 | consumed samples:        56272 | consumed tokens:    115245056 | elapsed time per iteration (s): 15.23 | learning rate: 1.844E-05 | global batch size:    16 | lm loss: 5.721191E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3518/  128728 | consumed samples:        56288 | consumed tokens:    115277824 | elapsed time per iteration (s): 15.19 | learning rate: 1.844E-05 | global batch size:    16 | lm loss: 5.594088E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3519/  128728 | consumed samples:        56304 | consumed tokens:    115310592 | elapsed time per iteration (s): 15.24 | learning rate: 1.845E-05 | global batch size:    16 | lm loss: 5.617053E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3520/  128728 | consumed samples:        56320 | consumed tokens:    115343360 | elapsed time per iteration (s): 15.24 | learning rate: 1.845E-05 | global batch size:    16 | lm loss: 5.854468E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3521/  128728 | consumed samples:        56336 | consumed tokens:    115376128 | elapsed time per iteration (s): 15.23 | learning rate: 1.846E-05 | global batch size:    16 | lm loss: 5.725595E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3522/  128728 | consumed samples:        56352 | consumed tokens:    115408896 | elapsed time per iteration (s): 15.22 | learning rate: 1.847E-05 | global batch size:    16 | lm loss: 5.628036E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3523/  128728 | consumed samples:        56368 | consumed tokens:    115441664 | elapsed time per iteration (s): 15.24 | learning rate: 1.847E-05 | global batch size:    16 | lm loss: 5.498308E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3524/  128728 | consumed samples:        56384 | consumed tokens:    115474432 | elapsed time per iteration (s): 15.19 | learning rate: 1.848E-05 | global batch size:    16 | lm loss: 5.595693E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3525/  128728 | consumed samples:        56400 | consumed tokens:    115507200 | elapsed time per iteration (s): 15.21 | learning rate: 1.848E-05 | global batch size:    16 | lm loss: 5.580093E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3526/  128728 | consumed samples:        56416 | consumed tokens:    115539968 | elapsed time per iteration (s): 15.23 | learning rate: 1.849E-05 | global batch size:    16 | lm loss: 5.568558E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3527/  128728 | consumed samples:        56432 | consumed tokens:    115572736 | elapsed time per iteration (s): 15.24 | learning rate: 1.849E-05 | global batch size:    16 | lm loss: 5.609416E+00 | grad norm: 0.660 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3528/  128728 | consumed samples:        56448 | consumed tokens:    115605504 | elapsed time per iteration (s): 15.18 | learning rate: 1.850E-05 | global batch size:    16 | lm loss: 5.554018E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3529/  128728 | consumed samples:        56464 | consumed tokens:    115638272 | elapsed time per iteration (s): 15.16 | learning rate: 1.850E-05 | global batch size:    16 | lm loss: 5.508449E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3530/  128728 | consumed samples:        56480 | consumed tokens:    115671040 | elapsed time per iteration (s): 15.23 | learning rate: 1.851E-05 | global batch size:    16 | lm loss: 5.647694E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3531/  128728 | consumed samples:        56496 | consumed tokens:    115703808 | elapsed time per iteration (s): 15.18 | learning rate: 1.851E-05 | global batch size:    16 | lm loss: 5.667585E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3532/  128728 | consumed samples:        56512 | consumed tokens:    115736576 | elapsed time per iteration (s): 15.20 | learning rate: 1.852E-05 | global batch size:    16 | lm loss: 5.511345E+00 | grad norm: 0.908 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3533/  128728 | consumed samples:        56528 | consumed tokens:    115769344 | elapsed time per iteration (s): 15.18 | learning rate: 1.852E-05 | global batch size:    16 | lm loss: 5.457089E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3534/  128728 | consumed samples:        56544 | consumed tokens:    115802112 | elapsed time per iteration (s): 15.18 | learning rate: 1.853E-05 | global batch size:    16 | lm loss: 5.487876E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3535/  128728 | consumed samples:        56560 | consumed tokens:    115834880 | elapsed time per iteration (s): 15.24 | learning rate: 1.853E-05 | global batch size:    16 | lm loss: 5.762383E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3536/  128728 | consumed samples:        56576 | consumed tokens:    115867648 | elapsed time per iteration (s): 15.20 | learning rate: 1.854E-05 | global batch size:    16 | lm loss: 5.579982E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3537/  128728 | consumed samples:        56592 | consumed tokens:    115900416 | elapsed time per iteration (s): 15.24 | learning rate: 1.854E-05 | global batch size:    16 | lm loss: 5.651605E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3538/  128728 | consumed samples:        56608 | consumed tokens:    115933184 | elapsed time per iteration (s): 15.20 | learning rate: 1.855E-05 | global batch size:    16 | lm loss: 5.665345E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3539/  128728 | consumed samples:        56624 | consumed tokens:    115965952 | elapsed time per iteration (s): 15.18 | learning rate: 1.855E-05 | global batch size:    16 | lm loss: 5.426301E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3540/  128728 | consumed samples:        56640 | consumed tokens:    115998720 | elapsed time per iteration (s): 15.20 | learning rate: 1.856E-05 | global batch size:    16 | lm loss: 5.570403E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3541/  128728 | consumed samples:        56656 | consumed tokens:    116031488 | elapsed time per iteration (s): 15.18 | learning rate: 1.857E-05 | global batch size:    16 | lm loss: 5.609330E+00 | grad norm: 0.893 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3542/  128728 | consumed samples:        56672 | consumed tokens:    116064256 | elapsed time per iteration (s): 15.21 | learning rate: 1.857E-05 | global batch size:    16 | lm loss: 5.696312E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3543/  128728 | consumed samples:        56688 | consumed tokens:    116097024 | elapsed time per iteration (s): 15.18 | learning rate: 1.858E-05 | global batch size:    16 | lm loss: 5.457860E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3544/  128728 | consumed samples:        56704 | consumed tokens:    116129792 | elapsed time per iteration (s): 15.22 | learning rate: 1.858E-05 | global batch size:    16 | lm loss: 5.377970E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3545/  128728 | consumed samples:        56720 | consumed tokens:    116162560 | elapsed time per iteration (s): 15.14 | learning rate: 1.859E-05 | global batch size:    16 | lm loss: 5.565271E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3546/  128728 | consumed samples:        56736 | consumed tokens:    116195328 | elapsed time per iteration (s): 15.19 | learning rate: 1.859E-05 | global batch size:    16 | lm loss: 5.541815E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3547/  128728 | consumed samples:        56752 | consumed tokens:    116228096 | elapsed time per iteration (s): 15.22 | learning rate: 1.860E-05 | global batch size:    16 | lm loss: 5.579144E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3548/  128728 | consumed samples:        56768 | consumed tokens:    116260864 | elapsed time per iteration (s): 15.24 | learning rate: 1.860E-05 | global batch size:    16 | lm loss: 5.499104E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3549/  128728 | consumed samples:        56784 | consumed tokens:    116293632 | elapsed time per iteration (s): 15.21 | learning rate: 1.861E-05 | global batch size:    16 | lm loss: 5.444351E+00 | grad norm: 1.107 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3550/  128728 | consumed samples:        56800 | consumed tokens:    116326400 | elapsed time per iteration (s): 15.25 | learning rate: 1.861E-05 | global batch size:    16 | lm loss: 5.384247E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     3551/  128728 | consumed samples:        56816 | consumed tokens:    116359168 | elapsed time per iteration (s): 15.25 | learning rate: 1.862E-05 | global batch size:    16 | lm loss: 5.644943E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3552/  128728 | consumed samples:        56832 | consumed tokens:    116391936 | elapsed time per iteration (s): 15.14 | learning rate: 1.862E-05 | global batch size:    16 | lm loss: 5.620580E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3553/  128728 | consumed samples:        56848 | consumed tokens:    116424704 | elapsed time per iteration (s): 15.19 | learning rate: 1.863E-05 | global batch size:    16 | lm loss: 5.781569E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3554/  128728 | consumed samples:        56864 | consumed tokens:    116457472 | elapsed time per iteration (s): 15.17 | learning rate: 1.863E-05 | global batch size:    16 | lm loss: 5.655607E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3555/  128728 | consumed samples:        56880 | consumed tokens:    116490240 | elapsed time per iteration (s): 15.15 | learning rate: 1.864E-05 | global batch size:    16 | lm loss: 5.440409E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3556/  128728 | consumed samples:        56896 | consumed tokens:    116523008 | elapsed time per iteration (s): 15.21 | learning rate: 1.864E-05 | global batch size:    16 | lm loss: 5.547821E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3557/  128728 | consumed samples:        56912 | consumed tokens:    116555776 | elapsed time per iteration (s): 15.21 | learning rate: 1.865E-05 | global batch size:    16 | lm loss: 5.477743E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3558/  128728 | consumed samples:        56928 | consumed tokens:    116588544 | elapsed time per iteration (s): 15.18 | learning rate: 1.865E-05 | global batch size:    16 | lm loss: 5.752525E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3559/  128728 | consumed samples:        56944 | consumed tokens:    116621312 | elapsed time per iteration (s): 15.21 | learning rate: 1.866E-05 | global batch size:    16 | lm loss: 5.561419E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3560/  128728 | consumed samples:        56960 | consumed tokens:    116654080 | elapsed time per iteration (s): 15.21 | learning rate: 1.866E-05 | global batch size:    16 | lm loss: 5.594239E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3561/  128728 | consumed samples:        56976 | consumed tokens:    116686848 | elapsed time per iteration (s): 15.22 | learning rate: 1.867E-05 | global batch size:    16 | lm loss: 5.820959E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3562/  128728 | consumed samples:        56992 | consumed tokens:    116719616 | elapsed time per iteration (s): 15.22 | learning rate: 1.868E-05 | global batch size:    16 | lm loss: 5.637988E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3563/  128728 | consumed samples:        57008 | consumed tokens:    116752384 | elapsed time per iteration (s): 15.21 | learning rate: 1.868E-05 | global batch size:    16 | lm loss: 5.698305E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3564/  128728 | consumed samples:        57024 | consumed tokens:    116785152 | elapsed time per iteration (s): 15.19 | learning rate: 1.869E-05 | global batch size:    16 | lm loss: 5.562874E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3565/  128728 | consumed samples:        57040 | consumed tokens:    116817920 | elapsed time per iteration (s): 15.22 | learning rate: 1.869E-05 | global batch size:    16 | lm loss: 5.334938E+00 | grad norm: 1.181 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3566/  128728 | consumed samples:        57056 | consumed tokens:    116850688 | elapsed time per iteration (s): 15.19 | learning rate: 1.870E-05 | global batch size:    16 | lm loss: 5.632464E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3567/  128728 | consumed samples:        57072 | consumed tokens:    116883456 | elapsed time per iteration (s): 15.17 | learning rate: 1.870E-05 | global batch size:    16 | lm loss: 5.528159E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3568/  128728 | consumed samples:        57088 | consumed tokens:    116916224 | elapsed time per iteration (s): 15.23 | learning rate: 1.871E-05 | global batch size:    16 | lm loss: 5.648876E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3569/  128728 | consumed samples:        57104 | consumed tokens:    116948992 | elapsed time per iteration (s): 15.22 | learning rate: 1.871E-05 | global batch size:    16 | lm loss: 5.652656E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3570/  128728 | consumed samples:        57120 | consumed tokens:    116981760 | elapsed time per iteration (s): 15.17 | learning rate: 1.872E-05 | global batch size:    16 | lm loss: 5.550305E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3571/  128728 | consumed samples:        57136 | consumed tokens:    117014528 | elapsed time per iteration (s): 15.19 | learning rate: 1.872E-05 | global batch size:    16 | lm loss: 5.244218E+00 | grad norm: 0.892 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3572/  128728 | consumed samples:        57152 | consumed tokens:    117047296 | elapsed time per iteration (s): 15.23 | learning rate: 1.873E-05 | global batch size:    16 | lm loss: 5.495933E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3573/  128728 | consumed samples:        57168 | consumed tokens:    117080064 | elapsed time per iteration (s): 15.20 | learning rate: 1.873E-05 | global batch size:    16 | lm loss: 5.597926E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3574/  128728 | consumed samples:        57184 | consumed tokens:    117112832 | elapsed time per iteration (s): 15.24 | learning rate: 1.874E-05 | global batch size:    16 | lm loss: 5.457273E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3575/  128728 | consumed samples:        57200 | consumed tokens:    117145600 | elapsed time per iteration (s): 15.14 | learning rate: 1.874E-05 | global batch size:    16 | lm loss: 5.458507E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3576/  128728 | consumed samples:        57216 | consumed tokens:    117178368 | elapsed time per iteration (s): 15.24 | learning rate: 1.875E-05 | global batch size:    16 | lm loss: 5.373246E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3577/  128728 | consumed samples:        57232 | consumed tokens:    117211136 | elapsed time per iteration (s): 15.15 | learning rate: 1.875E-05 | global batch size:    16 | lm loss: 5.753658E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3578/  128728 | consumed samples:        57248 | consumed tokens:    117243904 | elapsed time per iteration (s): 15.24 | learning rate: 1.876E-05 | global batch size:    16 | lm loss: 5.383017E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3579/  128728 | consumed samples:        57264 | consumed tokens:    117276672 | elapsed time per iteration (s): 15.23 | learning rate: 1.876E-05 | global batch size:    16 | lm loss: 5.562088E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3580/  128728 | consumed samples:        57280 | consumed tokens:    117309440 | elapsed time per iteration (s): 15.22 | learning rate: 1.877E-05 | global batch size:    16 | lm loss: 5.501311E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3581/  128728 | consumed samples:        57296 | consumed tokens:    117342208 | elapsed time per iteration (s): 15.23 | learning rate: 1.877E-05 | global batch size:    16 | lm loss: 5.498442E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3582/  128728 | consumed samples:        57312 | consumed tokens:    117374976 | elapsed time per iteration (s): 15.20 | learning rate: 1.878E-05 | global batch size:    16 | lm loss: 5.639647E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3583/  128728 | consumed samples:        57328 | consumed tokens:    117407744 | elapsed time per iteration (s): 15.21 | learning rate: 1.879E-05 | global batch size:    16 | lm loss: 5.382240E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3584/  128728 | consumed samples:        57344 | consumed tokens:    117440512 | elapsed time per iteration (s): 15.22 | learning rate: 1.879E-05 | global batch size:    16 | lm loss: 5.583954E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3585/  128728 | consumed samples:        57360 | consumed tokens:    117473280 | elapsed time per iteration (s): 15.23 | learning rate: 1.880E-05 | global batch size:    16 | lm loss: 5.507063E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3586/  128728 | consumed samples:        57376 | consumed tokens:    117506048 | elapsed time per iteration (s): 15.20 | learning rate: 1.880E-05 | global batch size:    16 | lm loss: 5.361601E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3587/  128728 | consumed samples:        57392 | consumed tokens:    117538816 | elapsed time per iteration (s): 15.17 | learning rate: 1.881E-05 | global batch size:    16 | lm loss: 5.580978E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3588/  128728 | consumed samples:        57408 | consumed tokens:    117571584 | elapsed time per iteration (s): 15.22 | learning rate: 1.881E-05 | global batch size:    16 | lm loss: 5.566353E+00 | grad norm: 1.166 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3589/  128728 | consumed samples:        57424 | consumed tokens:    117604352 | elapsed time per iteration (s): 15.21 | learning rate: 1.882E-05 | global batch size:    16 | lm loss: 5.816396E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3590/  128728 | consumed samples:        57440 | consumed tokens:    117637120 | elapsed time per iteration (s): 15.17 | learning rate: 1.882E-05 | global batch size:    16 | lm loss: 5.590647E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3591/  128728 | consumed samples:        57456 | consumed tokens:    117669888 | elapsed time per iteration (s): 15.21 | learning rate: 1.883E-05 | global batch size:    16 | lm loss: 5.503424E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3592/  128728 | consumed samples:        57472 | consumed tokens:    117702656 | elapsed time per iteration (s): 15.20 | learning rate: 1.883E-05 | global batch size:    16 | lm loss: 5.546864E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3593/  128728 | consumed samples:        57488 | consumed tokens:    117735424 | elapsed time per iteration (s): 15.17 | learning rate: 1.884E-05 | global batch size:    16 | lm loss: 5.547342E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3594/  128728 | consumed samples:        57504 | consumed tokens:    117768192 | elapsed time per iteration (s): 15.21 | learning rate: 1.884E-05 | global batch size:    16 | lm loss: 5.557863E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3595/  128728 | consumed samples:        57520 | consumed tokens:    117800960 | elapsed time per iteration (s): 15.20 | learning rate: 1.885E-05 | global batch size:    16 | lm loss: 5.362972E+00 | grad norm: 1.171 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3596/  128728 | consumed samples:        57536 | consumed tokens:    117833728 | elapsed time per iteration (s): 15.22 | learning rate: 1.885E-05 | global batch size:    16 | lm loss: 5.553192E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3597/  128728 | consumed samples:        57552 | consumed tokens:    117866496 | elapsed time per iteration (s): 15.23 | learning rate: 1.886E-05 | global batch size:    16 | lm loss: 5.183071E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3598/  128728 | consumed samples:        57568 | consumed tokens:    117899264 | elapsed time per iteration (s): 15.22 | learning rate: 1.886E-05 | global batch size:    16 | lm loss: 5.619958E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3599/  128728 | consumed samples:        57584 | consumed tokens:    117932032 | elapsed time per iteration (s): 15.24 | learning rate: 1.887E-05 | global batch size:    16 | lm loss: 5.533691E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3600/  128728 | consumed samples:        57600 | consumed tokens:    117964800 | elapsed time per iteration (s): 15.23 | learning rate: 1.887E-05 | global batch size:    16 | lm loss: 5.799161E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3601/  128728 | consumed samples:        57616 | consumed tokens:    117997568 | elapsed time per iteration (s): 15.19 | learning rate: 1.888E-05 | global batch size:    16 | lm loss: 5.572991E+00 | grad norm: 0.971 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3602/  128728 | consumed samples:        57632 | consumed tokens:    118030336 | elapsed time per iteration (s): 15.22 | learning rate: 1.888E-05 | global batch size:    16 | lm loss: 5.398020E+00 | grad norm: 0.659 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3603/  128728 | consumed samples:        57648 | consumed tokens:    118063104 | elapsed time per iteration (s): 15.19 | learning rate: 1.889E-05 | global batch size:    16 | lm loss: 5.634165E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3604/  128728 | consumed samples:        57664 | consumed tokens:    118095872 | elapsed time per iteration (s): 15.21 | learning rate: 1.890E-05 | global batch size:    16 | lm loss: 5.592557E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3605/  128728 | consumed samples:        57680 | consumed tokens:    118128640 | elapsed time per iteration (s): 15.15 | learning rate: 1.890E-05 | global batch size:    16 | lm loss: 5.571424E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3606/  128728 | consumed samples:        57696 | consumed tokens:    118161408 | elapsed time per iteration (s): 15.17 | learning rate: 1.891E-05 | global batch size:    16 | lm loss: 5.605553E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3607/  128728 | consumed samples:        57712 | consumed tokens:    118194176 | elapsed time per iteration (s): 15.20 | learning rate: 1.891E-05 | global batch size:    16 | lm loss: 5.751481E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3608/  128728 | consumed samples:        57728 | consumed tokens:    118226944 | elapsed time per iteration (s): 15.20 | learning rate: 1.892E-05 | global batch size:    16 | lm loss: 5.682261E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3609/  128728 | consumed samples:        57744 | consumed tokens:    118259712 | elapsed time per iteration (s): 15.17 | learning rate: 1.892E-05 | global batch size:    16 | lm loss: 5.607131E+00 | grad norm: 1.145 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3610/  128728 | consumed samples:        57760 | consumed tokens:    118292480 | elapsed time per iteration (s): 15.20 | learning rate: 1.893E-05 | global batch size:    16 | lm loss: 5.541697E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3611/  128728 | consumed samples:        57776 | consumed tokens:    118325248 | elapsed time per iteration (s): 15.24 | learning rate: 1.893E-05 | global batch size:    16 | lm loss: 5.603507E+00 | grad norm: 1.274 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3612/  128728 | consumed samples:        57792 | consumed tokens:    118358016 | elapsed time per iteration (s): 15.22 | learning rate: 1.894E-05 | global batch size:    16 | lm loss: 5.721704E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3613/  128728 | consumed samples:        57808 | consumed tokens:    118390784 | elapsed time per iteration (s): 15.23 | learning rate: 1.894E-05 | global batch size:    16 | lm loss: 5.661789E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3614/  128728 | consumed samples:        57824 | consumed tokens:    118423552 | elapsed time per iteration (s): 15.23 | learning rate: 1.895E-05 | global batch size:    16 | lm loss: 5.765802E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3615/  128728 | consumed samples:        57840 | consumed tokens:    118456320 | elapsed time per iteration (s): 15.22 | learning rate: 1.895E-05 | global batch size:    16 | lm loss: 5.475472E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3616/  128728 | consumed samples:        57856 | consumed tokens:    118489088 | elapsed time per iteration (s): 15.22 | learning rate: 1.896E-05 | global batch size:    16 | lm loss: 5.469672E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3617/  128728 | consumed samples:        57872 | consumed tokens:    118521856 | elapsed time per iteration (s): 15.22 | learning rate: 1.896E-05 | global batch size:    16 | lm loss: 5.555403E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3618/  128728 | consumed samples:        57888 | consumed tokens:    118554624 | elapsed time per iteration (s): 15.21 | learning rate: 1.897E-05 | global batch size:    16 | lm loss: 5.757840E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3619/  128728 | consumed samples:        57904 | consumed tokens:    118587392 | elapsed time per iteration (s): 15.20 | learning rate: 1.897E-05 | global batch size:    16 | lm loss: 5.454224E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3620/  128728 | consumed samples:        57920 | consumed tokens:    118620160 | elapsed time per iteration (s): 15.24 | learning rate: 1.898E-05 | global batch size:    16 | lm loss: 5.460718E+00 | grad norm: 1.001 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3621/  128728 | consumed samples:        57936 | consumed tokens:    118652928 | elapsed time per iteration (s): 15.23 | learning rate: 1.898E-05 | global batch size:    16 | lm loss: 5.752840E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3622/  128728 | consumed samples:        57952 | consumed tokens:    118685696 | elapsed time per iteration (s): 15.22 | learning rate: 1.899E-05 | global batch size:    16 | lm loss: 5.772221E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3623/  128728 | consumed samples:        57968 | consumed tokens:    118718464 | elapsed time per iteration (s): 15.20 | learning rate: 1.900E-05 | global batch size:    16 | lm loss: 5.500217E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3624/  128728 | consumed samples:        57984 | consumed tokens:    118751232 | elapsed time per iteration (s): 15.20 | learning rate: 1.900E-05 | global batch size:    16 | lm loss: 5.437232E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3625/  128728 | consumed samples:        58000 | consumed tokens:    118784000 | elapsed time per iteration (s): 15.23 | learning rate: 1.901E-05 | global batch size:    16 | lm loss: 5.481465E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3626/  128728 | consumed samples:        58016 | consumed tokens:    118816768 | elapsed time per iteration (s): 15.24 | learning rate: 1.901E-05 | global batch size:    16 | lm loss: 5.507442E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3627/  128728 | consumed samples:        58032 | consumed tokens:    118849536 | elapsed time per iteration (s): 15.24 | learning rate: 1.902E-05 | global batch size:    16 | lm loss: 5.689624E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3628/  128728 | consumed samples:        58048 | consumed tokens:    118882304 | elapsed time per iteration (s): 15.21 | learning rate: 1.902E-05 | global batch size:    16 | lm loss: 5.502779E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3629/  128728 | consumed samples:        58064 | consumed tokens:    118915072 | elapsed time per iteration (s): 15.20 | learning rate: 1.903E-05 | global batch size:    16 | lm loss: 5.628727E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3630/  128728 | consumed samples:        58080 | consumed tokens:    118947840 | elapsed time per iteration (s): 15.20 | learning rate: 1.903E-05 | global batch size:    16 | lm loss: 5.490268E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3631/  128728 | consumed samples:        58096 | consumed tokens:    118980608 | elapsed time per iteration (s): 15.25 | learning rate: 1.904E-05 | global batch size:    16 | lm loss: 5.512156E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3632/  128728 | consumed samples:        58112 | consumed tokens:    119013376 | elapsed time per iteration (s): 15.21 | learning rate: 1.904E-05 | global batch size:    16 | lm loss: 5.381227E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3633/  128728 | consumed samples:        58128 | consumed tokens:    119046144 | elapsed time per iteration (s): 15.19 | learning rate: 1.905E-05 | global batch size:    16 | lm loss: 5.426307E+00 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3634/  128728 | consumed samples:        58144 | consumed tokens:    119078912 | elapsed time per iteration (s): 15.27 | learning rate: 1.905E-05 | global batch size:    16 | lm loss: 5.770047E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     3635/  128728 | consumed samples:        58160 | consumed tokens:    119111680 | elapsed time per iteration (s): 15.20 | learning rate: 1.906E-05 | global batch size:    16 | lm loss: 5.419781E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3636/  128728 | consumed samples:        58176 | consumed tokens:    119144448 | elapsed time per iteration (s): 15.22 | learning rate: 1.906E-05 | global batch size:    16 | lm loss: 5.744234E+00 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3637/  128728 | consumed samples:        58192 | consumed tokens:    119177216 | elapsed time per iteration (s): 15.25 | learning rate: 1.907E-05 | global batch size:    16 | lm loss: 5.680465E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3638/  128728 | consumed samples:        58208 | consumed tokens:    119209984 | elapsed time per iteration (s): 15.21 | learning rate: 1.907E-05 | global batch size:    16 | lm loss: 5.462650E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3639/  128728 | consumed samples:        58224 | consumed tokens:    119242752 | elapsed time per iteration (s): 15.23 | learning rate: 1.908E-05 | global batch size:    16 | lm loss: 5.425622E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3640/  128728 | consumed samples:        58240 | consumed tokens:    119275520 | elapsed time per iteration (s): 15.20 | learning rate: 1.908E-05 | global batch size:    16 | lm loss: 5.565685E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3641/  128728 | consumed samples:        58256 | consumed tokens:    119308288 | elapsed time per iteration (s): 15.24 | learning rate: 1.909E-05 | global batch size:    16 | lm loss: 5.555475E+00 | grad norm: 1.392 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3642/  128728 | consumed samples:        58272 | consumed tokens:    119341056 | elapsed time per iteration (s): 15.23 | learning rate: 1.909E-05 | global batch size:    16 | lm loss: 5.856975E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3643/  128728 | consumed samples:        58288 | consumed tokens:    119373824 | elapsed time per iteration (s): 15.16 | learning rate: 1.910E-05 | global batch size:    16 | lm loss: 5.520800E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3644/  128728 | consumed samples:        58304 | consumed tokens:    119406592 | elapsed time per iteration (s): 15.17 | learning rate: 1.911E-05 | global batch size:    16 | lm loss: 5.231161E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3645/  128728 | consumed samples:        58320 | consumed tokens:    119439360 | elapsed time per iteration (s): 15.20 | learning rate: 1.911E-05 | global batch size:    16 | lm loss: 5.715312E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3646/  128728 | consumed samples:        58336 | consumed tokens:    119472128 | elapsed time per iteration (s): 15.21 | learning rate: 1.912E-05 | global batch size:    16 | lm loss: 5.264447E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3647/  128728 | consumed samples:        58352 | consumed tokens:    119504896 | elapsed time per iteration (s): 15.23 | learning rate: 1.912E-05 | global batch size:    16 | lm loss: 5.607737E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3648/  128728 | consumed samples:        58368 | consumed tokens:    119537664 | elapsed time per iteration (s): 15.16 | learning rate: 1.913E-05 | global batch size:    16 | lm loss: 5.569743E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3649/  128728 | consumed samples:        58384 | consumed tokens:    119570432 | elapsed time per iteration (s): 15.18 | learning rate: 1.913E-05 | global batch size:    16 | lm loss: 5.634804E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3650/  128728 | consumed samples:        58400 | consumed tokens:    119603200 | elapsed time per iteration (s): 15.20 | learning rate: 1.914E-05 | global batch size:    16 | lm loss: 5.602137E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3651/  128728 | consumed samples:        58416 | consumed tokens:    119635968 | elapsed time per iteration (s): 15.22 | learning rate: 1.914E-05 | global batch size:    16 | lm loss: 5.597826E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3652/  128728 | consumed samples:        58432 | consumed tokens:    119668736 | elapsed time per iteration (s): 15.21 | learning rate: 1.915E-05 | global batch size:    16 | lm loss: 5.697678E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3653/  128728 | consumed samples:        58448 | consumed tokens:    119701504 | elapsed time per iteration (s): 15.24 | learning rate: 1.915E-05 | global batch size:    16 | lm loss: 6.026344E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3654/  128728 | consumed samples:        58464 | consumed tokens:    119734272 | elapsed time per iteration (s): 15.15 | learning rate: 1.916E-05 | global batch size:    16 | lm loss: 5.696335E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3655/  128728 | consumed samples:        58480 | consumed tokens:    119767040 | elapsed time per iteration (s): 15.23 | learning rate: 1.916E-05 | global batch size:    16 | lm loss: 5.686172E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3656/  128728 | consumed samples:        58496 | consumed tokens:    119799808 | elapsed time per iteration (s): 15.23 | learning rate: 1.917E-05 | global batch size:    16 | lm loss: 5.681462E+00 | grad norm: 1.222 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3657/  128728 | consumed samples:        58512 | consumed tokens:    119832576 | elapsed time per iteration (s): 15.22 | learning rate: 1.917E-05 | global batch size:    16 | lm loss: 5.347002E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3658/  128728 | consumed samples:        58528 | consumed tokens:    119865344 | elapsed time per iteration (s): 15.20 | learning rate: 1.918E-05 | global batch size:    16 | lm loss: 5.446877E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3659/  128728 | consumed samples:        58544 | consumed tokens:    119898112 | elapsed time per iteration (s): 15.22 | learning rate: 1.918E-05 | global batch size:    16 | lm loss: 5.406514E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3660/  128728 | consumed samples:        58560 | consumed tokens:    119930880 | elapsed time per iteration (s): 15.20 | learning rate: 1.919E-05 | global batch size:    16 | lm loss: 5.452915E+00 | grad norm: 1.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3661/  128728 | consumed samples:        58576 | consumed tokens:    119963648 | elapsed time per iteration (s): 15.20 | learning rate: 1.919E-05 | global batch size:    16 | lm loss: 5.661624E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3662/  128728 | consumed samples:        58592 | consumed tokens:    119996416 | elapsed time per iteration (s): 15.23 | learning rate: 1.920E-05 | global batch size:    16 | lm loss: 5.449157E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3663/  128728 | consumed samples:        58608 | consumed tokens:    120029184 | elapsed time per iteration (s): 15.23 | learning rate: 1.920E-05 | global batch size:    16 | lm loss: 5.559745E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3664/  128728 | consumed samples:        58624 | consumed tokens:    120061952 | elapsed time per iteration (s): 15.21 | learning rate: 1.921E-05 | global batch size:    16 | lm loss: 5.657228E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3665/  128728 | consumed samples:        58640 | consumed tokens:    120094720 | elapsed time per iteration (s): 15.20 | learning rate: 1.922E-05 | global batch size:    16 | lm loss: 5.547557E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3666/  128728 | consumed samples:        58656 | consumed tokens:    120127488 | elapsed time per iteration (s): 15.19 | learning rate: 1.922E-05 | global batch size:    16 | lm loss: 5.483784E+00 | grad norm: 0.669 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3667/  128728 | consumed samples:        58672 | consumed tokens:    120160256 | elapsed time per iteration (s): 15.23 | learning rate: 1.923E-05 | global batch size:    16 | lm loss: 5.720974E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3668/  128728 | consumed samples:        58688 | consumed tokens:    120193024 | elapsed time per iteration (s): 15.13 | learning rate: 1.923E-05 | global batch size:    16 | lm loss: 5.629973E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3669/  128728 | consumed samples:        58704 | consumed tokens:    120225792 | elapsed time per iteration (s): 15.25 | learning rate: 1.924E-05 | global batch size:    16 | lm loss: 5.490295E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3670/  128728 | consumed samples:        58720 | consumed tokens:    120258560 | elapsed time per iteration (s): 15.20 | learning rate: 1.924E-05 | global batch size:    16 | lm loss: 5.663823E+00 | grad norm: 0.654 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3671/  128728 | consumed samples:        58736 | consumed tokens:    120291328 | elapsed time per iteration (s): 15.23 | learning rate: 1.925E-05 | global batch size:    16 | lm loss: 5.565134E+00 | grad norm: 1.630 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3672/  128728 | consumed samples:        58752 | consumed tokens:    120324096 | elapsed time per iteration (s): 15.27 | learning rate: 1.925E-05 | global batch size:    16 | lm loss: 5.505857E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     3673/  128728 | consumed samples:        58768 | consumed tokens:    120356864 | elapsed time per iteration (s): 15.25 | learning rate: 1.926E-05 | global batch size:    16 | lm loss: 5.505276E+00 | grad norm: 1.118 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3674/  128728 | consumed samples:        58784 | consumed tokens:    120389632 | elapsed time per iteration (s): 15.21 | learning rate: 1.926E-05 | global batch size:    16 | lm loss: 5.554258E+00 | grad norm: 1.052 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3675/  128728 | consumed samples:        58800 | consumed tokens:    120422400 | elapsed time per iteration (s): 15.26 | learning rate: 1.927E-05 | global batch size:    16 | lm loss: 5.709059E+00 | grad norm: 1.228 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     3676/  128728 | consumed samples:        58816 | consumed tokens:    120455168 | elapsed time per iteration (s): 15.23 | learning rate: 1.927E-05 | global batch size:    16 | lm loss: 5.620901E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3677/  128728 | consumed samples:        58832 | consumed tokens:    120487936 | elapsed time per iteration (s): 15.18 | learning rate: 1.928E-05 | global batch size:    16 | lm loss: 5.432440E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3678/  128728 | consumed samples:        58848 | consumed tokens:    120520704 | elapsed time per iteration (s): 15.22 | learning rate: 1.928E-05 | global batch size:    16 | lm loss: 5.577560E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3679/  128728 | consumed samples:        58864 | consumed tokens:    120553472 | elapsed time per iteration (s): 15.23 | learning rate: 1.929E-05 | global batch size:    16 | lm loss: 5.770396E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3680/  128728 | consumed samples:        58880 | consumed tokens:    120586240 | elapsed time per iteration (s): 15.18 | learning rate: 1.929E-05 | global batch size:    16 | lm loss: 5.468989E+00 | grad norm: 1.225 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3681/  128728 | consumed samples:        58896 | consumed tokens:    120619008 | elapsed time per iteration (s): 15.23 | learning rate: 1.930E-05 | global batch size:    16 | lm loss: 5.550355E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3682/  128728 | consumed samples:        58912 | consumed tokens:    120651776 | elapsed time per iteration (s): 15.23 | learning rate: 1.930E-05 | global batch size:    16 | lm loss: 5.747350E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3683/  128728 | consumed samples:        58928 | consumed tokens:    120684544 | elapsed time per iteration (s): 15.23 | learning rate: 1.931E-05 | global batch size:    16 | lm loss: 5.518972E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3684/  128728 | consumed samples:        58944 | consumed tokens:    120717312 | elapsed time per iteration (s): 15.16 | learning rate: 1.931E-05 | global batch size:    16 | lm loss: 5.709407E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3685/  128728 | consumed samples:        58960 | consumed tokens:    120750080 | elapsed time per iteration (s): 15.18 | learning rate: 1.932E-05 | global batch size:    16 | lm loss: 5.524895E+00 | grad norm: 1.254 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3686/  128728 | consumed samples:        58976 | consumed tokens:    120782848 | elapsed time per iteration (s): 15.22 | learning rate: 1.933E-05 | global batch size:    16 | lm loss: 5.538244E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3687/  128728 | consumed samples:        58992 | consumed tokens:    120815616 | elapsed time per iteration (s): 15.22 | learning rate: 1.933E-05 | global batch size:    16 | lm loss: 5.499589E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3688/  128728 | consumed samples:        59008 | consumed tokens:    120848384 | elapsed time per iteration (s): 15.27 | learning rate: 1.934E-05 | global batch size:    16 | lm loss: 5.613136E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     3689/  128728 | consumed samples:        59024 | consumed tokens:    120881152 | elapsed time per iteration (s): 15.22 | learning rate: 1.934E-05 | global batch size:    16 | lm loss: 5.585117E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3690/  128728 | consumed samples:        59040 | consumed tokens:    120913920 | elapsed time per iteration (s): 15.26 | learning rate: 1.935E-05 | global batch size:    16 | lm loss: 5.569749E+00 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     3691/  128728 | consumed samples:        59056 | consumed tokens:    120946688 | elapsed time per iteration (s): 15.23 | learning rate: 1.935E-05 | global batch size:    16 | lm loss: 5.599214E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3692/  128728 | consumed samples:        59072 | consumed tokens:    120979456 | elapsed time per iteration (s): 15.25 | learning rate: 1.936E-05 | global batch size:    16 | lm loss: 5.365727E+00 | grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3693/  128728 | consumed samples:        59088 | consumed tokens:    121012224 | elapsed time per iteration (s): 15.19 | learning rate: 1.936E-05 | global batch size:    16 | lm loss: 5.710306E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3694/  128728 | consumed samples:        59104 | consumed tokens:    121044992 | elapsed time per iteration (s): 15.23 | learning rate: 1.937E-05 | global batch size:    16 | lm loss: 5.315215E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3695/  128728 | consumed samples:        59120 | consumed tokens:    121077760 | elapsed time per iteration (s): 15.23 | learning rate: 1.937E-05 | global batch size:    16 | lm loss: 5.607258E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3696/  128728 | consumed samples:        59136 | consumed tokens:    121110528 | elapsed time per iteration (s): 15.19 | learning rate: 1.938E-05 | global batch size:    16 | lm loss: 5.551528E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3697/  128728 | consumed samples:        59152 | consumed tokens:    121143296 | elapsed time per iteration (s): 15.24 | learning rate: 1.938E-05 | global batch size:    16 | lm loss: 5.566436E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3698/  128728 | consumed samples:        59168 | consumed tokens:    121176064 | elapsed time per iteration (s): 15.20 | learning rate: 1.939E-05 | global batch size:    16 | lm loss: 5.264910E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3699/  128728 | consumed samples:        59184 | consumed tokens:    121208832 | elapsed time per iteration (s): 15.19 | learning rate: 1.939E-05 | global batch size:    16 | lm loss: 5.505784E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3700/  128728 | consumed samples:        59200 | consumed tokens:    121241600 | elapsed time per iteration (s): 15.21 | learning rate: 1.940E-05 | global batch size:    16 | lm loss: 5.399070E+00 | grad norm: 0.827 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3701/  128728 | consumed samples:        59216 | consumed tokens:    121274368 | elapsed time per iteration (s): 15.25 | learning rate: 1.940E-05 | global batch size:    16 | lm loss: 5.653021E+00 | grad norm: 1.537 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3702/  128728 | consumed samples:        59232 | consumed tokens:    121307136 | elapsed time per iteration (s): 15.25 | learning rate: 1.941E-05 | global batch size:    16 | lm loss: 5.541692E+00 | grad norm: 1.023 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3703/  128728 | consumed samples:        59248 | consumed tokens:    121339904 | elapsed time per iteration (s): 15.16 | learning rate: 1.941E-05 | global batch size:    16 | lm loss: 5.357801E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3704/  128728 | consumed samples:        59264 | consumed tokens:    121372672 | elapsed time per iteration (s): 15.23 | learning rate: 1.942E-05 | global batch size:    16 | lm loss: 5.619140E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3705/  128728 | consumed samples:        59280 | consumed tokens:    121405440 | elapsed time per iteration (s): 15.18 | learning rate: 1.942E-05 | global batch size:    16 | lm loss: 5.628579E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3706/  128728 | consumed samples:        59296 | consumed tokens:    121438208 | elapsed time per iteration (s): 15.22 | learning rate: 1.943E-05 | global batch size:    16 | lm loss: 5.582848E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3707/  128728 | consumed samples:        59312 | consumed tokens:    121470976 | elapsed time per iteration (s): 15.23 | learning rate: 1.944E-05 | global batch size:    16 | lm loss: 5.396257E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3708/  128728 | consumed samples:        59328 | consumed tokens:    121503744 | elapsed time per iteration (s): 15.23 | learning rate: 1.944E-05 | global batch size:    16 | lm loss: 5.520443E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3709/  128728 | consumed samples:        59344 | consumed tokens:    121536512 | elapsed time per iteration (s): 15.25 | learning rate: 1.945E-05 | global batch size:    16 | lm loss: 5.484709E+00 | grad norm: 0.971 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3710/  128728 | consumed samples:        59360 | consumed tokens:    121569280 | elapsed time per iteration (s): 15.17 | learning rate: 1.945E-05 | global batch size:    16 | lm loss: 5.345929E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3711/  128728 | consumed samples:        59376 | consumed tokens:    121602048 | elapsed time per iteration (s): 15.24 | learning rate: 1.946E-05 | global batch size:    16 | lm loss: 5.598679E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3712/  128728 | consumed samples:        59392 | consumed tokens:    121634816 | elapsed time per iteration (s): 15.22 | learning rate: 1.946E-05 | global batch size:    16 | lm loss: 5.456141E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3713/  128728 | consumed samples:        59408 | consumed tokens:    121667584 | elapsed time per iteration (s): 15.30 | learning rate: 1.947E-05 | global batch size:    16 | lm loss: 5.482525E+00 | grad norm: 1.067 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     3714/  128728 | consumed samples:        59424 | consumed tokens:    121700352 | elapsed time per iteration (s): 15.25 | learning rate: 1.947E-05 | global batch size:    16 | lm loss: 5.399364E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3715/  128728 | consumed samples:        59440 | consumed tokens:    121733120 | elapsed time per iteration (s): 15.28 | learning rate: 1.948E-05 | global batch size:    16 | lm loss: 5.519878E+00 | grad norm: 1.093 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     3716/  128728 | consumed samples:        59456 | consumed tokens:    121765888 | elapsed time per iteration (s): 15.20 | learning rate: 1.948E-05 | global batch size:    16 | lm loss: 5.493938E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3717/  128728 | consumed samples:        59472 | consumed tokens:    121798656 | elapsed time per iteration (s): 15.22 | learning rate: 1.949E-05 | global batch size:    16 | lm loss: 5.431820E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3718/  128728 | consumed samples:        59488 | consumed tokens:    121831424 | elapsed time per iteration (s): 15.22 | learning rate: 1.949E-05 | global batch size:    16 | lm loss: 5.457542E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3719/  128728 | consumed samples:        59504 | consumed tokens:    121864192 | elapsed time per iteration (s): 15.21 | learning rate: 1.950E-05 | global batch size:    16 | lm loss: 5.463506E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3720/  128728 | consumed samples:        59520 | consumed tokens:    121896960 | elapsed time per iteration (s): 15.21 | learning rate: 1.950E-05 | global batch size:    16 | lm loss: 5.468750E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3721/  128728 | consumed samples:        59536 | consumed tokens:    121929728 | elapsed time per iteration (s): 15.21 | learning rate: 1.951E-05 | global batch size:    16 | lm loss: 5.492259E+00 | grad norm: 1.078 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3722/  128728 | consumed samples:        59552 | consumed tokens:    121962496 | elapsed time per iteration (s): 15.24 | learning rate: 1.951E-05 | global batch size:    16 | lm loss: 5.689316E+00 | grad norm: 2.225 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3723/  128728 | consumed samples:        59568 | consumed tokens:    121995264 | elapsed time per iteration (s): 15.20 | learning rate: 1.952E-05 | global batch size:    16 | lm loss: 5.625181E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3724/  128728 | consumed samples:        59584 | consumed tokens:    122028032 | elapsed time per iteration (s): 15.23 | learning rate: 1.952E-05 | global batch size:    16 | lm loss: 5.407440E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3725/  128728 | consumed samples:        59600 | consumed tokens:    122060800 | elapsed time per iteration (s): 15.24 | learning rate: 1.953E-05 | global batch size:    16 | lm loss: 5.463798E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3726/  128728 | consumed samples:        59616 | consumed tokens:    122093568 | elapsed time per iteration (s): 15.18 | learning rate: 1.954E-05 | global batch size:    16 | lm loss: 5.562949E+00 | grad norm: 0.991 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3727/  128728 | consumed samples:        59632 | consumed tokens:    122126336 | elapsed time per iteration (s): 15.23 | learning rate: 1.954E-05 | global batch size:    16 | lm loss: 5.715884E+00 | grad norm: 1.345 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3728/  128728 | consumed samples:        59648 | consumed tokens:    122159104 | elapsed time per iteration (s): 15.23 | learning rate: 1.955E-05 | global batch size:    16 | lm loss: 5.560648E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3729/  128728 | consumed samples:        59664 | consumed tokens:    122191872 | elapsed time per iteration (s): 15.21 | learning rate: 1.955E-05 | global batch size:    16 | lm loss: 5.648838E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3730/  128728 | consumed samples:        59680 | consumed tokens:    122224640 | elapsed time per iteration (s): 15.23 | learning rate: 1.956E-05 | global batch size:    16 | lm loss: 5.286009E+00 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3731/  128728 | consumed samples:        59696 | consumed tokens:    122257408 | elapsed time per iteration (s): 15.24 | learning rate: 1.956E-05 | global batch size:    16 | lm loss: 5.616099E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3732/  128728 | consumed samples:        59712 | consumed tokens:    122290176 | elapsed time per iteration (s): 15.19 | learning rate: 1.957E-05 | global batch size:    16 | lm loss: 5.448312E+00 | grad norm: 1.079 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3733/  128728 | consumed samples:        59728 | consumed tokens:    122322944 | elapsed time per iteration (s): 15.27 | learning rate: 1.957E-05 | global batch size:    16 | lm loss: 5.397096E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     3734/  128728 | consumed samples:        59744 | consumed tokens:    122355712 | elapsed time per iteration (s): 15.17 | learning rate: 1.958E-05 | global batch size:    16 | lm loss: 5.431047E+00 | grad norm: 1.160 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3735/  128728 | consumed samples:        59760 | consumed tokens:    122388480 | elapsed time per iteration (s): 15.19 | learning rate: 1.958E-05 | global batch size:    16 | lm loss: 5.394694E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3736/  128728 | consumed samples:        59776 | consumed tokens:    122421248 | elapsed time per iteration (s): 15.22 | learning rate: 1.959E-05 | global batch size:    16 | lm loss: 5.561465E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3737/  128728 | consumed samples:        59792 | consumed tokens:    122454016 | elapsed time per iteration (s): 15.24 | learning rate: 1.959E-05 | global batch size:    16 | lm loss: 5.591651E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3738/  128728 | consumed samples:        59808 | consumed tokens:    122486784 | elapsed time per iteration (s): 15.15 | learning rate: 1.960E-05 | global batch size:    16 | lm loss: 5.337072E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3739/  128728 | consumed samples:        59824 | consumed tokens:    122519552 | elapsed time per iteration (s): 15.19 | learning rate: 1.960E-05 | global batch size:    16 | lm loss: 5.194335E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3740/  128728 | consumed samples:        59840 | consumed tokens:    122552320 | elapsed time per iteration (s): 15.23 | learning rate: 1.961E-05 | global batch size:    16 | lm loss: 5.511204E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3741/  128728 | consumed samples:        59856 | consumed tokens:    122585088 | elapsed time per iteration (s): 15.22 | learning rate: 1.961E-05 | global batch size:    16 | lm loss: 5.479012E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3742/  128728 | consumed samples:        59872 | consumed tokens:    122617856 | elapsed time per iteration (s): 15.22 | learning rate: 1.962E-05 | global batch size:    16 | lm loss: 5.472262E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3743/  128728 | consumed samples:        59888 | consumed tokens:    122650624 | elapsed time per iteration (s): 15.22 | learning rate: 1.962E-05 | global batch size:    16 | lm loss: 5.323508E+00 | grad norm: 0.973 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3744/  128728 | consumed samples:        59904 | consumed tokens:    122683392 | elapsed time per iteration (s): 15.24 | learning rate: 1.963E-05 | global batch size:    16 | lm loss: 5.851313E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3745/  128728 | consumed samples:        59920 | consumed tokens:    122716160 | elapsed time per iteration (s): 15.25 | learning rate: 1.963E-05 | global batch size:    16 | lm loss: 5.437391E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3746/  128728 | consumed samples:        59936 | consumed tokens:    122748928 | elapsed time per iteration (s): 15.20 | learning rate: 1.964E-05 | global batch size:    16 | lm loss: 5.474227E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3747/  128728 | consumed samples:        59952 | consumed tokens:    122781696 | elapsed time per iteration (s): 15.18 | learning rate: 1.965E-05 | global batch size:    16 | lm loss: 5.767534E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3748/  128728 | consumed samples:        59968 | consumed tokens:    122814464 | elapsed time per iteration (s): 15.23 | learning rate: 1.965E-05 | global batch size:    16 | lm loss: 5.451702E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3749/  128728 | consumed samples:        59984 | consumed tokens:    122847232 | elapsed time per iteration (s): 15.24 | learning rate: 1.966E-05 | global batch size:    16 | lm loss: 5.430769E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3750/  128728 | consumed samples:        60000 | consumed tokens:    122880000 | elapsed time per iteration (s): 15.23 | learning rate: 1.966E-05 | global batch size:    16 | lm loss: 5.410946E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3751/  128728 | consumed samples:        60016 | consumed tokens:    122912768 | elapsed time per iteration (s): 15.21 | learning rate: 1.967E-05 | global batch size:    16 | lm loss: 5.695611E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3752/  128728 | consumed samples:        60032 | consumed tokens:    122945536 | elapsed time per iteration (s): 15.21 | learning rate: 1.967E-05 | global batch size:    16 | lm loss: 5.481835E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3753/  128728 | consumed samples:        60048 | consumed tokens:    122978304 | elapsed time per iteration (s): 15.23 | learning rate: 1.968E-05 | global batch size:    16 | lm loss: 5.562807E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3754/  128728 | consumed samples:        60064 | consumed tokens:    123011072 | elapsed time per iteration (s): 15.25 | learning rate: 1.968E-05 | global batch size:    16 | lm loss: 5.459926E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3755/  128728 | consumed samples:        60080 | consumed tokens:    123043840 | elapsed time per iteration (s): 15.24 | learning rate: 1.969E-05 | global batch size:    16 | lm loss: 5.412202E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3756/  128728 | consumed samples:        60096 | consumed tokens:    123076608 | elapsed time per iteration (s): 15.19 | learning rate: 1.969E-05 | global batch size:    16 | lm loss: 5.587904E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3757/  128728 | consumed samples:        60112 | consumed tokens:    123109376 | elapsed time per iteration (s): 15.23 | learning rate: 1.970E-05 | global batch size:    16 | lm loss: 5.509131E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3758/  128728 | consumed samples:        60128 | consumed tokens:    123142144 | elapsed time per iteration (s): 15.23 | learning rate: 1.970E-05 | global batch size:    16 | lm loss: 5.465936E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3759/  128728 | consumed samples:        60144 | consumed tokens:    123174912 | elapsed time per iteration (s): 15.23 | learning rate: 1.971E-05 | global batch size:    16 | lm loss: 5.438951E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3760/  128728 | consumed samples:        60160 | consumed tokens:    123207680 | elapsed time per iteration (s): 15.24 | learning rate: 1.971E-05 | global batch size:    16 | lm loss: 5.498137E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3761/  128728 | consumed samples:        60176 | consumed tokens:    123240448 | elapsed time per iteration (s): 15.24 | learning rate: 1.972E-05 | global batch size:    16 | lm loss: 5.450524E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3762/  128728 | consumed samples:        60192 | consumed tokens:    123273216 | elapsed time per iteration (s): 15.24 | learning rate: 1.972E-05 | global batch size:    16 | lm loss: 5.642553E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3763/  128728 | consumed samples:        60208 | consumed tokens:    123305984 | elapsed time per iteration (s): 15.26 | learning rate: 1.973E-05 | global batch size:    16 | lm loss: 5.109872E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     3764/  128728 | consumed samples:        60224 | consumed tokens:    123338752 | elapsed time per iteration (s): 15.22 | learning rate: 1.973E-05 | global batch size:    16 | lm loss: 5.403108E+00 | grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3765/  128728 | consumed samples:        60240 | consumed tokens:    123371520 | elapsed time per iteration (s): 15.17 | learning rate: 1.974E-05 | global batch size:    16 | lm loss: 5.276863E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3766/  128728 | consumed samples:        60256 | consumed tokens:    123404288 | elapsed time per iteration (s): 15.18 | learning rate: 1.974E-05 | global batch size:    16 | lm loss: 5.535725E+00 | grad norm: 1.284 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3767/  128728 | consumed samples:        60272 | consumed tokens:    123437056 | elapsed time per iteration (s): 15.21 | learning rate: 1.975E-05 | global batch size:    16 | lm loss: 5.349348E+00 | grad norm: 0.883 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3768/  128728 | consumed samples:        60288 | consumed tokens:    123469824 | elapsed time per iteration (s): 15.18 | learning rate: 1.976E-05 | global batch size:    16 | lm loss: 5.511031E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3769/  128728 | consumed samples:        60304 | consumed tokens:    123502592 | elapsed time per iteration (s): 15.23 | learning rate: 1.976E-05 | global batch size:    16 | lm loss: 5.505275E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3770/  128728 | consumed samples:        60320 | consumed tokens:    123535360 | elapsed time per iteration (s): 15.22 | learning rate: 1.977E-05 | global batch size:    16 | lm loss: 5.564199E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3771/  128728 | consumed samples:        60336 | consumed tokens:    123568128 | elapsed time per iteration (s): 15.20 | learning rate: 1.977E-05 | global batch size:    16 | lm loss: 5.466618E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3772/  128728 | consumed samples:        60352 | consumed tokens:    123600896 | elapsed time per iteration (s): 15.25 | learning rate: 1.978E-05 | global batch size:    16 | lm loss: 5.439900E+00 | grad norm: 2.077 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3773/  128728 | consumed samples:        60368 | consumed tokens:    123633664 | elapsed time per iteration (s): 15.23 | learning rate: 1.978E-05 | global batch size:    16 | lm loss: 5.633871E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3774/  128728 | consumed samples:        60384 | consumed tokens:    123666432 | elapsed time per iteration (s): 15.23 | learning rate: 1.979E-05 | global batch size:    16 | lm loss: 5.420053E+00 | grad norm: 1.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3775/  128728 | consumed samples:        60400 | consumed tokens:    123699200 | elapsed time per iteration (s): 15.24 | learning rate: 1.979E-05 | global batch size:    16 | lm loss: 5.728425E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3776/  128728 | consumed samples:        60416 | consumed tokens:    123731968 | elapsed time per iteration (s): 15.23 | learning rate: 1.980E-05 | global batch size:    16 | lm loss: 5.415603E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3777/  128728 | consumed samples:        60432 | consumed tokens:    123764736 | elapsed time per iteration (s): 15.21 | learning rate: 1.980E-05 | global batch size:    16 | lm loss: 5.579483E+00 | grad norm: 1.002 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3778/  128728 | consumed samples:        60448 | consumed tokens:    123797504 | elapsed time per iteration (s): 15.25 | learning rate: 1.981E-05 | global batch size:    16 | lm loss: 5.535165E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3779/  128728 | consumed samples:        60464 | consumed tokens:    123830272 | elapsed time per iteration (s): 15.25 | learning rate: 1.981E-05 | global batch size:    16 | lm loss: 5.334061E+00 | grad norm: 0.883 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3780/  128728 | consumed samples:        60480 | consumed tokens:    123863040 | elapsed time per iteration (s): 15.21 | learning rate: 1.982E-05 | global batch size:    16 | lm loss: 5.297910E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3781/  128728 | consumed samples:        60496 | consumed tokens:    123895808 | elapsed time per iteration (s): 15.21 | learning rate: 1.982E-05 | global batch size:    16 | lm loss: 5.702374E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3782/  128728 | consumed samples:        60512 | consumed tokens:    123928576 | elapsed time per iteration (s): 15.20 | learning rate: 1.983E-05 | global batch size:    16 | lm loss: 5.441247E+00 | grad norm: 0.649 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3783/  128728 | consumed samples:        60528 | consumed tokens:    123961344 | elapsed time per iteration (s): 15.22 | learning rate: 1.983E-05 | global batch size:    16 | lm loss: 5.571175E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3784/  128728 | consumed samples:        60544 | consumed tokens:    123994112 | elapsed time per iteration (s): 15.22 | learning rate: 1.984E-05 | global batch size:    16 | lm loss: 5.425476E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3785/  128728 | consumed samples:        60560 | consumed tokens:    124026880 | elapsed time per iteration (s): 15.19 | learning rate: 1.984E-05 | global batch size:    16 | lm loss: 5.498517E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3786/  128728 | consumed samples:        60576 | consumed tokens:    124059648 | elapsed time per iteration (s): 15.23 | learning rate: 1.985E-05 | global batch size:    16 | lm loss: 5.395185E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3787/  128728 | consumed samples:        60592 | consumed tokens:    124092416 | elapsed time per iteration (s): 15.22 | learning rate: 1.985E-05 | global batch size:    16 | lm loss: 5.543329E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3788/  128728 | consumed samples:        60608 | consumed tokens:    124125184 | elapsed time per iteration (s): 15.21 | learning rate: 1.986E-05 | global batch size:    16 | lm loss: 5.629831E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3789/  128728 | consumed samples:        60624 | consumed tokens:    124157952 | elapsed time per iteration (s): 15.21 | learning rate: 1.987E-05 | global batch size:    16 | lm loss: 5.335208E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3790/  128728 | consumed samples:        60640 | consumed tokens:    124190720 | elapsed time per iteration (s): 15.20 | learning rate: 1.987E-05 | global batch size:    16 | lm loss: 5.583022E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3791/  128728 | consumed samples:        60656 | consumed tokens:    124223488 | elapsed time per iteration (s): 15.21 | learning rate: 1.988E-05 | global batch size:    16 | lm loss: 5.452947E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3792/  128728 | consumed samples:        60672 | consumed tokens:    124256256 | elapsed time per iteration (s): 15.25 | learning rate: 1.988E-05 | global batch size:    16 | lm loss: 5.283163E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3793/  128728 | consumed samples:        60688 | consumed tokens:    124289024 | elapsed time per iteration (s): 15.23 | learning rate: 1.989E-05 | global batch size:    16 | lm loss: 5.357467E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3794/  128728 | consumed samples:        60704 | consumed tokens:    124321792 | elapsed time per iteration (s): 15.21 | learning rate: 1.989E-05 | global batch size:    16 | lm loss: 5.540434E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3795/  128728 | consumed samples:        60720 | consumed tokens:    124354560 | elapsed time per iteration (s): 15.23 | learning rate: 1.990E-05 | global batch size:    16 | lm loss: 5.701981E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3796/  128728 | consumed samples:        60736 | consumed tokens:    124387328 | elapsed time per iteration (s): 15.22 | learning rate: 1.990E-05 | global batch size:    16 | lm loss: 5.400178E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3797/  128728 | consumed samples:        60752 | consumed tokens:    124420096 | elapsed time per iteration (s): 15.16 | learning rate: 1.991E-05 | global batch size:    16 | lm loss: 5.381162E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3798/  128728 | consumed samples:        60768 | consumed tokens:    124452864 | elapsed time per iteration (s): 15.21 | learning rate: 1.991E-05 | global batch size:    16 | lm loss: 5.394807E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3799/  128728 | consumed samples:        60784 | consumed tokens:    124485632 | elapsed time per iteration (s): 15.17 | learning rate: 1.992E-05 | global batch size:    16 | lm loss: 5.298486E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3800/  128728 | consumed samples:        60800 | consumed tokens:    124518400 | elapsed time per iteration (s): 15.24 | learning rate: 1.992E-05 | global batch size:    16 | lm loss: 5.496459E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3801/  128728 | consumed samples:        60816 | consumed tokens:    124551168 | elapsed time per iteration (s): 15.18 | learning rate: 1.993E-05 | global batch size:    16 | lm loss: 5.387410E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3802/  128728 | consumed samples:        60832 | consumed tokens:    124583936 | elapsed time per iteration (s): 15.23 | learning rate: 1.993E-05 | global batch size:    16 | lm loss: 5.404246E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3803/  128728 | consumed samples:        60848 | consumed tokens:    124616704 | elapsed time per iteration (s): 15.22 | learning rate: 1.994E-05 | global batch size:    16 | lm loss: 5.481224E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3804/  128728 | consumed samples:        60864 | consumed tokens:    124649472 | elapsed time per iteration (s): 15.27 | learning rate: 1.994E-05 | global batch size:    16 | lm loss: 5.301341E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     3805/  128728 | consumed samples:        60880 | consumed tokens:    124682240 | elapsed time per iteration (s): 15.24 | learning rate: 1.995E-05 | global batch size:    16 | lm loss: 5.260728E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3806/  128728 | consumed samples:        60896 | consumed tokens:    124715008 | elapsed time per iteration (s): 15.25 | learning rate: 1.995E-05 | global batch size:    16 | lm loss: 5.525875E+00 | grad norm: 1.345 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3807/  128728 | consumed samples:        60912 | consumed tokens:    124747776 | elapsed time per iteration (s): 15.19 | learning rate: 1.996E-05 | global batch size:    16 | lm loss: 5.592893E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3808/  128728 | consumed samples:        60928 | consumed tokens:    124780544 | elapsed time per iteration (s): 15.22 | learning rate: 1.996E-05 | global batch size:    16 | lm loss: 5.427948E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3809/  128728 | consumed samples:        60944 | consumed tokens:    124813312 | elapsed time per iteration (s): 15.15 | learning rate: 1.997E-05 | global batch size:    16 | lm loss: 5.401147E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3810/  128728 | consumed samples:        60960 | consumed tokens:    124846080 | elapsed time per iteration (s): 15.23 | learning rate: 1.998E-05 | global batch size:    16 | lm loss: 5.241078E+00 | grad norm: 1.217 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3811/  128728 | consumed samples:        60976 | consumed tokens:    124878848 | elapsed time per iteration (s): 15.17 | learning rate: 1.998E-05 | global batch size:    16 | lm loss: 5.158630E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3812/  128728 | consumed samples:        60992 | consumed tokens:    124911616 | elapsed time per iteration (s): 15.21 | learning rate: 1.999E-05 | global batch size:    16 | lm loss: 5.613994E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3813/  128728 | consumed samples:        61008 | consumed tokens:    124944384 | elapsed time per iteration (s): 15.22 | learning rate: 1.999E-05 | global batch size:    16 | lm loss: 5.171216E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3814/  128728 | consumed samples:        61024 | consumed tokens:    124977152 | elapsed time per iteration (s): 15.19 | learning rate: 2.000E-05 | global batch size:    16 | lm loss: 5.270428E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3815/  128728 | consumed samples:        61040 | consumed tokens:    125009920 | elapsed time per iteration (s): 15.25 | learning rate: 2.000E-05 | global batch size:    16 | lm loss: 5.501937E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     3816/  128728 | consumed samples:        61056 | consumed tokens:    125042688 | elapsed time per iteration (s): 15.23 | learning rate: 2.001E-05 | global batch size:    16 | lm loss: 5.503111E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3817/  128728 | consumed samples:        61072 | consumed tokens:    125075456 | elapsed time per iteration (s): 15.22 | learning rate: 2.001E-05 | global batch size:    16 | lm loss: 5.680742E+00 | grad norm: 0.965 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3818/  128728 | consumed samples:        61088 | consumed tokens:    125108224 | elapsed time per iteration (s): 15.22 | learning rate: 2.002E-05 | global batch size:    16 | lm loss: 5.501068E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3819/  128728 | consumed samples:        61104 | consumed tokens:    125140992 | elapsed time per iteration (s): 15.26 | learning rate: 2.002E-05 | global batch size:    16 | lm loss: 5.319207E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     3820/  128728 | consumed samples:        61120 | consumed tokens:    125173760 | elapsed time per iteration (s): 15.21 | learning rate: 2.003E-05 | global batch size:    16 | lm loss: 5.308980E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3821/  128728 | consumed samples:        61136 | consumed tokens:    125206528 | elapsed time per iteration (s): 15.20 | learning rate: 2.003E-05 | global batch size:    16 | lm loss: 5.577042E+00 | grad norm: 1.032 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3822/  128728 | consumed samples:        61152 | consumed tokens:    125239296 | elapsed time per iteration (s): 15.22 | learning rate: 2.004E-05 | global batch size:    16 | lm loss: 5.287234E+00 | grad norm: 1.092 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3823/  128728 | consumed samples:        61168 | consumed tokens:    125272064 | elapsed time per iteration (s): 15.17 | learning rate: 2.004E-05 | global batch size:    16 | lm loss: 5.414005E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3824/  128728 | consumed samples:        61184 | consumed tokens:    125304832 | elapsed time per iteration (s): 15.26 | learning rate: 2.005E-05 | global batch size:    16 | lm loss: 5.606541E+00 | grad norm: 1.333 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3825/  128728 | consumed samples:        61200 | consumed tokens:    125337600 | elapsed time per iteration (s): 15.23 | learning rate: 2.005E-05 | global batch size:    16 | lm loss: 5.391608E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3826/  128728 | consumed samples:        61216 | consumed tokens:    125370368 | elapsed time per iteration (s): 15.24 | learning rate: 2.006E-05 | global batch size:    16 | lm loss: 5.659523E+00 | grad norm: 2.248 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3827/  128728 | consumed samples:        61232 | consumed tokens:    125403136 | elapsed time per iteration (s): 15.22 | learning rate: 2.006E-05 | global batch size:    16 | lm loss: 5.057670E+00 | grad norm: 0.931 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3828/  128728 | consumed samples:        61248 | consumed tokens:    125435904 | elapsed time per iteration (s): 15.23 | learning rate: 2.007E-05 | global batch size:    16 | lm loss: 5.481532E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3829/  128728 | consumed samples:        61264 | consumed tokens:    125468672 | elapsed time per iteration (s): 15.19 | learning rate: 2.008E-05 | global batch size:    16 | lm loss: 5.234412E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3830/  128728 | consumed samples:        61280 | consumed tokens:    125501440 | elapsed time per iteration (s): 15.16 | learning rate: 2.008E-05 | global batch size:    16 | lm loss: 5.504411E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3831/  128728 | consumed samples:        61296 | consumed tokens:    125534208 | elapsed time per iteration (s): 15.15 | learning rate: 2.009E-05 | global batch size:    16 | lm loss: 5.468637E+00 | grad norm: 0.931 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3832/  128728 | consumed samples:        61312 | consumed tokens:    125566976 | elapsed time per iteration (s): 15.23 | learning rate: 2.009E-05 | global batch size:    16 | lm loss: 5.480287E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3833/  128728 | consumed samples:        61328 | consumed tokens:    125599744 | elapsed time per iteration (s): 15.18 | learning rate: 2.010E-05 | global batch size:    16 | lm loss: 5.492439E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3834/  128728 | consumed samples:        61344 | consumed tokens:    125632512 | elapsed time per iteration (s): 15.23 | learning rate: 2.010E-05 | global batch size:    16 | lm loss: 5.287287E+00 | grad norm: 0.970 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3835/  128728 | consumed samples:        61360 | consumed tokens:    125665280 | elapsed time per iteration (s): 15.22 | learning rate: 2.011E-05 | global batch size:    16 | lm loss: 5.399631E+00 | grad norm: 1.285 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3836/  128728 | consumed samples:        61376 | consumed tokens:    125698048 | elapsed time per iteration (s): 15.20 | learning rate: 2.011E-05 | global batch size:    16 | lm loss: 5.347549E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3837/  128728 | consumed samples:        61392 | consumed tokens:    125730816 | elapsed time per iteration (s): 15.18 | learning rate: 2.012E-05 | global batch size:    16 | lm loss: 5.494516E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3838/  128728 | consumed samples:        61408 | consumed tokens:    125763584 | elapsed time per iteration (s): 15.27 | learning rate: 2.012E-05 | global batch size:    16 | lm loss: 5.462282E+00 | grad norm: 1.199 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     3839/  128728 | consumed samples:        61424 | consumed tokens:    125796352 | elapsed time per iteration (s): 15.24 | learning rate: 2.013E-05 | global batch size:    16 | lm loss: 5.329695E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3840/  128728 | consumed samples:        61440 | consumed tokens:    125829120 | elapsed time per iteration (s): 15.26 | learning rate: 2.013E-05 | global batch size:    16 | lm loss: 5.455020E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3841/  128728 | consumed samples:        61456 | consumed tokens:    125861888 | elapsed time per iteration (s): 15.25 | learning rate: 2.014E-05 | global batch size:    16 | lm loss: 5.388807E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     3842/  128728 | consumed samples:        61472 | consumed tokens:    125894656 | elapsed time per iteration (s): 15.19 | learning rate: 2.014E-05 | global batch size:    16 | lm loss: 5.453071E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3843/  128728 | consumed samples:        61488 | consumed tokens:    125927424 | elapsed time per iteration (s): 15.20 | learning rate: 2.015E-05 | global batch size:    16 | lm loss: 5.550716E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3844/  128728 | consumed samples:        61504 | consumed tokens:    125960192 | elapsed time per iteration (s): 15.20 | learning rate: 2.015E-05 | global batch size:    16 | lm loss: 5.434635E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3845/  128728 | consumed samples:        61520 | consumed tokens:    125992960 | elapsed time per iteration (s): 15.15 | learning rate: 2.016E-05 | global batch size:    16 | lm loss: 5.393171E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3846/  128728 | consumed samples:        61536 | consumed tokens:    126025728 | elapsed time per iteration (s): 15.16 | learning rate: 2.016E-05 | global batch size:    16 | lm loss: 5.437396E+00 | grad norm: 1.099 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3847/  128728 | consumed samples:        61552 | consumed tokens:    126058496 | elapsed time per iteration (s): 15.20 | learning rate: 2.017E-05 | global batch size:    16 | lm loss: 5.379783E+00 | grad norm: 1.075 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3848/  128728 | consumed samples:        61568 | consumed tokens:    126091264 | elapsed time per iteration (s): 15.22 | learning rate: 2.017E-05 | global batch size:    16 | lm loss: 5.551754E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3849/  128728 | consumed samples:        61584 | consumed tokens:    126124032 | elapsed time per iteration (s): 15.22 | learning rate: 2.018E-05 | global batch size:    16 | lm loss: 5.263428E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3850/  128728 | consumed samples:        61600 | consumed tokens:    126156800 | elapsed time per iteration (s): 15.22 | learning rate: 2.019E-05 | global batch size:    16 | lm loss: 5.389133E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3851/  128728 | consumed samples:        61616 | consumed tokens:    126189568 | elapsed time per iteration (s): 15.20 | learning rate: 2.019E-05 | global batch size:    16 | lm loss: 5.425191E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3852/  128728 | consumed samples:        61632 | consumed tokens:    126222336 | elapsed time per iteration (s): 15.22 | learning rate: 2.020E-05 | global batch size:    16 | lm loss: 5.259414E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3853/  128728 | consumed samples:        61648 | consumed tokens:    126255104 | elapsed time per iteration (s): 15.20 | learning rate: 2.020E-05 | global batch size:    16 | lm loss: 5.419950E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3854/  128728 | consumed samples:        61664 | consumed tokens:    126287872 | elapsed time per iteration (s): 15.18 | learning rate: 2.021E-05 | global batch size:    16 | lm loss: 5.455901E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3855/  128728 | consumed samples:        61680 | consumed tokens:    126320640 | elapsed time per iteration (s): 15.22 | learning rate: 2.021E-05 | global batch size:    16 | lm loss: 5.723430E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3856/  128728 | consumed samples:        61696 | consumed tokens:    126353408 | elapsed time per iteration (s): 15.14 | learning rate: 2.022E-05 | global batch size:    16 | lm loss: 5.380040E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3857/  128728 | consumed samples:        61712 | consumed tokens:    126386176 | elapsed time per iteration (s): 15.18 | learning rate: 2.022E-05 | global batch size:    16 | lm loss: 5.547056E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3858/  128728 | consumed samples:        61728 | consumed tokens:    126418944 | elapsed time per iteration (s): 15.17 | learning rate: 2.023E-05 | global batch size:    16 | lm loss: 5.517189E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3859/  128728 | consumed samples:        61744 | consumed tokens:    126451712 | elapsed time per iteration (s): 15.19 | learning rate: 2.023E-05 | global batch size:    16 | lm loss: 5.323791E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3860/  128728 | consumed samples:        61760 | consumed tokens:    126484480 | elapsed time per iteration (s): 15.19 | learning rate: 2.024E-05 | global batch size:    16 | lm loss: 5.446847E+00 | grad norm: 1.169 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3861/  128728 | consumed samples:        61776 | consumed tokens:    126517248 | elapsed time per iteration (s): 15.13 | learning rate: 2.024E-05 | global batch size:    16 | lm loss: 5.215536E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3862/  128728 | consumed samples:        61792 | consumed tokens:    126550016 | elapsed time per iteration (s): 15.16 | learning rate: 2.025E-05 | global batch size:    16 | lm loss: 5.761042E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3863/  128728 | consumed samples:        61808 | consumed tokens:    126582784 | elapsed time per iteration (s): 15.22 | learning rate: 2.025E-05 | global batch size:    16 | lm loss: 5.237271E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3864/  128728 | consumed samples:        61824 | consumed tokens:    126615552 | elapsed time per iteration (s): 15.23 | learning rate: 2.026E-05 | global batch size:    16 | lm loss: 5.645336E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3865/  128728 | consumed samples:        61840 | consumed tokens:    126648320 | elapsed time per iteration (s): 15.25 | learning rate: 2.026E-05 | global batch size:    16 | lm loss: 5.387892E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3866/  128728 | consumed samples:        61856 | consumed tokens:    126681088 | elapsed time per iteration (s): 15.24 | learning rate: 2.027E-05 | global batch size:    16 | lm loss: 5.583735E+00 | grad norm: 1.042 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3867/  128728 | consumed samples:        61872 | consumed tokens:    126713856 | elapsed time per iteration (s): 15.21 | learning rate: 2.027E-05 | global batch size:    16 | lm loss: 5.244458E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3868/  128728 | consumed samples:        61888 | consumed tokens:    126746624 | elapsed time per iteration (s): 15.24 | learning rate: 2.028E-05 | global batch size:    16 | lm loss: 5.565816E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3869/  128728 | consumed samples:        61904 | consumed tokens:    126779392 | elapsed time per iteration (s): 15.19 | learning rate: 2.028E-05 | global batch size:    16 | lm loss: 5.393667E+00 | grad norm: 1.035 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3870/  128728 | consumed samples:        61920 | consumed tokens:    126812160 | elapsed time per iteration (s): 15.17 | learning rate: 2.029E-05 | global batch size:    16 | lm loss: 5.407505E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3871/  128728 | consumed samples:        61936 | consumed tokens:    126844928 | elapsed time per iteration (s): 15.22 | learning rate: 2.030E-05 | global batch size:    16 | lm loss: 5.168123E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3872/  128728 | consumed samples:        61952 | consumed tokens:    126877696 | elapsed time per iteration (s): 15.21 | learning rate: 2.030E-05 | global batch size:    16 | lm loss: 5.608961E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3873/  128728 | consumed samples:        61968 | consumed tokens:    126910464 | elapsed time per iteration (s): 15.17 | learning rate: 2.031E-05 | global batch size:    16 | lm loss: 5.526161E+00 | grad norm: 1.332 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3874/  128728 | consumed samples:        61984 | consumed tokens:    126943232 | elapsed time per iteration (s): 15.21 | learning rate: 2.031E-05 | global batch size:    16 | lm loss: 5.512238E+00 | grad norm: 2.111 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3875/  128728 | consumed samples:        62000 | consumed tokens:    126976000 | elapsed time per iteration (s): 15.24 | learning rate: 2.032E-05 | global batch size:    16 | lm loss: 5.310292E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3876/  128728 | consumed samples:        62016 | consumed tokens:    127008768 | elapsed time per iteration (s): 15.22 | learning rate: 2.032E-05 | global batch size:    16 | lm loss: 5.546309E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3877/  128728 | consumed samples:        62032 | consumed tokens:    127041536 | elapsed time per iteration (s): 15.24 | learning rate: 2.033E-05 | global batch size:    16 | lm loss: 5.386329E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3878/  128728 | consumed samples:        62048 | consumed tokens:    127074304 | elapsed time per iteration (s): 15.21 | learning rate: 2.033E-05 | global batch size:    16 | lm loss: 5.407649E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3879/  128728 | consumed samples:        62064 | consumed tokens:    127107072 | elapsed time per iteration (s): 15.23 | learning rate: 2.034E-05 | global batch size:    16 | lm loss: 5.325084E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3880/  128728 | consumed samples:        62080 | consumed tokens:    127139840 | elapsed time per iteration (s): 15.25 | learning rate: 2.034E-05 | global batch size:    16 | lm loss: 5.383338E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3881/  128728 | consumed samples:        62096 | consumed tokens:    127172608 | elapsed time per iteration (s): 15.19 | learning rate: 2.035E-05 | global batch size:    16 | lm loss: 5.435583E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3882/  128728 | consumed samples:        62112 | consumed tokens:    127205376 | elapsed time per iteration (s): 15.22 | learning rate: 2.035E-05 | global batch size:    16 | lm loss: 5.391198E+00 | grad norm: 1.092 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3883/  128728 | consumed samples:        62128 | consumed tokens:    127238144 | elapsed time per iteration (s): 15.24 | learning rate: 2.036E-05 | global batch size:    16 | lm loss: 5.385926E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3884/  128728 | consumed samples:        62144 | consumed tokens:    127270912 | elapsed time per iteration (s): 15.22 | learning rate: 2.036E-05 | global batch size:    16 | lm loss: 5.435524E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3885/  128728 | consumed samples:        62160 | consumed tokens:    127303680 | elapsed time per iteration (s): 15.19 | learning rate: 2.037E-05 | global batch size:    16 | lm loss: 5.325030E+00 | grad norm: 1.327 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3886/  128728 | consumed samples:        62176 | consumed tokens:    127336448 | elapsed time per iteration (s): 15.21 | learning rate: 2.037E-05 | global batch size:    16 | lm loss: 5.474463E+00 | grad norm: 0.860 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3887/  128728 | consumed samples:        62192 | consumed tokens:    127369216 | elapsed time per iteration (s): 15.21 | learning rate: 2.038E-05 | global batch size:    16 | lm loss: 5.445851E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3888/  128728 | consumed samples:        62208 | consumed tokens:    127401984 | elapsed time per iteration (s): 15.22 | learning rate: 2.038E-05 | global batch size:    16 | lm loss: 5.609439E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3889/  128728 | consumed samples:        62224 | consumed tokens:    127434752 | elapsed time per iteration (s): 15.23 | learning rate: 2.039E-05 | global batch size:    16 | lm loss: 5.400331E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3890/  128728 | consumed samples:        62240 | consumed tokens:    127467520 | elapsed time per iteration (s): 15.23 | learning rate: 2.039E-05 | global batch size:    16 | lm loss: 5.481973E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3891/  128728 | consumed samples:        62256 | consumed tokens:    127500288 | elapsed time per iteration (s): 15.18 | learning rate: 2.040E-05 | global batch size:    16 | lm loss: 5.350195E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3892/  128728 | consumed samples:        62272 | consumed tokens:    127533056 | elapsed time per iteration (s): 15.27 | learning rate: 2.041E-05 | global batch size:    16 | lm loss: 5.500158E+00 | grad norm: 2.969 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     3893/  128728 | consumed samples:        62288 | consumed tokens:    127565824 | elapsed time per iteration (s): 15.24 | learning rate: 2.041E-05 | global batch size:    16 | lm loss: 5.532178E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3894/  128728 | consumed samples:        62304 | consumed tokens:    127598592 | elapsed time per iteration (s): 15.25 | learning rate: 2.042E-05 | global batch size:    16 | lm loss: 5.326356E+00 | grad norm: 1.144 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3895/  128728 | consumed samples:        62320 | consumed tokens:    127631360 | elapsed time per iteration (s): 15.22 | learning rate: 2.042E-05 | global batch size:    16 | lm loss: 5.424766E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3896/  128728 | consumed samples:        62336 | consumed tokens:    127664128 | elapsed time per iteration (s): 15.22 | learning rate: 2.043E-05 | global batch size:    16 | lm loss: 5.275890E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3897/  128728 | consumed samples:        62352 | consumed tokens:    127696896 | elapsed time per iteration (s): 15.23 | learning rate: 2.043E-05 | global batch size:    16 | lm loss: 5.232322E+00 | grad norm: 1.560 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3898/  128728 | consumed samples:        62368 | consumed tokens:    127729664 | elapsed time per iteration (s): 15.19 | learning rate: 2.044E-05 | global batch size:    16 | lm loss: 5.657388E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3899/  128728 | consumed samples:        62384 | consumed tokens:    127762432 | elapsed time per iteration (s): 15.20 | learning rate: 2.044E-05 | global batch size:    16 | lm loss: 5.394963E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3900/  128728 | consumed samples:        62400 | consumed tokens:    127795200 | elapsed time per iteration (s): 15.25 | learning rate: 2.045E-05 | global batch size:    16 | lm loss: 5.370610E+00 | grad norm: 1.172 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3901/  128728 | consumed samples:        62416 | consumed tokens:    127827968 | elapsed time per iteration (s): 15.24 | learning rate: 2.045E-05 | global batch size:    16 | lm loss: 5.365441E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3902/  128728 | consumed samples:        62432 | consumed tokens:    127860736 | elapsed time per iteration (s): 15.24 | learning rate: 2.046E-05 | global batch size:    16 | lm loss: 5.406076E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3903/  128728 | consumed samples:        62448 | consumed tokens:    127893504 | elapsed time per iteration (s): 15.25 | learning rate: 2.046E-05 | global batch size:    16 | lm loss: 5.409226E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3904/  128728 | consumed samples:        62464 | consumed tokens:    127926272 | elapsed time per iteration (s): 15.30 | learning rate: 2.047E-05 | global batch size:    16 | lm loss: 5.347217E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.046 | TFLOPs: 8.01 |
[default7]: iteration     3905/  128728 | consumed samples:        62480 | consumed tokens:    127959040 | elapsed time per iteration (s): 15.21 | learning rate: 2.047E-05 | global batch size:    16 | lm loss: 5.564732E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3906/  128728 | consumed samples:        62496 | consumed tokens:    127991808 | elapsed time per iteration (s): 15.21 | learning rate: 2.048E-05 | global batch size:    16 | lm loss: 5.865915E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3907/  128728 | consumed samples:        62512 | consumed tokens:    128024576 | elapsed time per iteration (s): 15.19 | learning rate: 2.048E-05 | global batch size:    16 | lm loss: 4.977544E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3908/  128728 | consumed samples:        62528 | consumed tokens:    128057344 | elapsed time per iteration (s): 15.19 | learning rate: 2.049E-05 | global batch size:    16 | lm loss: 5.338539E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3909/  128728 | consumed samples:        62544 | consumed tokens:    128090112 | elapsed time per iteration (s): 15.25 | learning rate: 2.049E-05 | global batch size:    16 | lm loss: 5.439590E+00 | grad norm: 1.099 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3910/  128728 | consumed samples:        62560 | consumed tokens:    128122880 | elapsed time per iteration (s): 15.26 | learning rate: 2.050E-05 | global batch size:    16 | lm loss: 5.531397E+00 | grad norm: 1.197 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3911/  128728 | consumed samples:        62576 | consumed tokens:    128155648 | elapsed time per iteration (s): 15.25 | learning rate: 2.050E-05 | global batch size:    16 | lm loss: 5.485893E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3912/  128728 | consumed samples:        62592 | consumed tokens:    128188416 | elapsed time per iteration (s): 15.23 | learning rate: 2.051E-05 | global batch size:    16 | lm loss: 5.491755E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3913/  128728 | consumed samples:        62608 | consumed tokens:    128221184 | elapsed time per iteration (s): 15.23 | learning rate: 2.052E-05 | global batch size:    16 | lm loss: 5.505841E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3914/  128728 | consumed samples:        62624 | consumed tokens:    128253952 | elapsed time per iteration (s): 15.24 | learning rate: 2.052E-05 | global batch size:    16 | lm loss: 5.293841E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3915/  128728 | consumed samples:        62640 | consumed tokens:    128286720 | elapsed time per iteration (s): 15.23 | learning rate: 2.053E-05 | global batch size:    16 | lm loss: 5.651334E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3916/  128728 | consumed samples:        62656 | consumed tokens:    128319488 | elapsed time per iteration (s): 15.24 | learning rate: 2.053E-05 | global batch size:    16 | lm loss: 5.581880E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3917/  128728 | consumed samples:        62672 | consumed tokens:    128352256 | elapsed time per iteration (s): 15.22 | learning rate: 2.054E-05 | global batch size:    16 | lm loss: 5.295708E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3918/  128728 | consumed samples:        62688 | consumed tokens:    128385024 | elapsed time per iteration (s): 15.21 | learning rate: 2.054E-05 | global batch size:    16 | lm loss: 5.468671E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3919/  128728 | consumed samples:        62704 | consumed tokens:    128417792 | elapsed time per iteration (s): 15.23 | learning rate: 2.055E-05 | global batch size:    16 | lm loss: 5.449612E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3920/  128728 | consumed samples:        62720 | consumed tokens:    128450560 | elapsed time per iteration (s): 15.18 | learning rate: 2.055E-05 | global batch size:    16 | lm loss: 5.470665E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3921/  128728 | consumed samples:        62736 | consumed tokens:    128483328 | elapsed time per iteration (s): 15.17 | learning rate: 2.056E-05 | global batch size:    16 | lm loss: 5.540703E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3922/  128728 | consumed samples:        62752 | consumed tokens:    128516096 | elapsed time per iteration (s): 15.17 | learning rate: 2.056E-05 | global batch size:    16 | lm loss: 5.231455E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3923/  128728 | consumed samples:        62768 | consumed tokens:    128548864 | elapsed time per iteration (s): 15.21 | learning rate: 2.057E-05 | global batch size:    16 | lm loss: 5.513610E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3924/  128728 | consumed samples:        62784 | consumed tokens:    128581632 | elapsed time per iteration (s): 15.19 | learning rate: 2.057E-05 | global batch size:    16 | lm loss: 5.542394E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3925/  128728 | consumed samples:        62800 | consumed tokens:    128614400 | elapsed time per iteration (s): 15.21 | learning rate: 2.058E-05 | global batch size:    16 | lm loss: 5.609309E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3926/  128728 | consumed samples:        62816 | consumed tokens:    128647168 | elapsed time per iteration (s): 15.19 | learning rate: 2.058E-05 | global batch size:    16 | lm loss: 5.394788E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3927/  128728 | consumed samples:        62832 | consumed tokens:    128679936 | elapsed time per iteration (s): 15.23 | learning rate: 2.059E-05 | global batch size:    16 | lm loss: 5.177278E+00 | grad norm: 0.852 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3928/  128728 | consumed samples:        62848 | consumed tokens:    128712704 | elapsed time per iteration (s): 15.19 | learning rate: 2.059E-05 | global batch size:    16 | lm loss: 5.202007E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3929/  128728 | consumed samples:        62864 | consumed tokens:    128745472 | elapsed time per iteration (s): 15.19 | learning rate: 2.060E-05 | global batch size:    16 | lm loss: 5.402996E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     3930/  128728 | consumed samples:        62880 | consumed tokens:    128778240 | elapsed time per iteration (s): 15.17 | learning rate: 2.060E-05 | global batch size:    16 | lm loss: 5.260327E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3931/  128728 | consumed samples:        62896 | consumed tokens:    128811008 | elapsed time per iteration (s): 15.22 | learning rate: 2.061E-05 | global batch size:    16 | lm loss: 5.570568E+00 | grad norm: 3.273 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3932/  128728 | consumed samples:        62912 | consumed tokens:    128843776 | elapsed time per iteration (s): 15.21 | learning rate: 2.062E-05 | global batch size:    16 | lm loss: 5.326445E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3933/  128728 | consumed samples:        62928 | consumed tokens:    128876544 | elapsed time per iteration (s): 15.21 | learning rate: 2.062E-05 | global batch size:    16 | lm loss: 5.454953E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3934/  128728 | consumed samples:        62944 | consumed tokens:    128909312 | elapsed time per iteration (s): 15.22 | learning rate: 2.063E-05 | global batch size:    16 | lm loss: 5.214005E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3935/  128728 | consumed samples:        62960 | consumed tokens:    128942080 | elapsed time per iteration (s): 15.21 | learning rate: 2.063E-05 | global batch size:    16 | lm loss: 5.422129E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3936/  128728 | consumed samples:        62976 | consumed tokens:    128974848 | elapsed time per iteration (s): 15.23 | learning rate: 2.064E-05 | global batch size:    16 | lm loss: 5.433270E+00 | grad norm: 0.944 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3937/  128728 | consumed samples:        62992 | consumed tokens:    129007616 | elapsed time per iteration (s): 15.20 | learning rate: 2.064E-05 | global batch size:    16 | lm loss: 5.334735E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3938/  128728 | consumed samples:        63008 | consumed tokens:    129040384 | elapsed time per iteration (s): 15.19 | learning rate: 2.065E-05 | global batch size:    16 | lm loss: 5.458637E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3939/  128728 | consumed samples:        63024 | consumed tokens:    129073152 | elapsed time per iteration (s): 15.25 | learning rate: 2.065E-05 | global batch size:    16 | lm loss: 5.288385E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3940/  128728 | consumed samples:        63040 | consumed tokens:    129105920 | elapsed time per iteration (s): 15.21 | learning rate: 2.066E-05 | global batch size:    16 | lm loss: 5.444874E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3941/  128728 | consumed samples:        63056 | consumed tokens:    129138688 | elapsed time per iteration (s): 15.21 | learning rate: 2.066E-05 | global batch size:    16 | lm loss: 5.580392E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3942/  128728 | consumed samples:        63072 | consumed tokens:    129171456 | elapsed time per iteration (s): 15.18 | learning rate: 2.067E-05 | global batch size:    16 | lm loss: 5.633109E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3943/  128728 | consumed samples:        63088 | consumed tokens:    129204224 | elapsed time per iteration (s): 15.21 | learning rate: 2.067E-05 | global batch size:    16 | lm loss: 5.486689E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3944/  128728 | consumed samples:        63104 | consumed tokens:    129236992 | elapsed time per iteration (s): 15.21 | learning rate: 2.068E-05 | global batch size:    16 | lm loss: 5.653194E+00 | grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3945/  128728 | consumed samples:        63120 | consumed tokens:    129269760 | elapsed time per iteration (s): 15.14 | learning rate: 2.068E-05 | global batch size:    16 | lm loss: 5.570617E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3946/  128728 | consumed samples:        63136 | consumed tokens:    129302528 | elapsed time per iteration (s): 15.23 | learning rate: 2.069E-05 | global batch size:    16 | lm loss: 5.431407E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3947/  128728 | consumed samples:        63152 | consumed tokens:    129335296 | elapsed time per iteration (s): 15.23 | learning rate: 2.069E-05 | global batch size:    16 | lm loss: 5.536205E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3948/  128728 | consumed samples:        63168 | consumed tokens:    129368064 | elapsed time per iteration (s): 15.14 | learning rate: 2.070E-05 | global batch size:    16 | lm loss: 5.436441E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3949/  128728 | consumed samples:        63184 | consumed tokens:    129400832 | elapsed time per iteration (s): 15.24 | learning rate: 2.070E-05 | global batch size:    16 | lm loss: 5.337091E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3950/  128728 | consumed samples:        63200 | consumed tokens:    129433600 | elapsed time per iteration (s): 15.23 | learning rate: 2.071E-05 | global batch size:    16 | lm loss: 5.656445E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3951/  128728 | consumed samples:        63216 | consumed tokens:    129466368 | elapsed time per iteration (s): 15.23 | learning rate: 2.071E-05 | global batch size:    16 | lm loss: 5.297698E+00 | grad norm: 0.972 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3952/  128728 | consumed samples:        63232 | consumed tokens:    129499136 | elapsed time per iteration (s): 15.18 | learning rate: 2.072E-05 | global batch size:    16 | lm loss: 5.657709E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3953/  128728 | consumed samples:        63248 | consumed tokens:    129531904 | elapsed time per iteration (s): 15.19 | learning rate: 2.073E-05 | global batch size:    16 | lm loss: 5.463843E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3954/  128728 | consumed samples:        63264 | consumed tokens:    129564672 | elapsed time per iteration (s): 15.20 | learning rate: 2.073E-05 | global batch size:    16 | lm loss: 5.530795E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3955/  128728 | consumed samples:        63280 | consumed tokens:    129597440 | elapsed time per iteration (s): 15.20 | learning rate: 2.074E-05 | global batch size:    16 | lm loss: 5.277174E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3956/  128728 | consumed samples:        63296 | consumed tokens:    129630208 | elapsed time per iteration (s): 15.19 | learning rate: 2.074E-05 | global batch size:    16 | lm loss: 5.323586E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3957/  128728 | consumed samples:        63312 | consumed tokens:    129662976 | elapsed time per iteration (s): 15.22 | learning rate: 2.075E-05 | global batch size:    16 | lm loss: 5.472128E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3958/  128728 | consumed samples:        63328 | consumed tokens:    129695744 | elapsed time per iteration (s): 15.20 | learning rate: 2.075E-05 | global batch size:    16 | lm loss: 5.385518E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3959/  128728 | consumed samples:        63344 | consumed tokens:    129728512 | elapsed time per iteration (s): 15.23 | learning rate: 2.076E-05 | global batch size:    16 | lm loss: 5.426952E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3960/  128728 | consumed samples:        63360 | consumed tokens:    129761280 | elapsed time per iteration (s): 15.22 | learning rate: 2.076E-05 | global batch size:    16 | lm loss: 5.452140E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3961/  128728 | consumed samples:        63376 | consumed tokens:    129794048 | elapsed time per iteration (s): 15.23 | learning rate: 2.077E-05 | global batch size:    16 | lm loss: 5.372558E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3962/  128728 | consumed samples:        63392 | consumed tokens:    129826816 | elapsed time per iteration (s): 15.25 | learning rate: 2.077E-05 | global batch size:    16 | lm loss: 5.433863E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3963/  128728 | consumed samples:        63408 | consumed tokens:    129859584 | elapsed time per iteration (s): 15.20 | learning rate: 2.078E-05 | global batch size:    16 | lm loss: 5.048560E+00 | grad norm: 0.987 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3964/  128728 | consumed samples:        63424 | consumed tokens:    129892352 | elapsed time per iteration (s): 15.14 | learning rate: 2.078E-05 | global batch size:    16 | lm loss: 5.615795E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     3965/  128728 | consumed samples:        63440 | consumed tokens:    129925120 | elapsed time per iteration (s): 15.22 | learning rate: 2.079E-05 | global batch size:    16 | lm loss: 5.405645E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3966/  128728 | consumed samples:        63456 | consumed tokens:    129957888 | elapsed time per iteration (s): 15.20 | learning rate: 2.079E-05 | global batch size:    16 | lm loss: 5.304680E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3967/  128728 | consumed samples:        63472 | consumed tokens:    129990656 | elapsed time per iteration (s): 15.22 | learning rate: 2.080E-05 | global batch size:    16 | lm loss: 5.625960E+00 | grad norm: 1.147 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3968/  128728 | consumed samples:        63488 | consumed tokens:    130023424 | elapsed time per iteration (s): 15.21 | learning rate: 2.080E-05 | global batch size:    16 | lm loss: 5.581823E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3969/  128728 | consumed samples:        63504 | consumed tokens:    130056192 | elapsed time per iteration (s): 15.18 | learning rate: 2.081E-05 | global batch size:    16 | lm loss: 5.444682E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3970/  128728 | consumed samples:        63520 | consumed tokens:    130088960 | elapsed time per iteration (s): 15.21 | learning rate: 2.081E-05 | global batch size:    16 | lm loss: 5.335429E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3971/  128728 | consumed samples:        63536 | consumed tokens:    130121728 | elapsed time per iteration (s): 15.19 | learning rate: 2.082E-05 | global batch size:    16 | lm loss: 5.558789E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3972/  128728 | consumed samples:        63552 | consumed tokens:    130154496 | elapsed time per iteration (s): 15.16 | learning rate: 2.082E-05 | global batch size:    16 | lm loss: 5.333210E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     3973/  128728 | consumed samples:        63568 | consumed tokens:    130187264 | elapsed time per iteration (s): 15.20 | learning rate: 2.083E-05 | global batch size:    16 | lm loss: 5.441347E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3974/  128728 | consumed samples:        63584 | consumed tokens:    130220032 | elapsed time per iteration (s): 15.15 | learning rate: 2.084E-05 | global batch size:    16 | lm loss: 5.388178E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     3975/  128728 | consumed samples:        63600 | consumed tokens:    130252800 | elapsed time per iteration (s): 15.18 | learning rate: 2.084E-05 | global batch size:    16 | lm loss: 5.478914E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3976/  128728 | consumed samples:        63616 | consumed tokens:    130285568 | elapsed time per iteration (s): 15.22 | learning rate: 2.085E-05 | global batch size:    16 | lm loss: 5.390545E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3977/  128728 | consumed samples:        63632 | consumed tokens:    130318336 | elapsed time per iteration (s): 15.22 | learning rate: 2.085E-05 | global batch size:    16 | lm loss: 5.489986E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3978/  128728 | consumed samples:        63648 | consumed tokens:    130351104 | elapsed time per iteration (s): 15.20 | learning rate: 2.086E-05 | global batch size:    16 | lm loss: 5.220353E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3979/  128728 | consumed samples:        63664 | consumed tokens:    130383872 | elapsed time per iteration (s): 15.22 | learning rate: 2.086E-05 | global batch size:    16 | lm loss: 5.544164E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3980/  128728 | consumed samples:        63680 | consumed tokens:    130416640 | elapsed time per iteration (s): 15.21 | learning rate: 2.087E-05 | global batch size:    16 | lm loss: 5.339544E+00 | grad norm: 1.046 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3981/  128728 | consumed samples:        63696 | consumed tokens:    130449408 | elapsed time per iteration (s): 15.24 | learning rate: 2.087E-05 | global batch size:    16 | lm loss: 5.444860E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3982/  128728 | consumed samples:        63712 | consumed tokens:    130482176 | elapsed time per iteration (s): 15.20 | learning rate: 2.088E-05 | global batch size:    16 | lm loss: 5.265263E+00 | grad norm: 1.516 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3983/  128728 | consumed samples:        63728 | consumed tokens:    130514944 | elapsed time per iteration (s): 15.25 | learning rate: 2.088E-05 | global batch size:    16 | lm loss: 5.278424E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3984/  128728 | consumed samples:        63744 | consumed tokens:    130547712 | elapsed time per iteration (s): 15.21 | learning rate: 2.089E-05 | global batch size:    16 | lm loss: 5.488007E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3985/  128728 | consumed samples:        63760 | consumed tokens:    130580480 | elapsed time per iteration (s): 15.23 | learning rate: 2.089E-05 | global batch size:    16 | lm loss: 5.426978E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     3986/  128728 | consumed samples:        63776 | consumed tokens:    130613248 | elapsed time per iteration (s): 15.25 | learning rate: 2.090E-05 | global batch size:    16 | lm loss: 5.588451E+00 | grad norm: 1.938 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3987/  128728 | consumed samples:        63792 | consumed tokens:    130646016 | elapsed time per iteration (s): 15.26 | learning rate: 2.090E-05 | global batch size:    16 | lm loss: 5.366596E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     3988/  128728 | consumed samples:        63808 | consumed tokens:    130678784 | elapsed time per iteration (s): 15.17 | learning rate: 2.091E-05 | global batch size:    16 | lm loss: 5.419157E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     3989/  128728 | consumed samples:        63824 | consumed tokens:    130711552 | elapsed time per iteration (s): 15.20 | learning rate: 2.091E-05 | global batch size:    16 | lm loss: 5.551931E+00 | grad norm: 1.175 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3990/  128728 | consumed samples:        63840 | consumed tokens:    130744320 | elapsed time per iteration (s): 15.24 | learning rate: 2.092E-05 | global batch size:    16 | lm loss: 5.222095E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3991/  128728 | consumed samples:        63856 | consumed tokens:    130777088 | elapsed time per iteration (s): 15.21 | learning rate: 2.092E-05 | global batch size:    16 | lm loss: 5.331915E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     3992/  128728 | consumed samples:        63872 | consumed tokens:    130809856 | elapsed time per iteration (s): 15.17 | learning rate: 2.093E-05 | global batch size:    16 | lm loss: 5.306742E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     3993/  128728 | consumed samples:        63888 | consumed tokens:    130842624 | elapsed time per iteration (s): 15.23 | learning rate: 2.093E-05 | global batch size:    16 | lm loss: 5.580595E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     3994/  128728 | consumed samples:        63904 | consumed tokens:    130875392 | elapsed time per iteration (s): 15.20 | learning rate: 2.094E-05 | global batch size:    16 | lm loss: 5.409997E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3995/  128728 | consumed samples:        63920 | consumed tokens:    130908160 | elapsed time per iteration (s): 15.21 | learning rate: 2.095E-05 | global batch size:    16 | lm loss: 5.411019E+00 | grad norm: 1.052 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     3996/  128728 | consumed samples:        63936 | consumed tokens:    130940928 | elapsed time per iteration (s): 15.20 | learning rate: 2.095E-05 | global batch size:    16 | lm loss: 5.389151E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     3997/  128728 | consumed samples:        63952 | consumed tokens:    130973696 | elapsed time per iteration (s): 15.18 | learning rate: 2.096E-05 | global batch size:    16 | lm loss: 5.458337E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     3998/  128728 | consumed samples:        63968 | consumed tokens:    131006464 | elapsed time per iteration (s): 15.23 | learning rate: 2.096E-05 | global batch size:    16 | lm loss: 5.207241E+00 | grad norm: 0.648 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     3999/  128728 | consumed samples:        63984 | consumed tokens:    131039232 | elapsed time per iteration (s): 15.21 | learning rate: 2.097E-05 | global batch size:    16 | lm loss: 5.415626E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4000/  128728 | consumed samples:        64000 | consumed tokens:    131072000 | elapsed time per iteration (s): 15.25 | learning rate: 2.097E-05 | global batch size:    16 | lm loss: 5.328699E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default0]:[2022-03-03 22:54:56,238] [INFO] [logging.py:69:log_dist] [Rank 0] step=4000, skipped=0, lr=[2.0971573687228642e-05, 2.0971573687228642e-05, 2.0971573687228642e-05], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
[default0]:steps: 4000 loss: 5.3287 iter time (s): 14.265 samples/sec: 1.122
[default7]:------------------------------------------------------------------------------------------
[default7]:valid loss at iteration 4000 | lm loss value: 5.666330E+00 | lm loss PPL: 2.889720E+02 | 
[default7]:------------------------------------------------------------------------------------------
[default0]:saving checkpoint at iteration    4000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:[2022-03-03 22:55:25,744] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/mp_rank_00_model_states.pt
[default1]:[2022-03-03 22:55:25,763] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/mp_rank_01_model_states.pt
[default7]:[2022-03-03 22:55:38,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt
[default4]:[2022-03-03 22:55:39,465] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 22:55:39,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 22:55:39,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt
[default5]:[2022-03-03 22:55:39,888] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
[default5]:[2022-03-03 22:55:39,963] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 22:55:39,987] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 22:55:40,034] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default6]:[2022-03-03 22:55:40,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 22:55:40,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 22:55:40,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt
[default1]:[2022-03-03 22:55:40,217] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 22:55:40,300] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt
[default0]:[2022-03-03 22:55:40,303] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 22:55:40,396] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt
[default6]:[2022-03-03 22:55:40,358] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt
[default1]:[2022-03-03 22:55:40,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt
[default6]:[2022-03-03 22:55:40,462] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt
[default4]:[2022-03-03 22:55:40,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 22:55:40,696] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 22:55:41,026] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt
[default5]:[2022-03-03 22:55:40,986] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt
[default0]:[2022-03-03 22:55:41,039] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt
[default4]:[2022-03-03 22:55:41,014] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt
[default1]:[2022-03-03 22:55:41,074] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 22:55:41,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 22:55:41,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt
[default0]:[2022-03-03 22:55:41,206] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 22:55:41,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt
[default2]:[2022-03-03 22:55:41,198] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt
[default1]:[2022-03-03 22:55:41,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 22:55:41,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default3]:[2022-03-03 22:55:41,419] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt
[default5]:[2022-03-03 22:55:41,545] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default4]:[2022-03-03 22:55:41,515] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
[default2]:[2022-03-03 22:55:41,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 22:55:41,641] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt
[default3]:[2022-03-03 22:55:41,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default2]:[2022-03-03 22:55:41,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt
[default1]:[2022-03-03 22:55:41,792] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 22:55:42,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt
[default1]:[2022-03-03 22:55:42,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt
[default1]:[2022-03-03 22:55:42,297] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 22:55:42,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt
[default0]:[2022-03-03 22:55:42,316] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt
[default2]:[2022-03-03 22:55:42,328] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default7]:[2022-03-03 22:55:42,491] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt
[default1]:[2022-03-03 22:55:42,458] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 22:55:42,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt
[default0]:[2022-03-03 22:55:42,565] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default5]:[2022-03-03 22:55:42,602] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 22:55:42,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 22:55:42,697] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt
[default5]:[2022-03-03 22:55:42,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 22:55:42,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt
[default4]:[2022-03-03 22:55:42,785] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 22:55:42,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt
[default6]:[2022-03-03 22:55:42,891] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt
[default1]:[2022-03-03 22:55:42,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 22:55:42,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 22:55:42,889] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 22:55:42,891] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt
[default7]:[2022-03-03 22:55:42,993] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt
[default0]:[2022-03-03 22:55:42,951] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 22:55:42,951] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 22:55:42,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default3]:[2022-03-03 22:55:42,990] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt
[default1]:[2022-03-03 22:55:42,885] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt
[default7]:[2022-03-03 22:55:43,137] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt
[default5]:[2022-03-03 22:55:43,122] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt
[default6]:[2022-03-03 22:55:43,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt
[default3]:[2022-03-03 22:55:43,265] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 22:55:43,278] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt
[default7]:[2022-03-03 22:55:43,303] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 22:55:43,332] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt
[default0]:[2022-03-03 22:55:43,199] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
[default0]:[2022-03-03 22:55:43,267] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
[default5]:[2022-03-03 22:55:43,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 22:55:43,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt
[default4]:[2022-03-03 22:55:43,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt
[default0]:[2022-03-03 22:55:43,470] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt
[default3]:[2022-03-03 22:55:43,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 22:55:43,479] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 22:55:43,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt
[default6]:[2022-03-03 22:55:43,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt
[default6]:[2022-03-03 22:55:43,557] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt
[default2]:[2022-03-03 22:55:43,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt
[default1]:[2022-03-03 22:55:43,933] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt
[default7]:[2022-03-03 22:55:43,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt
[default7]:[2022-03-03 22:55:43,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt
[default2]:[2022-03-03 22:55:44,040] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 22:55:44,088] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt
[default3]:[2022-03-03 22:55:44,170] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt
[default7]:[2022-03-03 22:55:44,112] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default6]:[2022-03-03 22:55:44,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt
[default6]:[2022-03-03 22:55:44,205] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt
[default2]:[2022-03-03 22:55:44,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 22:55:44,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
[default3]:[2022-03-03 22:55:44,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt
[default6]:[2022-03-03 22:55:44,510] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt
[default1]:[2022-03-03 22:55:44,529] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default0]:[2022-03-03 22:55:44,553] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt
[default3]:[2022-03-03 22:55:44,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt
[default5]:[2022-03-03 22:55:44,635] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 22:55:44,637] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt
[default2]:[2022-03-03 22:55:44,670] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt
[default3]:[2022-03-03 22:55:44,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt
[default4]:[2022-03-03 22:55:44,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt
[default5]:[2022-03-03 22:55:44,727] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt
[default2]:[2022-03-03 22:55:44,927] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt
[default0]:[2022-03-03 22:55:45,031] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 22:55:45,034] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt
[default7]:[2022-03-03 22:55:45,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt
[default1]:[2022-03-03 22:55:45,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt
[default2]:[2022-03-03 22:55:45,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt
[default2]:[2022-03-03 22:55:45,175] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt
[default6]:[2022-03-03 22:55:45,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt
[default5]:[2022-03-03 22:55:45,287] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 22:55:45,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt
[default3]:[2022-03-03 22:55:45,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 22:55:45,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt
[default1]:[2022-03-03 22:55:45,471] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 22:55:45,529] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt
[default5]:[2022-03-03 22:55:45,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt
[default7]:[2022-03-03 22:55:45,480] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt
[default3]:[2022-03-03 22:55:45,567] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt
[default3]:[2022-03-03 22:55:45,631] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt
[default7]:[2022-03-03 22:55:45,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt
[default6]:[2022-03-03 22:55:45,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default7]:[2022-03-03 22:55:45,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt
[default4]:[2022-03-03 22:55:45,813] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 22:55:45,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 22:55:45,812] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default3]:[2022-03-03 22:55:45,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt
[default4]:[2022-03-03 22:55:45,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default5]:[2022-03-03 22:55:45,974] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt
[default2]:[2022-03-03 22:55:46,158] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt
[default7]:[2022-03-03 22:55:46,206] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt
[default4]:[2022-03-03 22:55:46,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt
[default7]:[2022-03-03 22:55:46,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt
[default1]:[2022-03-03 22:55:46,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt
[default6]:[2022-03-03 22:55:46,268] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 22:55:46,310] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt
[default0]:[2022-03-03 22:55:46,296] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt
[default5]:[2022-03-03 22:55:46,294] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt
[default2]:[2022-03-03 22:55:46,373] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default3]:[2022-03-03 22:55:46,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt
[default4]:[2022-03-03 22:55:46,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 22:55:46,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 22:55:46,443] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt
[default3]:[2022-03-03 22:55:46,491] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 22:55:46,535] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt
[default6]:[2022-03-03 22:55:46,541] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt
[default1]:[2022-03-03 22:55:46,546] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default4]:[2022-03-03 22:55:46,581] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default3]:[2022-03-03 22:55:46,569] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt
[default1]:[2022-03-03 22:55:46,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 22:55:46,694] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt
[default5]:[2022-03-03 22:55:46,707] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt
[default6]:[2022-03-03 22:55:46,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default2]:[2022-03-03 22:55:46,723] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt
[default2]:[2022-03-03 22:55:46,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 22:55:46,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default1]:[2022-03-03 22:55:46,782] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt
[default4]:[2022-03-03 22:55:46,771] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt
[default1]:[2022-03-03 22:55:46,723] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 22:55:46,814] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 22:55:46,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt
[default3]:[2022-03-03 22:55:46,819] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt
[default3]:[2022-03-03 22:55:46,774] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt
[default5]:[2022-03-03 22:55:46,829] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt
[default3]:[2022-03-03 22:55:46,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt
[default7]:[2022-03-03 22:55:46,906] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt
[default3]:[2022-03-03 22:55:46,952] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt
[default7]:[2022-03-03 22:55:46,874] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt
[default2]:[2022-03-03 22:55:46,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt
[default0]:[2022-03-03 22:55:46,921] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt
[default4]:[2022-03-03 22:55:46,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt
[default0]:[2022-03-03 22:55:46,969] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default5]:[2022-03-03 22:55:46,995] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 22:55:46,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt
[default3]:[2022-03-03 22:55:47,057] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default0]:[2022-03-03 22:55:47,082] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt
[default2]:[2022-03-03 22:55:47,085] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default1]:[2022-03-03 22:55:47,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default2]:[2022-03-03 22:55:47,085] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt
[default0]:[2022-03-03 22:55:47,157] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt
[default2]:[2022-03-03 22:55:47,184] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default1]:[2022-03-03 22:55:47,145] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt
[default0]:[2022-03-03 22:55:47,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default2]:[2022-03-03 22:55:47,175] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt
[default4]:[2022-03-03 22:55:47,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt
[default2]:[2022-03-03 22:55:47,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default5]:[2022-03-03 22:55:47,251] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt
[default5]:[2022-03-03 22:55:47,307] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 22:55:47,364] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt
[default3]:[2022-03-03 22:55:47,318] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt
[default7]:[2022-03-03 22:55:47,308] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt
[default7]:[2022-03-03 22:55:47,371] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default0]:[2022-03-03 22:55:47,414] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default0]:[2022-03-03 22:55:47,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default2]:[2022-03-03 22:55:47,434] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt
[default2]:[2022-03-03 22:55:47,445] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default4]:[2022-03-03 22:55:47,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt
[default1]:[2022-03-03 22:55:47,475] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default4]:[2022-03-03 22:55:47,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 22:55:47,566] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt
[default5]:[2022-03-03 22:55:47,542] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default3]:[2022-03-03 22:55:47,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default1]:[2022-03-03 22:55:47,617] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt
[default7]:[2022-03-03 22:55:47,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default2]:[2022-03-03 22:55:47,677] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt
[default0]:[2022-03-03 22:55:47,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt
[default4]:[2022-03-03 22:55:47,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt
[default5]:[2022-03-03 22:55:47,681] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt
[default7]:[2022-03-03 22:55:47,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default5]:[2022-03-03 22:55:47,761] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default1]:[2022-03-03 22:55:47,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default1]:[2022-03-03 22:55:47,755] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt
[default6]:[2022-03-03 22:55:47,791] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default2]:[2022-03-03 22:55:47,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt
[default0]:[2022-03-03 22:55:47,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default1]:[2022-03-03 22:55:47,825] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt
[default6]:[2022-03-03 22:55:47,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 22:55:47,918] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt
[default7]:[2022-03-03 22:55:47,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default7]:[2022-03-03 22:55:47,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default6]:[2022-03-03 22:55:47,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default0]:[2022-03-03 22:55:47,871] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default6]:[2022-03-03 22:55:47,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default4]:[2022-03-03 22:55:47,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 22:55:47,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt
[default3]:[2022-03-03 22:55:48,025] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt
[default1]:[2022-03-03 22:55:47,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default1]:[2022-03-03 22:55:48,036] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default2]:[2022-03-03 22:55:48,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt
[default7]:[2022-03-03 22:55:48,025] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt
[default6]:[2022-03-03 22:55:48,055] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt
[default0]:[2022-03-03 22:55:48,118] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt
[default2]:[2022-03-03 22:55:48,076] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 22:55:48,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt
[default0]:[2022-03-03 22:55:48,098] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt
[default1]:[2022-03-03 22:55:48,136] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt
[default5]:[2022-03-03 22:55:48,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 22:55:48,259] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt
[default6]:[2022-03-03 22:55:48,257] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt
[default5]:[2022-03-03 22:55:48,288] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 22:55:48,380] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt
[default1]:[2022-03-03 22:55:48,435] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 22:55:48,480] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt
[default6]:[2022-03-03 22:55:48,604] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 22:55:48,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt
[default4]:[2022-03-03 22:55:48,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt
[default7]:[2022-03-03 22:55:48,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default3]:[2022-03-03 22:55:48,697] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 22:55:48,709] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt
[default2]:[2022-03-03 22:55:48,709] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt
[default2]:[2022-03-03 22:55:48,828] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt
[default0]:[2022-03-03 22:55:48,848] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 22:55:48,849] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt
[default1]:[2022-03-03 22:55:48,829] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt
[default0]:[2022-03-03 22:55:48,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt
[default6]:[2022-03-03 22:55:49,052] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt
[default4]:[2022-03-03 22:55:49,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default1]:[2022-03-03 22:55:48,980] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt
[default6]:[2022-03-03 22:55:48,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt
[default4]:[2022-03-03 22:55:49,026] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default7]:[2022-03-03 22:55:49,130] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt
[default3]:[2022-03-03 22:55:49,173] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default1]:[2022-03-03 22:55:49,106] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt
[default2]:[2022-03-03 22:55:49,204] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt
[default0]:[2022-03-03 22:55:49,214] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt
[default4]:[2022-03-03 22:55:49,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt
[default0]:[2022-03-03 22:55:49,340] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default2]:[2022-03-03 22:55:49,364] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 22:55:49,344] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 22:55:49,310] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default6]:[2022-03-03 22:55:49,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default4]:[2022-03-03 22:55:49,364] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default3]:[2022-03-03 22:55:49,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default1]:[2022-03-03 22:55:49,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 22:55:49,482] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 22:55:49,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default5]:[2022-03-03 22:55:49,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default7]:[2022-03-03 22:55:49,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default3]:[2022-03-03 22:55:49,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default1]:[2022-03-03 22:55:49,622] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 22:55:49,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default6]:[2022-03-03 22:55:49,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 22:55:49,602] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default0]:[2022-03-03 22:55:49,651] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default7]:[2022-03-03 22:55:49,654] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 22:55:49,657] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default3]:[2022-03-03 22:55:49,922] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default4]:[2022-03-03 22:55:49,962] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 22:55:50,059] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default5]:[2022-03-03 22:55:50,044] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default2]:[2022-03-03 22:55:50,089] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 22:55:50,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 22:55:50,119] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 22:55:50,162] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 22:55:50,206] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 22:55:50,334] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default7]:[2022-03-03 22:55:50,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 22:55:50,374] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default1]:[2022-03-03 22:55:50,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default7]:[2022-03-03 22:55:50,484] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default4]:[2022-03-03 22:55:50,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default6]:[2022-03-03 22:55:50,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 22:55:50,581] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 22:55:50,590] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 22:55:50,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default5]:[2022-03-03 22:55:50,596] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 22:55:50,639] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 22:55:50,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default4]:[2022-03-03 22:55:50,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 22:55:50,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default5]:[2022-03-03 22:55:50,888] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default0]:[2022-03-03 22:55:50,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 22:55:50,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default7]:[2022-03-03 22:55:51,057] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 22:55:51,145] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 22:55:51,074] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default7]:[2022-03-03 22:55:51,181] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default5]:[2022-03-03 22:55:51,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default4]:[2022-03-03 22:55:51,294] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default3]:[2022-03-03 22:55:51,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 22:55:51,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 22:55:51,498] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default3]:[2022-03-03 22:55:51,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default1]:[2022-03-03 22:55:51,538] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default2]:[2022-03-03 22:55:51,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 22:55:51,862] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default5]:[2022-03-03 22:55:51,831] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default5]:[2022-03-03 22:55:51,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default5]:[2022-03-03 22:55:52,014] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default4]:[2022-03-03 22:55:52,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 22:55:52,054] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default4]:[2022-03-03 22:55:52,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 22:55:52,120] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 22:55:52,129] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default0]:[2022-03-03 22:55:52,109] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 22:55:52,119] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default5]:[2022-03-03 22:55:52,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default7]:[2022-03-03 22:55:52,186] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 22:55:52,316] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 22:55:52,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default3]:[2022-03-03 22:55:52,286] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 22:55:52,404] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 22:55:52,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default6]:[2022-03-03 22:55:52,581] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default3]:[2022-03-03 22:55:52,554] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 22:55:52,539] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 22:55:52,572] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 22:55:52,596] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default6]:[2022-03-03 22:55:52,594] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 22:55:52,614] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 22:55:52,751] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default3]:[2022-03-03 22:55:52,736] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 22:55:52,916] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default7]:[2022-03-03 22:55:53,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default4]:[2022-03-03 22:55:53,391] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 22:55:53,386] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default6]:[2022-03-03 22:55:53,793] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default6]:[2022-03-03 22:55:53,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default5]:[2022-03-03 22:55:53,920] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 22:55:53,970] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default7]:[2022-03-03 22:55:54,023] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default5]:[2022-03-03 22:55:54,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default4]:[2022-03-03 22:55:54,176] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 22:55:54,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default6]:[2022-03-03 22:55:54,577] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default6]:[2022-03-03 22:55:54,759] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 22:55:54,813] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 22:55:54,870] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default5]:[2022-03-03 22:55:54,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 22:55:55,042] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default7]:[2022-03-03 22:55:55,019] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default1]:[2022-03-03 22:55:55,298] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 22:55:55,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default2]:[2022-03-03 22:55:55,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default3]:[2022-03-03 22:55:55,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default0]:[2022-03-03 22:55:56,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default7]:time (ms) | save-checkpoint: 39581.71
[default0]:  successfully saved checkpoint at iteration    4000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default1]:[2022-03-03 22:55:56,277] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default7]: iteration     4001/  128728 | consumed samples:        64016 | consumed tokens:    131104768 | elapsed time per iteration (s): 74.30 | learning rate: 2.098E-05 | global batch size:    16 | lm loss: 5.543649E+00 | grad norm: 1.476 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.215 | TFLOPs: 1.65 |
[default7]: iteration     4002/  128728 | consumed samples:        64032 | consumed tokens:    131137536 | elapsed time per iteration (s): 15.26 | learning rate: 2.098E-05 | global batch size:    16 | lm loss: 5.309994E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4003/  128728 | consumed samples:        64048 | consumed tokens:    131170304 | elapsed time per iteration (s): 15.25 | learning rate: 2.099E-05 | global batch size:    16 | lm loss: 5.382421E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4004/  128728 | consumed samples:        64064 | consumed tokens:    131203072 | elapsed time per iteration (s): 15.21 | learning rate: 2.099E-05 | global batch size:    16 | lm loss: 5.472140E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4005/  128728 | consumed samples:        64080 | consumed tokens:    131235840 | elapsed time per iteration (s): 15.21 | learning rate: 2.100E-05 | global batch size:    16 | lm loss: 5.577133E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4006/  128728 | consumed samples:        64096 | consumed tokens:    131268608 | elapsed time per iteration (s): 15.20 | learning rate: 2.100E-05 | global batch size:    16 | lm loss: 5.207786E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4007/  128728 | consumed samples:        64112 | consumed tokens:    131301376 | elapsed time per iteration (s): 15.17 | learning rate: 2.101E-05 | global batch size:    16 | lm loss: 5.436874E+00 | grad norm: 0.891 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4008/  128728 | consumed samples:        64128 | consumed tokens:    131334144 | elapsed time per iteration (s): 15.21 | learning rate: 2.101E-05 | global batch size:    16 | lm loss: 5.084764E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4009/  128728 | consumed samples:        64144 | consumed tokens:    131366912 | elapsed time per iteration (s): 15.19 | learning rate: 2.102E-05 | global batch size:    16 | lm loss: 5.114124E+00 | grad norm: 1.494 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4010/  128728 | consumed samples:        64160 | consumed tokens:    131399680 | elapsed time per iteration (s): 15.23 | learning rate: 2.102E-05 | global batch size:    16 | lm loss: 5.451921E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4011/  128728 | consumed samples:        64176 | consumed tokens:    131432448 | elapsed time per iteration (s): 15.21 | learning rate: 2.103E-05 | global batch size:    16 | lm loss: 5.215581E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4012/  128728 | consumed samples:        64192 | consumed tokens:    131465216 | elapsed time per iteration (s): 15.18 | learning rate: 2.103E-05 | global batch size:    16 | lm loss: 5.158479E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4013/  128728 | consumed samples:        64208 | consumed tokens:    131497984 | elapsed time per iteration (s): 15.18 | learning rate: 2.104E-05 | global batch size:    16 | lm loss: 5.238644E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4014/  128728 | consumed samples:        64224 | consumed tokens:    131530752 | elapsed time per iteration (s): 15.29 | learning rate: 2.104E-05 | global batch size:    16 | lm loss: 5.194250E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration     4015/  128728 | consumed samples:        64240 | consumed tokens:    131563520 | elapsed time per iteration (s): 15.23 | learning rate: 2.105E-05 | global batch size:    16 | lm loss: 5.281526E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4016/  128728 | consumed samples:        64256 | consumed tokens:    131596288 | elapsed time per iteration (s): 15.20 | learning rate: 2.106E-05 | global batch size:    16 | lm loss: 5.243568E+00 | grad norm: 1.218 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4017/  128728 | consumed samples:        64272 | consumed tokens:    131629056 | elapsed time per iteration (s): 15.20 | learning rate: 2.106E-05 | global batch size:    16 | lm loss: 5.439724E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4018/  128728 | consumed samples:        64288 | consumed tokens:    131661824 | elapsed time per iteration (s): 15.23 | learning rate: 2.107E-05 | global batch size:    16 | lm loss: 5.292508E+00 | grad norm: 0.651 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4019/  128728 | consumed samples:        64304 | consumed tokens:    131694592 | elapsed time per iteration (s): 15.23 | learning rate: 2.107E-05 | global batch size:    16 | lm loss: 5.304052E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4020/  128728 | consumed samples:        64320 | consumed tokens:    131727360 | elapsed time per iteration (s): 15.24 | learning rate: 2.108E-05 | global batch size:    16 | lm loss: 5.270075E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4021/  128728 | consumed samples:        64336 | consumed tokens:    131760128 | elapsed time per iteration (s): 15.21 | learning rate: 2.108E-05 | global batch size:    16 | lm loss: 5.335760E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4022/  128728 | consumed samples:        64352 | consumed tokens:    131792896 | elapsed time per iteration (s): 15.20 | learning rate: 2.109E-05 | global batch size:    16 | lm loss: 5.347544E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4023/  128728 | consumed samples:        64368 | consumed tokens:    131825664 | elapsed time per iteration (s): 15.20 | learning rate: 2.109E-05 | global batch size:    16 | lm loss: 5.364645E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4024/  128728 | consumed samples:        64384 | consumed tokens:    131858432 | elapsed time per iteration (s): 15.21 | learning rate: 2.110E-05 | global batch size:    16 | lm loss: 5.381454E+00 | grad norm: 0.852 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4025/  128728 | consumed samples:        64400 | consumed tokens:    131891200 | elapsed time per iteration (s): 15.20 | learning rate: 2.110E-05 | global batch size:    16 | lm loss: 5.369591E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4026/  128728 | consumed samples:        64416 | consumed tokens:    131923968 | elapsed time per iteration (s): 15.24 | learning rate: 2.111E-05 | global batch size:    16 | lm loss: 5.085846E+00 | grad norm: 0.683 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4027/  128728 | consumed samples:        64432 | consumed tokens:    131956736 | elapsed time per iteration (s): 15.26 | learning rate: 2.111E-05 | global batch size:    16 | lm loss: 5.172122E+00 | grad norm: 1.055 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4028/  128728 | consumed samples:        64448 | consumed tokens:    131989504 | elapsed time per iteration (s): 15.17 | learning rate: 2.112E-05 | global batch size:    16 | lm loss: 5.216266E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4029/  128728 | consumed samples:        64464 | consumed tokens:    132022272 | elapsed time per iteration (s): 15.17 | learning rate: 2.112E-05 | global batch size:    16 | lm loss: 5.658593E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4030/  128728 | consumed samples:        64480 | consumed tokens:    132055040 | elapsed time per iteration (s): 15.26 | learning rate: 2.113E-05 | global batch size:    16 | lm loss: 5.392428E+00 | grad norm: 1.095 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4031/  128728 | consumed samples:        64496 | consumed tokens:    132087808 | elapsed time per iteration (s): 15.24 | learning rate: 2.113E-05 | global batch size:    16 | lm loss: 5.504789E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4032/  128728 | consumed samples:        64512 | consumed tokens:    132120576 | elapsed time per iteration (s): 15.23 | learning rate: 2.114E-05 | global batch size:    16 | lm loss: 5.514112E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4033/  128728 | consumed samples:        64528 | consumed tokens:    132153344 | elapsed time per iteration (s): 15.23 | learning rate: 2.114E-05 | global batch size:    16 | lm loss: 5.394161E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4034/  128728 | consumed samples:        64544 | consumed tokens:    132186112 | elapsed time per iteration (s): 15.20 | learning rate: 2.115E-05 | global batch size:    16 | lm loss: 5.352733E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4035/  128728 | consumed samples:        64560 | consumed tokens:    132218880 | elapsed time per iteration (s): 15.17 | learning rate: 2.116E-05 | global batch size:    16 | lm loss: 5.341866E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4036/  128728 | consumed samples:        64576 | consumed tokens:    132251648 | elapsed time per iteration (s): 15.21 | learning rate: 2.116E-05 | global batch size:    16 | lm loss: 5.249400E+00 | grad norm: 0.860 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4037/  128728 | consumed samples:        64592 | consumed tokens:    132284416 | elapsed time per iteration (s): 15.23 | learning rate: 2.117E-05 | global batch size:    16 | lm loss: 5.349155E+00 | grad norm: 0.646 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4038/  128728 | consumed samples:        64608 | consumed tokens:    132317184 | elapsed time per iteration (s): 15.24 | learning rate: 2.117E-05 | global batch size:    16 | lm loss: 5.406515E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4039/  128728 | consumed samples:        64624 | consumed tokens:    132349952 | elapsed time per iteration (s): 15.21 | learning rate: 2.118E-05 | global batch size:    16 | lm loss: 5.378917E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4040/  128728 | consumed samples:        64640 | consumed tokens:    132382720 | elapsed time per iteration (s): 15.20 | learning rate: 2.118E-05 | global batch size:    16 | lm loss: 5.407258E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4041/  128728 | consumed samples:        64656 | consumed tokens:    132415488 | elapsed time per iteration (s): 15.23 | learning rate: 2.119E-05 | global batch size:    16 | lm loss: 5.460241E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4042/  128728 | consumed samples:        64672 | consumed tokens:    132448256 | elapsed time per iteration (s): 15.17 | learning rate: 2.119E-05 | global batch size:    16 | lm loss: 5.356334E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4043/  128728 | consumed samples:        64688 | consumed tokens:    132481024 | elapsed time per iteration (s): 15.23 | learning rate: 2.120E-05 | global batch size:    16 | lm loss: 5.390794E+00 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4044/  128728 | consumed samples:        64704 | consumed tokens:    132513792 | elapsed time per iteration (s): 15.24 | learning rate: 2.120E-05 | global batch size:    16 | lm loss: 5.281722E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4045/  128728 | consumed samples:        64720 | consumed tokens:    132546560 | elapsed time per iteration (s): 15.19 | learning rate: 2.121E-05 | global batch size:    16 | lm loss: 5.315298E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4046/  128728 | consumed samples:        64736 | consumed tokens:    132579328 | elapsed time per iteration (s): 15.22 | learning rate: 2.121E-05 | global batch size:    16 | lm loss: 5.433873E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4047/  128728 | consumed samples:        64752 | consumed tokens:    132612096 | elapsed time per iteration (s): 15.24 | learning rate: 2.122E-05 | global batch size:    16 | lm loss: 5.412467E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4048/  128728 | consumed samples:        64768 | consumed tokens:    132644864 | elapsed time per iteration (s): 15.23 | learning rate: 2.122E-05 | global batch size:    16 | lm loss: 5.430539E+00 | grad norm: 0.621 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4049/  128728 | consumed samples:        64784 | consumed tokens:    132677632 | elapsed time per iteration (s): 15.16 | learning rate: 2.123E-05 | global batch size:    16 | lm loss: 5.519146E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4050/  128728 | consumed samples:        64800 | consumed tokens:    132710400 | elapsed time per iteration (s): 15.21 | learning rate: 2.123E-05 | global batch size:    16 | lm loss: 5.257305E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4051/  128728 | consumed samples:        64816 | consumed tokens:    132743168 | elapsed time per iteration (s): 15.23 | learning rate: 2.124E-05 | global batch size:    16 | lm loss: 5.442199E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4052/  128728 | consumed samples:        64832 | consumed tokens:    132775936 | elapsed time per iteration (s): 15.24 | learning rate: 2.124E-05 | global batch size:    16 | lm loss: 5.309348E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4053/  128728 | consumed samples:        64848 | consumed tokens:    132808704 | elapsed time per iteration (s): 15.23 | learning rate: 2.125E-05 | global batch size:    16 | lm loss: 5.319548E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4054/  128728 | consumed samples:        64864 | consumed tokens:    132841472 | elapsed time per iteration (s): 15.25 | learning rate: 2.125E-05 | global batch size:    16 | lm loss: 5.239305E+00 | grad norm: 1.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4055/  128728 | consumed samples:        64880 | consumed tokens:    132874240 | elapsed time per iteration (s): 15.20 | learning rate: 2.126E-05 | global batch size:    16 | lm loss: 5.223973E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4056/  128728 | consumed samples:        64896 | consumed tokens:    132907008 | elapsed time per iteration (s): 15.21 | learning rate: 2.127E-05 | global batch size:    16 | lm loss: 5.312015E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4057/  128728 | consumed samples:        64912 | consumed tokens:    132939776 | elapsed time per iteration (s): 15.22 | learning rate: 2.127E-05 | global batch size:    16 | lm loss: 5.303139E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4058/  128728 | consumed samples:        64928 | consumed tokens:    132972544 | elapsed time per iteration (s): 15.20 | learning rate: 2.128E-05 | global batch size:    16 | lm loss: 5.248675E+00 | grad norm: 0.865 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4059/  128728 | consumed samples:        64944 | consumed tokens:    133005312 | elapsed time per iteration (s): 15.23 | learning rate: 2.128E-05 | global batch size:    16 | lm loss: 5.210124E+00 | grad norm: 0.959 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4060/  128728 | consumed samples:        64960 | consumed tokens:    133038080 | elapsed time per iteration (s): 15.21 | learning rate: 2.129E-05 | global batch size:    16 | lm loss: 5.407516E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4061/  128728 | consumed samples:        64976 | consumed tokens:    133070848 | elapsed time per iteration (s): 15.24 | learning rate: 2.129E-05 | global batch size:    16 | lm loss: 5.311096E+00 | grad norm: 0.669 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4062/  128728 | consumed samples:        64992 | consumed tokens:    133103616 | elapsed time per iteration (s): 15.21 | learning rate: 2.130E-05 | global batch size:    16 | lm loss: 5.263693E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4063/  128728 | consumed samples:        65008 | consumed tokens:    133136384 | elapsed time per iteration (s): 15.24 | learning rate: 2.130E-05 | global batch size:    16 | lm loss: 5.587279E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4064/  128728 | consumed samples:        65024 | consumed tokens:    133169152 | elapsed time per iteration (s): 15.22 | learning rate: 2.131E-05 | global batch size:    16 | lm loss: 5.389854E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4065/  128728 | consumed samples:        65040 | consumed tokens:    133201920 | elapsed time per iteration (s): 15.22 | learning rate: 2.131E-05 | global batch size:    16 | lm loss: 5.493057E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4066/  128728 | consumed samples:        65056 | consumed tokens:    133234688 | elapsed time per iteration (s): 15.21 | learning rate: 2.132E-05 | global batch size:    16 | lm loss: 5.206816E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4067/  128728 | consumed samples:        65072 | consumed tokens:    133267456 | elapsed time per iteration (s): 15.23 | learning rate: 2.132E-05 | global batch size:    16 | lm loss: 5.457879E+00 | grad norm: 0.985 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4068/  128728 | consumed samples:        65088 | consumed tokens:    133300224 | elapsed time per iteration (s): 15.16 | learning rate: 2.133E-05 | global batch size:    16 | lm loss: 5.155887E+00 | grad norm: 0.949 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4069/  128728 | consumed samples:        65104 | consumed tokens:    133332992 | elapsed time per iteration (s): 15.22 | learning rate: 2.133E-05 | global batch size:    16 | lm loss: 5.326896E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4070/  128728 | consumed samples:        65120 | consumed tokens:    133365760 | elapsed time per iteration (s): 15.21 | learning rate: 2.134E-05 | global batch size:    16 | lm loss: 5.390995E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4071/  128728 | consumed samples:        65136 | consumed tokens:    133398528 | elapsed time per iteration (s): 15.21 | learning rate: 2.134E-05 | global batch size:    16 | lm loss: 5.471291E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4072/  128728 | consumed samples:        65152 | consumed tokens:    133431296 | elapsed time per iteration (s): 15.17 | learning rate: 2.135E-05 | global batch size:    16 | lm loss: 5.400915E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     4073/  128728 | consumed samples:        65168 | consumed tokens:    133464064 | elapsed time per iteration (s): 15.21 | learning rate: 2.135E-05 | global batch size:    16 | lm loss: 5.486737E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4074/  128728 | consumed samples:        65184 | consumed tokens:    133496832 | elapsed time per iteration (s): 15.21 | learning rate: 2.136E-05 | global batch size:    16 | lm loss: 5.547410E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4075/  128728 | consumed samples:        65200 | consumed tokens:    133529600 | elapsed time per iteration (s): 15.24 | learning rate: 2.136E-05 | global batch size:    16 | lm loss: 5.069981E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4076/  128728 | consumed samples:        65216 | consumed tokens:    133562368 | elapsed time per iteration (s): 15.18 | learning rate: 2.137E-05 | global batch size:    16 | lm loss: 5.209479E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4077/  128728 | consumed samples:        65232 | consumed tokens:    133595136 | elapsed time per iteration (s): 15.21 | learning rate: 2.138E-05 | global batch size:    16 | lm loss: 5.274742E+00 | grad norm: 1.531 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4078/  128728 | consumed samples:        65248 | consumed tokens:    133627904 | elapsed time per iteration (s): 15.22 | learning rate: 2.138E-05 | global batch size:    16 | lm loss: 5.524727E+00 | grad norm: 1.002 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4079/  128728 | consumed samples:        65264 | consumed tokens:    133660672 | elapsed time per iteration (s): 15.17 | learning rate: 2.139E-05 | global batch size:    16 | lm loss: 5.480323E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4080/  128728 | consumed samples:        65280 | consumed tokens:    133693440 | elapsed time per iteration (s): 15.21 | learning rate: 2.139E-05 | global batch size:    16 | lm loss: 5.410918E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4081/  128728 | consumed samples:        65296 | consumed tokens:    133726208 | elapsed time per iteration (s): 15.18 | learning rate: 2.140E-05 | global batch size:    16 | lm loss: 5.439363E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4082/  128728 | consumed samples:        65312 | consumed tokens:    133758976 | elapsed time per iteration (s): 15.22 | learning rate: 2.140E-05 | global batch size:    16 | lm loss: 5.484829E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4083/  128728 | consumed samples:        65328 | consumed tokens:    133791744 | elapsed time per iteration (s): 15.22 | learning rate: 2.141E-05 | global batch size:    16 | lm loss: 5.084867E+00 | grad norm: 1.353 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4084/  128728 | consumed samples:        65344 | consumed tokens:    133824512 | elapsed time per iteration (s): 15.23 | learning rate: 2.141E-05 | global batch size:    16 | lm loss: 5.351573E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4085/  128728 | consumed samples:        65360 | consumed tokens:    133857280 | elapsed time per iteration (s): 15.20 | learning rate: 2.142E-05 | global batch size:    16 | lm loss: 5.428648E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4086/  128728 | consumed samples:        65376 | consumed tokens:    133890048 | elapsed time per iteration (s): 15.23 | learning rate: 2.142E-05 | global batch size:    16 | lm loss: 5.263672E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4087/  128728 | consumed samples:        65392 | consumed tokens:    133922816 | elapsed time per iteration (s): 15.24 | learning rate: 2.143E-05 | global batch size:    16 | lm loss: 5.312733E+00 | grad norm: 5.160 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4088/  128728 | consumed samples:        65408 | consumed tokens:    133955584 | elapsed time per iteration (s): 15.15 | learning rate: 2.143E-05 | global batch size:    16 | lm loss: 5.681293E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     4089/  128728 | consumed samples:        65424 | consumed tokens:    133988352 | elapsed time per iteration (s): 15.23 | learning rate: 2.144E-05 | global batch size:    16 | lm loss: 5.367632E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4090/  128728 | consumed samples:        65440 | consumed tokens:    134021120 | elapsed time per iteration (s): 15.22 | learning rate: 2.144E-05 | global batch size:    16 | lm loss: 5.301403E+00 | grad norm: 2.269 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4091/  128728 | consumed samples:        65456 | consumed tokens:    134053888 | elapsed time per iteration (s): 15.20 | learning rate: 2.145E-05 | global batch size:    16 | lm loss: 5.406228E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4092/  128728 | consumed samples:        65472 | consumed tokens:    134086656 | elapsed time per iteration (s): 15.22 | learning rate: 2.145E-05 | global batch size:    16 | lm loss: 5.559772E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4093/  128728 | consumed samples:        65488 | consumed tokens:    134119424 | elapsed time per iteration (s): 15.25 | learning rate: 2.146E-05 | global batch size:    16 | lm loss: 5.276206E+00 | grad norm: 2.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4094/  128728 | consumed samples:        65504 | consumed tokens:    134152192 | elapsed time per iteration (s): 15.25 | learning rate: 2.146E-05 | global batch size:    16 | lm loss: 5.252672E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4095/  128728 | consumed samples:        65520 | consumed tokens:    134184960 | elapsed time per iteration (s): 15.22 | learning rate: 2.147E-05 | global batch size:    16 | lm loss: 5.704553E+00 | grad norm: 0.956 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4096/  128728 | consumed samples:        65536 | consumed tokens:    134217728 | elapsed time per iteration (s): 15.24 | learning rate: 2.147E-05 | global batch size:    16 | lm loss: 5.402080E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4097/  128728 | consumed samples:        65552 | consumed tokens:    134250496 | elapsed time per iteration (s): 15.25 | learning rate: 2.148E-05 | global batch size:    16 | lm loss: 5.459865E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4098/  128728 | consumed samples:        65568 | consumed tokens:    134283264 | elapsed time per iteration (s): 15.22 | learning rate: 2.149E-05 | global batch size:    16 | lm loss: 5.188199E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4099/  128728 | consumed samples:        65584 | consumed tokens:    134316032 | elapsed time per iteration (s): 15.17 | learning rate: 2.149E-05 | global batch size:    16 | lm loss: 5.217367E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4100/  128728 | consumed samples:        65600 | consumed tokens:    134348800 | elapsed time per iteration (s): 15.25 | learning rate: 2.150E-05 | global batch size:    16 | lm loss: 5.555455E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4101/  128728 | consumed samples:        65616 | consumed tokens:    134381568 | elapsed time per iteration (s): 15.26 | learning rate: 2.150E-05 | global batch size:    16 | lm loss: 5.449624E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4102/  128728 | consumed samples:        65632 | consumed tokens:    134414336 | elapsed time per iteration (s): 15.24 | learning rate: 2.151E-05 | global batch size:    16 | lm loss: 5.290594E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4103/  128728 | consumed samples:        65648 | consumed tokens:    134447104 | elapsed time per iteration (s): 15.19 | learning rate: 2.151E-05 | global batch size:    16 | lm loss: 5.386337E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4104/  128728 | consumed samples:        65664 | consumed tokens:    134479872 | elapsed time per iteration (s): 15.21 | learning rate: 2.152E-05 | global batch size:    16 | lm loss: 5.386989E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4105/  128728 | consumed samples:        65680 | consumed tokens:    134512640 | elapsed time per iteration (s): 15.20 | learning rate: 2.152E-05 | global batch size:    16 | lm loss: 5.478673E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4106/  128728 | consumed samples:        65696 | consumed tokens:    134545408 | elapsed time per iteration (s): 15.22 | learning rate: 2.153E-05 | global batch size:    16 | lm loss: 5.298902E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4107/  128728 | consumed samples:        65712 | consumed tokens:    134578176 | elapsed time per iteration (s): 15.24 | learning rate: 2.153E-05 | global batch size:    16 | lm loss: 5.438011E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4108/  128728 | consumed samples:        65728 | consumed tokens:    134610944 | elapsed time per iteration (s): 15.20 | learning rate: 2.154E-05 | global batch size:    16 | lm loss: 5.436982E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4109/  128728 | consumed samples:        65744 | consumed tokens:    134643712 | elapsed time per iteration (s): 15.16 | learning rate: 2.154E-05 | global batch size:    16 | lm loss: 5.352978E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4110/  128728 | consumed samples:        65760 | consumed tokens:    134676480 | elapsed time per iteration (s): 15.16 | learning rate: 2.155E-05 | global batch size:    16 | lm loss: 5.435023E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     4111/  128728 | consumed samples:        65776 | consumed tokens:    134709248 | elapsed time per iteration (s): 15.25 | learning rate: 2.155E-05 | global batch size:    16 | lm loss: 5.427981E+00 | grad norm: 3.049 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4112/  128728 | consumed samples:        65792 | consumed tokens:    134742016 | elapsed time per iteration (s): 15.19 | learning rate: 2.156E-05 | global batch size:    16 | lm loss: 5.224471E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4113/  128728 | consumed samples:        65808 | consumed tokens:    134774784 | elapsed time per iteration (s): 15.21 | learning rate: 2.156E-05 | global batch size:    16 | lm loss: 5.465331E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4114/  128728 | consumed samples:        65824 | consumed tokens:    134807552 | elapsed time per iteration (s): 15.20 | learning rate: 2.157E-05 | global batch size:    16 | lm loss: 5.457709E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4115/  128728 | consumed samples:        65840 | consumed tokens:    134840320 | elapsed time per iteration (s): 15.20 | learning rate: 2.157E-05 | global batch size:    16 | lm loss: 5.527482E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4116/  128728 | consumed samples:        65856 | consumed tokens:    134873088 | elapsed time per iteration (s): 15.20 | learning rate: 2.158E-05 | global batch size:    16 | lm loss: 5.455926E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4117/  128728 | consumed samples:        65872 | consumed tokens:    134905856 | elapsed time per iteration (s): 15.20 | learning rate: 2.158E-05 | global batch size:    16 | lm loss: 5.303745E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4118/  128728 | consumed samples:        65888 | consumed tokens:    134938624 | elapsed time per iteration (s): 15.18 | learning rate: 2.159E-05 | global batch size:    16 | lm loss: 5.071564E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4119/  128728 | consumed samples:        65904 | consumed tokens:    134971392 | elapsed time per iteration (s): 15.22 | learning rate: 2.160E-05 | global batch size:    16 | lm loss: 5.477110E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4120/  128728 | consumed samples:        65920 | consumed tokens:    135004160 | elapsed time per iteration (s): 15.22 | learning rate: 2.160E-05 | global batch size:    16 | lm loss: 5.212381E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4121/  128728 | consumed samples:        65936 | consumed tokens:    135036928 | elapsed time per iteration (s): 15.20 | learning rate: 2.161E-05 | global batch size:    16 | lm loss: 5.270615E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4122/  128728 | consumed samples:        65952 | consumed tokens:    135069696 | elapsed time per iteration (s): 15.25 | learning rate: 2.161E-05 | global batch size:    16 | lm loss: 5.196959E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4123/  128728 | consumed samples:        65968 | consumed tokens:    135102464 | elapsed time per iteration (s): 15.23 | learning rate: 2.162E-05 | global batch size:    16 | lm loss: 5.202285E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4124/  128728 | consumed samples:        65984 | consumed tokens:    135135232 | elapsed time per iteration (s): 15.24 | learning rate: 2.162E-05 | global batch size:    16 | lm loss: 5.252374E+00 | grad norm: 1.102 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4125/  128728 | consumed samples:        66000 | consumed tokens:    135168000 | elapsed time per iteration (s): 15.22 | learning rate: 2.163E-05 | global batch size:    16 | lm loss: 5.523699E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4126/  128728 | consumed samples:        66016 | consumed tokens:    135200768 | elapsed time per iteration (s): 15.25 | learning rate: 2.163E-05 | global batch size:    16 | lm loss: 5.421499E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4127/  128728 | consumed samples:        66032 | consumed tokens:    135233536 | elapsed time per iteration (s): 15.23 | learning rate: 2.164E-05 | global batch size:    16 | lm loss: 5.364021E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4128/  128728 | consumed samples:        66048 | consumed tokens:    135266304 | elapsed time per iteration (s): 15.28 | learning rate: 2.164E-05 | global batch size:    16 | lm loss: 5.282767E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     4129/  128728 | consumed samples:        66064 | consumed tokens:    135299072 | elapsed time per iteration (s): 15.23 | learning rate: 2.165E-05 | global batch size:    16 | lm loss: 5.315971E+00 | grad norm: 1.252 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4130/  128728 | consumed samples:        66080 | consumed tokens:    135331840 | elapsed time per iteration (s): 15.22 | learning rate: 2.165E-05 | global batch size:    16 | lm loss: 5.305749E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4131/  128728 | consumed samples:        66096 | consumed tokens:    135364608 | elapsed time per iteration (s): 15.24 | learning rate: 2.166E-05 | global batch size:    16 | lm loss: 5.339551E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4132/  128728 | consumed samples:        66112 | consumed tokens:    135397376 | elapsed time per iteration (s): 15.16 | learning rate: 2.166E-05 | global batch size:    16 | lm loss: 5.253937E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4133/  128728 | consumed samples:        66128 | consumed tokens:    135430144 | elapsed time per iteration (s): 15.22 | learning rate: 2.167E-05 | global batch size:    16 | lm loss: 5.494246E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4134/  128728 | consumed samples:        66144 | consumed tokens:    135462912 | elapsed time per iteration (s): 15.21 | learning rate: 2.167E-05 | global batch size:    16 | lm loss: 5.367308E+00 | grad norm: 1.105 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4135/  128728 | consumed samples:        66160 | consumed tokens:    135495680 | elapsed time per iteration (s): 15.21 | learning rate: 2.168E-05 | global batch size:    16 | lm loss: 5.511875E+00 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4136/  128728 | consumed samples:        66176 | consumed tokens:    135528448 | elapsed time per iteration (s): 15.13 | learning rate: 2.168E-05 | global batch size:    16 | lm loss: 5.383279E+00 | grad norm: 0.633 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.10 |
[default7]: iteration     4137/  128728 | consumed samples:        66192 | consumed tokens:    135561216 | elapsed time per iteration (s): 15.20 | learning rate: 2.169E-05 | global batch size:    16 | lm loss: 5.457438E+00 | grad norm: 0.905 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4138/  128728 | consumed samples:        66208 | consumed tokens:    135593984 | elapsed time per iteration (s): 15.20 | learning rate: 2.170E-05 | global batch size:    16 | lm loss: 5.423141E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4139/  128728 | consumed samples:        66224 | consumed tokens:    135626752 | elapsed time per iteration (s): 15.18 | learning rate: 2.170E-05 | global batch size:    16 | lm loss: 5.250698E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4140/  128728 | consumed samples:        66240 | consumed tokens:    135659520 | elapsed time per iteration (s): 15.23 | learning rate: 2.171E-05 | global batch size:    16 | lm loss: 5.458531E+00 | grad norm: 1.062 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4141/  128728 | consumed samples:        66256 | consumed tokens:    135692288 | elapsed time per iteration (s): 15.22 | learning rate: 2.171E-05 | global batch size:    16 | lm loss: 5.207209E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4142/  128728 | consumed samples:        66272 | consumed tokens:    135725056 | elapsed time per iteration (s): 15.23 | learning rate: 2.172E-05 | global batch size:    16 | lm loss: 5.151188E+00 | grad norm: 0.978 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4143/  128728 | consumed samples:        66288 | consumed tokens:    135757824 | elapsed time per iteration (s): 15.15 | learning rate: 2.172E-05 | global batch size:    16 | lm loss: 5.418912E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     4144/  128728 | consumed samples:        66304 | consumed tokens:    135790592 | elapsed time per iteration (s): 15.24 | learning rate: 2.173E-05 | global batch size:    16 | lm loss: 5.352671E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4145/  128728 | consumed samples:        66320 | consumed tokens:    135823360 | elapsed time per iteration (s): 15.23 | learning rate: 2.173E-05 | global batch size:    16 | lm loss: 5.245791E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4146/  128728 | consumed samples:        66336 | consumed tokens:    135856128 | elapsed time per iteration (s): 15.17 | learning rate: 2.174E-05 | global batch size:    16 | lm loss: 5.323851E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4147/  128728 | consumed samples:        66352 | consumed tokens:    135888896 | elapsed time per iteration (s): 15.23 | learning rate: 2.174E-05 | global batch size:    16 | lm loss: 5.287149E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4148/  128728 | consumed samples:        66368 | consumed tokens:    135921664 | elapsed time per iteration (s): 15.22 | learning rate: 2.175E-05 | global batch size:    16 | lm loss: 5.266491E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4149/  128728 | consumed samples:        66384 | consumed tokens:    135954432 | elapsed time per iteration (s): 15.23 | learning rate: 2.175E-05 | global batch size:    16 | lm loss: 5.090036E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4150/  128728 | consumed samples:        66400 | consumed tokens:    135987200 | elapsed time per iteration (s): 15.24 | learning rate: 2.176E-05 | global batch size:    16 | lm loss: 5.160160E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4151/  128728 | consumed samples:        66416 | consumed tokens:    136019968 | elapsed time per iteration (s): 15.23 | learning rate: 2.176E-05 | global batch size:    16 | lm loss: 5.263034E+00 | grad norm: 0.865 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4152/  128728 | consumed samples:        66432 | consumed tokens:    136052736 | elapsed time per iteration (s): 15.20 | learning rate: 2.177E-05 | global batch size:    16 | lm loss: 5.461198E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4153/  128728 | consumed samples:        66448 | consumed tokens:    136085504 | elapsed time per iteration (s): 15.22 | learning rate: 2.177E-05 | global batch size:    16 | lm loss: 5.331557E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4154/  128728 | consumed samples:        66464 | consumed tokens:    136118272 | elapsed time per iteration (s): 15.21 | learning rate: 2.178E-05 | global batch size:    16 | lm loss: 5.365318E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4155/  128728 | consumed samples:        66480 | consumed tokens:    136151040 | elapsed time per iteration (s): 15.23 | learning rate: 2.178E-05 | global batch size:    16 | lm loss: 5.274574E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4156/  128728 | consumed samples:        66496 | consumed tokens:    136183808 | elapsed time per iteration (s): 15.19 | learning rate: 2.179E-05 | global batch size:    16 | lm loss: 5.311491E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4157/  128728 | consumed samples:        66512 | consumed tokens:    136216576 | elapsed time per iteration (s): 15.20 | learning rate: 2.179E-05 | global batch size:    16 | lm loss: 5.222072E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4158/  128728 | consumed samples:        66528 | consumed tokens:    136249344 | elapsed time per iteration (s): 15.24 | learning rate: 2.180E-05 | global batch size:    16 | lm loss: 5.269310E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4159/  128728 | consumed samples:        66544 | consumed tokens:    136282112 | elapsed time per iteration (s): 15.25 | learning rate: 2.181E-05 | global batch size:    16 | lm loss: 5.600447E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4160/  128728 | consumed samples:        66560 | consumed tokens:    136314880 | elapsed time per iteration (s): 15.20 | learning rate: 2.181E-05 | global batch size:    16 | lm loss: 5.225094E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4161/  128728 | consumed samples:        66576 | consumed tokens:    136347648 | elapsed time per iteration (s): 15.25 | learning rate: 2.182E-05 | global batch size:    16 | lm loss: 5.211500E+00 | grad norm: 0.992 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4162/  128728 | consumed samples:        66592 | consumed tokens:    136380416 | elapsed time per iteration (s): 15.21 | learning rate: 2.182E-05 | global batch size:    16 | lm loss: 5.321060E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4163/  128728 | consumed samples:        66608 | consumed tokens:    136413184 | elapsed time per iteration (s): 15.25 | learning rate: 2.183E-05 | global batch size:    16 | lm loss: 5.462878E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4164/  128728 | consumed samples:        66624 | consumed tokens:    136445952 | elapsed time per iteration (s): 15.16 | learning rate: 2.183E-05 | global batch size:    16 | lm loss: 5.297573E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4165/  128728 | consumed samples:        66640 | consumed tokens:    136478720 | elapsed time per iteration (s): 15.25 | learning rate: 2.184E-05 | global batch size:    16 | lm loss: 5.334221E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4166/  128728 | consumed samples:        66656 | consumed tokens:    136511488 | elapsed time per iteration (s): 15.18 | learning rate: 2.184E-05 | global batch size:    16 | lm loss: 5.570589E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4167/  128728 | consumed samples:        66672 | consumed tokens:    136544256 | elapsed time per iteration (s): 15.21 | learning rate: 2.185E-05 | global batch size:    16 | lm loss: 5.293012E+00 | grad norm: 0.923 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4168/  128728 | consumed samples:        66688 | consumed tokens:    136577024 | elapsed time per iteration (s): 15.23 | learning rate: 2.185E-05 | global batch size:    16 | lm loss: 5.266202E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4169/  128728 | consumed samples:        66704 | consumed tokens:    136609792 | elapsed time per iteration (s): 15.17 | learning rate: 2.186E-05 | global batch size:    16 | lm loss: 5.267851E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     4170/  128728 | consumed samples:        66720 | consumed tokens:    136642560 | elapsed time per iteration (s): 15.17 | learning rate: 2.186E-05 | global batch size:    16 | lm loss: 5.526597E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     4171/  128728 | consumed samples:        66736 | consumed tokens:    136675328 | elapsed time per iteration (s): 15.16 | learning rate: 2.187E-05 | global batch size:    16 | lm loss: 5.368105E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4172/  128728 | consumed samples:        66752 | consumed tokens:    136708096 | elapsed time per iteration (s): 15.18 | learning rate: 2.187E-05 | global batch size:    16 | lm loss: 5.369236E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4173/  128728 | consumed samples:        66768 | consumed tokens:    136740864 | elapsed time per iteration (s): 15.16 | learning rate: 2.188E-05 | global batch size:    16 | lm loss: 5.310276E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4174/  128728 | consumed samples:        66784 | consumed tokens:    136773632 | elapsed time per iteration (s): 15.20 | learning rate: 2.188E-05 | global batch size:    16 | lm loss: 5.394924E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4175/  128728 | consumed samples:        66800 | consumed tokens:    136806400 | elapsed time per iteration (s): 15.20 | learning rate: 2.189E-05 | global batch size:    16 | lm loss: 5.482425E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4176/  128728 | consumed samples:        66816 | consumed tokens:    136839168 | elapsed time per iteration (s): 15.26 | learning rate: 2.189E-05 | global batch size:    16 | lm loss: 5.236983E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4177/  128728 | consumed samples:        66832 | consumed tokens:    136871936 | elapsed time per iteration (s): 15.21 | learning rate: 2.190E-05 | global batch size:    16 | lm loss: 5.265677E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4178/  128728 | consumed samples:        66848 | consumed tokens:    136904704 | elapsed time per iteration (s): 15.27 | learning rate: 2.190E-05 | global batch size:    16 | lm loss: 5.455791E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     4179/  128728 | consumed samples:        66864 | consumed tokens:    136937472 | elapsed time per iteration (s): 15.19 | learning rate: 2.191E-05 | global batch size:    16 | lm loss: 5.398285E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4180/  128728 | consumed samples:        66880 | consumed tokens:    136970240 | elapsed time per iteration (s): 15.27 | learning rate: 2.192E-05 | global batch size:    16 | lm loss: 5.405532E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     4181/  128728 | consumed samples:        66896 | consumed tokens:    137003008 | elapsed time per iteration (s): 15.19 | learning rate: 2.192E-05 | global batch size:    16 | lm loss: 5.397069E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     4182/  128728 | consumed samples:        66912 | consumed tokens:    137035776 | elapsed time per iteration (s): 15.15 | learning rate: 2.193E-05 | global batch size:    16 | lm loss: 5.299715E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     4183/  128728 | consumed samples:        66928 | consumed tokens:    137068544 | elapsed time per iteration (s): 15.15 | learning rate: 2.193E-05 | global batch size:    16 | lm loss: 5.255721E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     4184/  128728 | consumed samples:        66944 | consumed tokens:    137101312 | elapsed time per iteration (s): 15.17 | learning rate: 2.194E-05 | global batch size:    16 | lm loss: 5.327215E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     4185/  128728 | consumed samples:        66960 | consumed tokens:    137134080 | elapsed time per iteration (s): 15.22 | learning rate: 2.194E-05 | global batch size:    16 | lm loss: 5.332559E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4186/  128728 | consumed samples:        66976 | consumed tokens:    137166848 | elapsed time per iteration (s): 15.21 | learning rate: 2.195E-05 | global batch size:    16 | lm loss: 5.161037E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4187/  128728 | consumed samples:        66992 | consumed tokens:    137199616 | elapsed time per iteration (s): 15.17 | learning rate: 2.195E-05 | global batch size:    16 | lm loss: 5.237501E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4188/  128728 | consumed samples:        67008 | consumed tokens:    137232384 | elapsed time per iteration (s): 15.19 | learning rate: 2.196E-05 | global batch size:    16 | lm loss: 5.113091E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4189/  128728 | consumed samples:        67024 | consumed tokens:    137265152 | elapsed time per iteration (s): 15.19 | learning rate: 2.196E-05 | global batch size:    16 | lm loss: 5.165911E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4190/  128728 | consumed samples:        67040 | consumed tokens:    137297920 | elapsed time per iteration (s): 15.15 | learning rate: 2.197E-05 | global batch size:    16 | lm loss: 5.443858E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     4191/  128728 | consumed samples:        67056 | consumed tokens:    137330688 | elapsed time per iteration (s): 15.15 | learning rate: 2.197E-05 | global batch size:    16 | lm loss: 5.206475E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     4192/  128728 | consumed samples:        67072 | consumed tokens:    137363456 | elapsed time per iteration (s): 15.18 | learning rate: 2.198E-05 | global batch size:    16 | lm loss: 5.317861E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4193/  128728 | consumed samples:        67088 | consumed tokens:    137396224 | elapsed time per iteration (s): 15.21 | learning rate: 2.198E-05 | global batch size:    16 | lm loss: 5.115374E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4194/  128728 | consumed samples:        67104 | consumed tokens:    137428992 | elapsed time per iteration (s): 15.23 | learning rate: 2.199E-05 | global batch size:    16 | lm loss: 5.261423E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4195/  128728 | consumed samples:        67120 | consumed tokens:    137461760 | elapsed time per iteration (s): 15.24 | learning rate: 2.199E-05 | global batch size:    16 | lm loss: 5.248822E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4196/  128728 | consumed samples:        67136 | consumed tokens:    137494528 | elapsed time per iteration (s): 15.18 | learning rate: 2.200E-05 | global batch size:    16 | lm loss: 5.515530E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4197/  128728 | consumed samples:        67152 | consumed tokens:    137527296 | elapsed time per iteration (s): 15.18 | learning rate: 2.200E-05 | global batch size:    16 | lm loss: 5.335524E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4198/  128728 | consumed samples:        67168 | consumed tokens:    137560064 | elapsed time per iteration (s): 15.23 | learning rate: 2.201E-05 | global batch size:    16 | lm loss: 5.289407E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4199/  128728 | consumed samples:        67184 | consumed tokens:    137592832 | elapsed time per iteration (s): 15.24 | learning rate: 2.201E-05 | global batch size:    16 | lm loss: 5.457632E+00 | grad norm: 1.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4200/  128728 | consumed samples:        67200 | consumed tokens:    137625600 | elapsed time per iteration (s): 15.24 | learning rate: 2.202E-05 | global batch size:    16 | lm loss: 5.170599E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4201/  128728 | consumed samples:        67216 | consumed tokens:    137658368 | elapsed time per iteration (s): 15.21 | learning rate: 2.203E-05 | global batch size:    16 | lm loss: 5.229961E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4202/  128728 | consumed samples:        67232 | consumed tokens:    137691136 | elapsed time per iteration (s): 15.28 | learning rate: 2.203E-05 | global batch size:    16 | lm loss: 5.323138E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     4203/  128728 | consumed samples:        67248 | consumed tokens:    137723904 | elapsed time per iteration (s): 15.25 | learning rate: 2.204E-05 | global batch size:    16 | lm loss: 5.334191E+00 | grad norm: 2.369 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4204/  128728 | consumed samples:        67264 | consumed tokens:    137756672 | elapsed time per iteration (s): 15.20 | learning rate: 2.204E-05 | global batch size:    16 | lm loss: 5.436996E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4205/  128728 | consumed samples:        67280 | consumed tokens:    137789440 | elapsed time per iteration (s): 15.19 | learning rate: 2.205E-05 | global batch size:    16 | lm loss: 5.285421E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     4206/  128728 | consumed samples:        67296 | consumed tokens:    137822208 | elapsed time per iteration (s): 15.16 | learning rate: 2.205E-05 | global batch size:    16 | lm loss: 5.376272E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     4207/  128728 | consumed samples:        67312 | consumed tokens:    137854976 | elapsed time per iteration (s): 15.25 | learning rate: 2.206E-05 | global batch size:    16 | lm loss: 5.097405E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4208/  128728 | consumed samples:        67328 | consumed tokens:    137887744 | elapsed time per iteration (s): 15.19 | learning rate: 2.206E-05 | global batch size:    16 | lm loss: 5.426728E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4209/  128728 | consumed samples:        67344 | consumed tokens:    137920512 | elapsed time per iteration (s): 15.24 | learning rate: 2.207E-05 | global batch size:    16 | lm loss: 5.375102E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4210/  128728 | consumed samples:        67360 | consumed tokens:    137953280 | elapsed time per iteration (s): 15.21 | learning rate: 2.207E-05 | global batch size:    16 | lm loss: 5.303322E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4211/  128728 | consumed samples:        67376 | consumed tokens:    137986048 | elapsed time per iteration (s): 15.26 | learning rate: 2.208E-05 | global batch size:    16 | lm loss: 5.251130E+00 | grad norm: 1.012 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     4212/  128728 | consumed samples:        67392 | consumed tokens:    138018816 | elapsed time per iteration (s): 15.24 | learning rate: 2.208E-05 | global batch size:    16 | lm loss: 5.580229E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4213/  128728 | consumed samples:        67408 | consumed tokens:    138051584 | elapsed time per iteration (s): 15.23 | learning rate: 2.209E-05 | global batch size:    16 | lm loss: 5.441099E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4214/  128728 | consumed samples:        67424 | consumed tokens:    138084352 | elapsed time per iteration (s): 15.25 | learning rate: 2.209E-05 | global batch size:    16 | lm loss: 5.526458E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4215/  128728 | consumed samples:        67440 | consumed tokens:    138117120 | elapsed time per iteration (s): 15.19 | learning rate: 2.210E-05 | global batch size:    16 | lm loss: 5.415507E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4216/  128728 | consumed samples:        67456 | consumed tokens:    138149888 | elapsed time per iteration (s): 15.28 | learning rate: 2.210E-05 | global batch size:    16 | lm loss: 5.300536E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     4217/  128728 | consumed samples:        67472 | consumed tokens:    138182656 | elapsed time per iteration (s): 15.24 | learning rate: 2.211E-05 | global batch size:    16 | lm loss: 5.354405E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4218/  128728 | consumed samples:        67488 | consumed tokens:    138215424 | elapsed time per iteration (s): 15.22 | learning rate: 2.211E-05 | global batch size:    16 | lm loss: 5.247156E+00 | grad norm: 0.860 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4219/  128728 | consumed samples:        67504 | consumed tokens:    138248192 | elapsed time per iteration (s): 15.21 | learning rate: 2.212E-05 | global batch size:    16 | lm loss: 5.200278E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4220/  128728 | consumed samples:        67520 | consumed tokens:    138280960 | elapsed time per iteration (s): 15.21 | learning rate: 2.213E-05 | global batch size:    16 | lm loss: 5.260693E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4221/  128728 | consumed samples:        67536 | consumed tokens:    138313728 | elapsed time per iteration (s): 15.22 | learning rate: 2.213E-05 | global batch size:    16 | lm loss: 5.003216E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4222/  128728 | consumed samples:        67552 | consumed tokens:    138346496 | elapsed time per iteration (s): 15.21 | learning rate: 2.214E-05 | global batch size:    16 | lm loss: 5.429131E+00 | grad norm: 1.260 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4223/  128728 | consumed samples:        67568 | consumed tokens:    138379264 | elapsed time per iteration (s): 15.22 | learning rate: 2.214E-05 | global batch size:    16 | lm loss: 5.260954E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4224/  128728 | consumed samples:        67584 | consumed tokens:    138412032 | elapsed time per iteration (s): 15.20 | learning rate: 2.215E-05 | global batch size:    16 | lm loss: 5.218945E+00 | grad norm: 3.163 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4225/  128728 | consumed samples:        67600 | consumed tokens:    138444800 | elapsed time per iteration (s): 15.25 | learning rate: 2.215E-05 | global batch size:    16 | lm loss: 5.612597E+00 | grad norm: 1.303 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4226/  128728 | consumed samples:        67616 | consumed tokens:    138477568 | elapsed time per iteration (s): 15.22 | learning rate: 2.216E-05 | global batch size:    16 | lm loss: 5.233457E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4227/  128728 | consumed samples:        67632 | consumed tokens:    138510336 | elapsed time per iteration (s): 15.20 | learning rate: 2.216E-05 | global batch size:    16 | lm loss: 5.089907E+00 | grad norm: 0.970 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4228/  128728 | consumed samples:        67648 | consumed tokens:    138543104 | elapsed time per iteration (s): 15.25 | learning rate: 2.217E-05 | global batch size:    16 | lm loss: 5.520075E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4229/  128728 | consumed samples:        67664 | consumed tokens:    138575872 | elapsed time per iteration (s): 15.19 | learning rate: 2.217E-05 | global batch size:    16 | lm loss: 5.053356E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4230/  128728 | consumed samples:        67680 | consumed tokens:    138608640 | elapsed time per iteration (s): 15.24 | learning rate: 2.218E-05 | global batch size:    16 | lm loss: 5.406147E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4231/  128728 | consumed samples:        67696 | consumed tokens:    138641408 | elapsed time per iteration (s): 15.20 | learning rate: 2.218E-05 | global batch size:    16 | lm loss: 5.376842E+00 | grad norm: 0.913 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4232/  128728 | consumed samples:        67712 | consumed tokens:    138674176 | elapsed time per iteration (s): 15.21 | learning rate: 2.219E-05 | global batch size:    16 | lm loss: 5.179604E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4233/  128728 | consumed samples:        67728 | consumed tokens:    138706944 | elapsed time per iteration (s): 15.24 | learning rate: 2.219E-05 | global batch size:    16 | lm loss: 5.316276E+00 | grad norm: 1.087 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4234/  128728 | consumed samples:        67744 | consumed tokens:    138739712 | elapsed time per iteration (s): 15.20 | learning rate: 2.220E-05 | global batch size:    16 | lm loss: 5.243623E+00 | grad norm: 1.276 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4235/  128728 | consumed samples:        67760 | consumed tokens:    138772480 | elapsed time per iteration (s): 15.23 | learning rate: 2.220E-05 | global batch size:    16 | lm loss: 5.085675E+00 | grad norm: 1.324 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4236/  128728 | consumed samples:        67776 | consumed tokens:    138805248 | elapsed time per iteration (s): 15.24 | learning rate: 2.221E-05 | global batch size:    16 | lm loss: 5.159794E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4237/  128728 | consumed samples:        67792 | consumed tokens:    138838016 | elapsed time per iteration (s): 15.17 | learning rate: 2.221E-05 | global batch size:    16 | lm loss: 5.064829E+00 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4238/  128728 | consumed samples:        67808 | consumed tokens:    138870784 | elapsed time per iteration (s): 15.22 | learning rate: 2.222E-05 | global batch size:    16 | lm loss: 5.373168E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4239/  128728 | consumed samples:        67824 | consumed tokens:    138903552 | elapsed time per iteration (s): 15.25 | learning rate: 2.222E-05 | global batch size:    16 | lm loss: 5.072435E+00 | grad norm: 0.663 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4240/  128728 | consumed samples:        67840 | consumed tokens:    138936320 | elapsed time per iteration (s): 15.23 | learning rate: 2.223E-05 | global batch size:    16 | lm loss: 5.378523E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4241/  128728 | consumed samples:        67856 | consumed tokens:    138969088 | elapsed time per iteration (s): 15.24 | learning rate: 2.224E-05 | global batch size:    16 | lm loss: 5.313819E+00 | grad norm: 0.980 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4242/  128728 | consumed samples:        67872 | consumed tokens:    139001856 | elapsed time per iteration (s): 15.22 | learning rate: 2.224E-05 | global batch size:    16 | lm loss: 5.239834E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4243/  128728 | consumed samples:        67888 | consumed tokens:    139034624 | elapsed time per iteration (s): 15.25 | learning rate: 2.225E-05 | global batch size:    16 | lm loss: 5.422865E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4244/  128728 | consumed samples:        67904 | consumed tokens:    139067392 | elapsed time per iteration (s): 15.15 | learning rate: 2.225E-05 | global batch size:    16 | lm loss: 5.566054E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     4245/  128728 | consumed samples:        67920 | consumed tokens:    139100160 | elapsed time per iteration (s): 15.15 | learning rate: 2.226E-05 | global batch size:    16 | lm loss: 5.035063E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     4246/  128728 | consumed samples:        67936 | consumed tokens:    139132928 | elapsed time per iteration (s): 15.27 | learning rate: 2.226E-05 | global batch size:    16 | lm loss: 5.305432E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     4247/  128728 | consumed samples:        67952 | consumed tokens:    139165696 | elapsed time per iteration (s): 15.17 | learning rate: 2.227E-05 | global batch size:    16 | lm loss: 5.268905E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     4248/  128728 | consumed samples:        67968 | consumed tokens:    139198464 | elapsed time per iteration (s): 15.17 | learning rate: 2.227E-05 | global batch size:    16 | lm loss: 5.425210E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4249/  128728 | consumed samples:        67984 | consumed tokens:    139231232 | elapsed time per iteration (s): 15.18 | learning rate: 2.228E-05 | global batch size:    16 | lm loss: 5.495585E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4250/  128728 | consumed samples:        68000 | consumed tokens:    139264000 | elapsed time per iteration (s): 15.15 | learning rate: 2.228E-05 | global batch size:    16 | lm loss: 5.291341E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     4251/  128728 | consumed samples:        68016 | consumed tokens:    139296768 | elapsed time per iteration (s): 15.18 | learning rate: 2.229E-05 | global batch size:    16 | lm loss: 5.175395E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4252/  128728 | consumed samples:        68032 | consumed tokens:    139329536 | elapsed time per iteration (s): 15.22 | learning rate: 2.229E-05 | global batch size:    16 | lm loss: 5.436680E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4253/  128728 | consumed samples:        68048 | consumed tokens:    139362304 | elapsed time per iteration (s): 15.21 | learning rate: 2.230E-05 | global batch size:    16 | lm loss: 5.096869E+00 | grad norm: 2.073 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4254/  128728 | consumed samples:        68064 | consumed tokens:    139395072 | elapsed time per iteration (s): 15.25 | learning rate: 2.230E-05 | global batch size:    16 | lm loss: 5.111172E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4255/  128728 | consumed samples:        68080 | consumed tokens:    139427840 | elapsed time per iteration (s): 15.22 | learning rate: 2.231E-05 | global batch size:    16 | lm loss: 5.014842E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4256/  128728 | consumed samples:        68096 | consumed tokens:    139460608 | elapsed time per iteration (s): 15.17 | learning rate: 2.231E-05 | global batch size:    16 | lm loss: 5.302151E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4257/  128728 | consumed samples:        68112 | consumed tokens:    139493376 | elapsed time per iteration (s): 15.18 | learning rate: 2.232E-05 | global batch size:    16 | lm loss: 5.351344E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4258/  128728 | consumed samples:        68128 | consumed tokens:    139526144 | elapsed time per iteration (s): 15.23 | learning rate: 2.232E-05 | global batch size:    16 | lm loss: 5.253459E+00 | grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4259/  128728 | consumed samples:        68144 | consumed tokens:    139558912 | elapsed time per iteration (s): 15.15 | learning rate: 2.233E-05 | global batch size:    16 | lm loss: 5.244567E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     4260/  128728 | consumed samples:        68160 | consumed tokens:    139591680 | elapsed time per iteration (s): 15.25 | learning rate: 2.233E-05 | global batch size:    16 | lm loss: 5.337202E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4261/  128728 | consumed samples:        68176 | consumed tokens:    139624448 | elapsed time per iteration (s): 15.20 | learning rate: 2.234E-05 | global batch size:    16 | lm loss: 5.356158E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4262/  128728 | consumed samples:        68192 | consumed tokens:    139657216 | elapsed time per iteration (s): 15.25 | learning rate: 2.235E-05 | global batch size:    16 | lm loss: 5.314350E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4263/  128728 | consumed samples:        68208 | consumed tokens:    139689984 | elapsed time per iteration (s): 15.23 | learning rate: 2.235E-05 | global batch size:    16 | lm loss: 5.277968E+00 | grad norm: 0.867 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4264/  128728 | consumed samples:        68224 | consumed tokens:    139722752 | elapsed time per iteration (s): 15.19 | learning rate: 2.236E-05 | global batch size:    16 | lm loss: 5.386879E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     4265/  128728 | consumed samples:        68240 | consumed tokens:    139755520 | elapsed time per iteration (s): 15.19 | learning rate: 2.236E-05 | global batch size:    16 | lm loss: 5.298102E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     4266/  128728 | consumed samples:        68256 | consumed tokens:    139788288 | elapsed time per iteration (s): 15.20 | learning rate: 2.237E-05 | global batch size:    16 | lm loss: 5.063458E+00 | grad norm: 1.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4267/  128728 | consumed samples:        68272 | consumed tokens:    139821056 | elapsed time per iteration (s): 15.22 | learning rate: 2.237E-05 | global batch size:    16 | lm loss: 5.150900E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4268/  128728 | consumed samples:        68288 | consumed tokens:    139853824 | elapsed time per iteration (s): 15.24 | learning rate: 2.238E-05 | global batch size:    16 | lm loss: 5.480645E+00 | grad norm: 1.452 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4269/  128728 | consumed samples:        68304 | consumed tokens:    139886592 | elapsed time per iteration (s): 15.21 | learning rate: 2.238E-05 | global batch size:    16 | lm loss: 5.393959E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4270/  128728 | consumed samples:        68320 | consumed tokens:    139919360 | elapsed time per iteration (s): 15.21 | learning rate: 2.239E-05 | global batch size:    16 | lm loss: 5.195272E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4271/  128728 | consumed samples:        68336 | consumed tokens:    139952128 | elapsed time per iteration (s): 15.22 | learning rate: 2.239E-05 | global batch size:    16 | lm loss: 5.329949E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4272/  128728 | consumed samples:        68352 | consumed tokens:    139984896 | elapsed time per iteration (s): 15.20 | learning rate: 2.240E-05 | global batch size:    16 | lm loss: 5.188565E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4273/  128728 | consumed samples:        68368 | consumed tokens:    140017664 | elapsed time per iteration (s): 15.23 | learning rate: 2.240E-05 | global batch size:    16 | lm loss: 5.395569E+00 | grad norm: 0.996 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4274/  128728 | consumed samples:        68384 | consumed tokens:    140050432 | elapsed time per iteration (s): 15.23 | learning rate: 2.241E-05 | global batch size:    16 | lm loss: 5.257808E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4275/  128728 | consumed samples:        68400 | consumed tokens:    140083200 | elapsed time per iteration (s): 15.23 | learning rate: 2.241E-05 | global batch size:    16 | lm loss: 5.396634E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4276/  128728 | consumed samples:        68416 | consumed tokens:    140115968 | elapsed time per iteration (s): 15.20 | learning rate: 2.242E-05 | global batch size:    16 | lm loss: 5.054380E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4277/  128728 | consumed samples:        68432 | consumed tokens:    140148736 | elapsed time per iteration (s): 15.23 | learning rate: 2.242E-05 | global batch size:    16 | lm loss: 5.394772E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4278/  128728 | consumed samples:        68448 | consumed tokens:    140181504 | elapsed time per iteration (s): 15.19 | learning rate: 2.243E-05 | global batch size:    16 | lm loss: 5.329741E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4279/  128728 | consumed samples:        68464 | consumed tokens:    140214272 | elapsed time per iteration (s): 15.23 | learning rate: 2.243E-05 | global batch size:    16 | lm loss: 5.123846E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4280/  128728 | consumed samples:        68480 | consumed tokens:    140247040 | elapsed time per iteration (s): 15.22 | learning rate: 2.244E-05 | global batch size:    16 | lm loss: 5.114894E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4281/  128728 | consumed samples:        68496 | consumed tokens:    140279808 | elapsed time per iteration (s): 15.24 | learning rate: 2.244E-05 | global batch size:    16 | lm loss: 5.329511E+00 | grad norm: 1.052 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4282/  128728 | consumed samples:        68512 | consumed tokens:    140312576 | elapsed time per iteration (s): 15.23 | learning rate: 2.245E-05 | global batch size:    16 | lm loss: 5.257269E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4283/  128728 | consumed samples:        68528 | consumed tokens:    140345344 | elapsed time per iteration (s): 15.22 | learning rate: 2.246E-05 | global batch size:    16 | lm loss: 5.189564E+00 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4284/  128728 | consumed samples:        68544 | consumed tokens:    140378112 | elapsed time per iteration (s): 15.19 | learning rate: 2.246E-05 | global batch size:    16 | lm loss: 5.470556E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     4285/  128728 | consumed samples:        68560 | consumed tokens:    140410880 | elapsed time per iteration (s): 15.21 | learning rate: 2.247E-05 | global batch size:    16 | lm loss: 5.387702E+00 | grad norm: 1.116 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4286/  128728 | consumed samples:        68576 | consumed tokens:    140443648 | elapsed time per iteration (s): 15.22 | learning rate: 2.247E-05 | global batch size:    16 | lm loss: 5.492844E+00 | grad norm: 1.051 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4287/  128728 | consumed samples:        68592 | consumed tokens:    140476416 | elapsed time per iteration (s): 15.22 | learning rate: 2.248E-05 | global batch size:    16 | lm loss: 5.419727E+00 | grad norm: 0.911 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4288/  128728 | consumed samples:        68608 | consumed tokens:    140509184 | elapsed time per iteration (s): 15.22 | learning rate: 2.248E-05 | global batch size:    16 | lm loss: 5.376180E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4289/  128728 | consumed samples:        68624 | consumed tokens:    140541952 | elapsed time per iteration (s): 15.21 | learning rate: 2.249E-05 | global batch size:    16 | lm loss: 5.231359E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4290/  128728 | consumed samples:        68640 | consumed tokens:    140574720 | elapsed time per iteration (s): 15.20 | learning rate: 2.249E-05 | global batch size:    16 | lm loss: 5.340625E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4291/  128728 | consumed samples:        68656 | consumed tokens:    140607488 | elapsed time per iteration (s): 15.14 | learning rate: 2.250E-05 | global batch size:    16 | lm loss: 5.693937E+00 | grad norm: 1.466 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     4292/  128728 | consumed samples:        68672 | consumed tokens:    140640256 | elapsed time per iteration (s): 15.21 | learning rate: 2.250E-05 | global batch size:    16 | lm loss: 5.231561E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4293/  128728 | consumed samples:        68688 | consumed tokens:    140673024 | elapsed time per iteration (s): 15.24 | learning rate: 2.251E-05 | global batch size:    16 | lm loss: 5.350264E+00 | grad norm: 1.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4294/  128728 | consumed samples:        68704 | consumed tokens:    140705792 | elapsed time per iteration (s): 15.21 | learning rate: 2.251E-05 | global batch size:    16 | lm loss: 5.243148E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4295/  128728 | consumed samples:        68720 | consumed tokens:    140738560 | elapsed time per iteration (s): 15.24 | learning rate: 2.252E-05 | global batch size:    16 | lm loss: 5.305950E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4296/  128728 | consumed samples:        68736 | consumed tokens:    140771328 | elapsed time per iteration (s): 15.20 | learning rate: 2.252E-05 | global batch size:    16 | lm loss: 5.412365E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4297/  128728 | consumed samples:        68752 | consumed tokens:    140804096 | elapsed time per iteration (s): 15.21 | learning rate: 2.253E-05 | global batch size:    16 | lm loss: 5.151298E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4298/  128728 | consumed samples:        68768 | consumed tokens:    140836864 | elapsed time per iteration (s): 15.16 | learning rate: 2.253E-05 | global batch size:    16 | lm loss: 5.339790E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     4299/  128728 | consumed samples:        68784 | consumed tokens:    140869632 | elapsed time per iteration (s): 15.22 | learning rate: 2.254E-05 | global batch size:    16 | lm loss: 5.416695E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4300/  128728 | consumed samples:        68800 | consumed tokens:    140902400 | elapsed time per iteration (s): 15.21 | learning rate: 2.254E-05 | global batch size:    16 | lm loss: 5.202350E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4301/  128728 | consumed samples:        68816 | consumed tokens:    140935168 | elapsed time per iteration (s): 15.20 | learning rate: 2.255E-05 | global batch size:    16 | lm loss: 5.075429E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4302/  128728 | consumed samples:        68832 | consumed tokens:    140967936 | elapsed time per iteration (s): 15.18 | learning rate: 2.255E-05 | global batch size:    16 | lm loss: 5.435070E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4303/  128728 | consumed samples:        68848 | consumed tokens:    141000704 | elapsed time per iteration (s): 15.22 | learning rate: 2.256E-05 | global batch size:    16 | lm loss: 5.356237E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4304/  128728 | consumed samples:        68864 | consumed tokens:    141033472 | elapsed time per iteration (s): 15.23 | learning rate: 2.257E-05 | global batch size:    16 | lm loss: 5.306829E+00 | grad norm: 0.635 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4305/  128728 | consumed samples:        68880 | consumed tokens:    141066240 | elapsed time per iteration (s): 15.22 | learning rate: 2.257E-05 | global batch size:    16 | lm loss: 5.368681E+00 | grad norm: 1.050 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4306/  128728 | consumed samples:        68896 | consumed tokens:    141099008 | elapsed time per iteration (s): 15.25 | learning rate: 2.258E-05 | global batch size:    16 | lm loss: 5.147976E+00 | grad norm: 1.436 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     4307/  128728 | consumed samples:        68912 | consumed tokens:    141131776 | elapsed time per iteration (s): 15.22 | learning rate: 2.258E-05 | global batch size:    16 | lm loss: 5.660544E+00 | grad norm: 1.063 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4308/  128728 | consumed samples:        68928 | consumed tokens:    141164544 | elapsed time per iteration (s): 15.18 | learning rate: 2.259E-05 | global batch size:    16 | lm loss: 5.237420E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4309/  128728 | consumed samples:        68944 | consumed tokens:    141197312 | elapsed time per iteration (s): 15.21 | learning rate: 2.259E-05 | global batch size:    16 | lm loss: 5.274828E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4310/  128728 | consumed samples:        68960 | consumed tokens:    141230080 | elapsed time per iteration (s): 15.22 | learning rate: 2.260E-05 | global batch size:    16 | lm loss: 5.353731E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4311/  128728 | consumed samples:        68976 | consumed tokens:    141262848 | elapsed time per iteration (s): 15.20 | learning rate: 2.260E-05 | global batch size:    16 | lm loss: 5.093883E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4312/  128728 | consumed samples:        68992 | consumed tokens:    141295616 | elapsed time per iteration (s): 15.22 | learning rate: 2.261E-05 | global batch size:    16 | lm loss: 5.287315E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4313/  128728 | consumed samples:        69008 | consumed tokens:    141328384 | elapsed time per iteration (s): 15.17 | learning rate: 2.261E-05 | global batch size:    16 | lm loss: 5.389223E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4314/  128728 | consumed samples:        69024 | consumed tokens:    141361152 | elapsed time per iteration (s): 15.20 | learning rate: 2.262E-05 | global batch size:    16 | lm loss: 5.299072E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4315/  128728 | consumed samples:        69040 | consumed tokens:    141393920 | elapsed time per iteration (s): 15.23 | learning rate: 2.262E-05 | global batch size:    16 | lm loss: 5.455530E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4316/  128728 | consumed samples:        69056 | consumed tokens:    141426688 | elapsed time per iteration (s): 15.24 | learning rate: 2.263E-05 | global batch size:    16 | lm loss: 5.203159E+00 | grad norm: 0.619 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4317/  128728 | consumed samples:        69072 | consumed tokens:    141459456 | elapsed time per iteration (s): 15.21 | learning rate: 2.263E-05 | global batch size:    16 | lm loss: 5.403166E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4318/  128728 | consumed samples:        69088 | consumed tokens:    141492224 | elapsed time per iteration (s): 15.28 | learning rate: 2.264E-05 | global batch size:    16 | lm loss: 5.332869E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     4319/  128728 | consumed samples:        69104 | consumed tokens:    141524992 | elapsed time per iteration (s): 15.24 | learning rate: 2.264E-05 | global batch size:    16 | lm loss: 5.213949E+00 | grad norm: 1.344 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4320/  128728 | consumed samples:        69120 | consumed tokens:    141557760 | elapsed time per iteration (s): 15.28 | learning rate: 2.265E-05 | global batch size:    16 | lm loss: 5.245769E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     4321/  128728 | consumed samples:        69136 | consumed tokens:    141590528 | elapsed time per iteration (s): 15.23 | learning rate: 2.265E-05 | global batch size:    16 | lm loss: 5.118613E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4322/  128728 | consumed samples:        69152 | consumed tokens:    141623296 | elapsed time per iteration (s): 15.14 | learning rate: 2.266E-05 | global batch size:    16 | lm loss: 5.262797E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     4323/  128728 | consumed samples:        69168 | consumed tokens:    141656064 | elapsed time per iteration (s): 15.15 | learning rate: 2.267E-05 | global batch size:    16 | lm loss: 5.311221E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     4324/  128728 | consumed samples:        69184 | consumed tokens:    141688832 | elapsed time per iteration (s): 15.21 | learning rate: 2.267E-05 | global batch size:    16 | lm loss: 5.268604E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4325/  128728 | consumed samples:        69200 | consumed tokens:    141721600 | elapsed time per iteration (s): 15.20 | learning rate: 2.268E-05 | global batch size:    16 | lm loss: 5.282369E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4326/  128728 | consumed samples:        69216 | consumed tokens:    141754368 | elapsed time per iteration (s): 15.23 | learning rate: 2.268E-05 | global batch size:    16 | lm loss: 5.319102E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4327/  128728 | consumed samples:        69232 | consumed tokens:    141787136 | elapsed time per iteration (s): 15.25 | learning rate: 2.269E-05 | global batch size:    16 | lm loss: 5.178657E+00 | grad norm: 0.922 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4328/  128728 | consumed samples:        69248 | consumed tokens:    141819904 | elapsed time per iteration (s): 15.22 | learning rate: 2.269E-05 | global batch size:    16 | lm loss: 5.155839E+00 | grad norm: 0.884 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4329/  128728 | consumed samples:        69264 | consumed tokens:    141852672 | elapsed time per iteration (s): 15.23 | learning rate: 2.270E-05 | global batch size:    16 | lm loss: 5.336724E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4330/  128728 | consumed samples:        69280 | consumed tokens:    141885440 | elapsed time per iteration (s): 15.26 | learning rate: 2.270E-05 | global batch size:    16 | lm loss: 5.267392E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4331/  128728 | consumed samples:        69296 | consumed tokens:    141918208 | elapsed time per iteration (s): 15.24 | learning rate: 2.271E-05 | global batch size:    16 | lm loss: 5.210427E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4332/  128728 | consumed samples:        69312 | consumed tokens:    141950976 | elapsed time per iteration (s): 15.20 | learning rate: 2.271E-05 | global batch size:    16 | lm loss: 5.287461E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4333/  128728 | consumed samples:        69328 | consumed tokens:    141983744 | elapsed time per iteration (s): 15.22 | learning rate: 2.272E-05 | global batch size:    16 | lm loss: 5.281710E+00 | grad norm: 0.632 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4334/  128728 | consumed samples:        69344 | consumed tokens:    142016512 | elapsed time per iteration (s): 15.20 | learning rate: 2.272E-05 | global batch size:    16 | lm loss: 5.312432E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4335/  128728 | consumed samples:        69360 | consumed tokens:    142049280 | elapsed time per iteration (s): 15.22 | learning rate: 2.273E-05 | global batch size:    16 | lm loss: 5.456483E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4336/  128728 | consumed samples:        69376 | consumed tokens:    142082048 | elapsed time per iteration (s): 15.20 | learning rate: 2.273E-05 | global batch size:    16 | lm loss: 4.999959E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4337/  128728 | consumed samples:        69392 | consumed tokens:    142114816 | elapsed time per iteration (s): 15.25 | learning rate: 2.274E-05 | global batch size:    16 | lm loss: 5.472358E+00 | grad norm: 0.987 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4338/  128728 | consumed samples:        69408 | consumed tokens:    142147584 | elapsed time per iteration (s): 15.22 | learning rate: 2.274E-05 | global batch size:    16 | lm loss: 5.075167E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4339/  128728 | consumed samples:        69424 | consumed tokens:    142180352 | elapsed time per iteration (s): 15.20 | learning rate: 2.275E-05 | global batch size:    16 | lm loss: 5.377363E+00 | grad norm: 0.643 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4340/  128728 | consumed samples:        69440 | consumed tokens:    142213120 | elapsed time per iteration (s): 15.23 | learning rate: 2.275E-05 | global batch size:    16 | lm loss: 5.143479E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4341/  128728 | consumed samples:        69456 | consumed tokens:    142245888 | elapsed time per iteration (s): 15.22 | learning rate: 2.276E-05 | global batch size:    16 | lm loss: 5.224275E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4342/  128728 | consumed samples:        69472 | consumed tokens:    142278656 | elapsed time per iteration (s): 15.21 | learning rate: 2.276E-05 | global batch size:    16 | lm loss: 4.925807E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4343/  128728 | consumed samples:        69488 | consumed tokens:    142311424 | elapsed time per iteration (s): 15.20 | learning rate: 2.277E-05 | global batch size:    16 | lm loss: 5.460708E+00 | grad norm: 0.648 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4344/  128728 | consumed samples:        69504 | consumed tokens:    142344192 | elapsed time per iteration (s): 15.23 | learning rate: 2.278E-05 | global batch size:    16 | lm loss: 5.483164E+00 | grad norm: 10.016 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4345/  128728 | consumed samples:        69520 | consumed tokens:    142376960 | elapsed time per iteration (s): 15.22 | learning rate: 2.278E-05 | global batch size:    16 | lm loss: 5.302545E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4346/  128728 | consumed samples:        69536 | consumed tokens:    142409728 | elapsed time per iteration (s): 15.22 | learning rate: 2.279E-05 | global batch size:    16 | lm loss: 5.222922E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4347/  128728 | consumed samples:        69552 | consumed tokens:    142442496 | elapsed time per iteration (s): 15.22 | learning rate: 2.279E-05 | global batch size:    16 | lm loss: 5.134640E+00 | grad norm: 4.857 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4348/  128728 | consumed samples:        69568 | consumed tokens:    142475264 | elapsed time per iteration (s): 15.22 | learning rate: 2.280E-05 | global batch size:    16 | lm loss: 5.309505E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4349/  128728 | consumed samples:        69584 | consumed tokens:    142508032 | elapsed time per iteration (s): 15.26 | learning rate: 2.280E-05 | global batch size:    16 | lm loss: 5.236284E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4350/  128728 | consumed samples:        69600 | consumed tokens:    142540800 | elapsed time per iteration (s): 15.24 | learning rate: 2.281E-05 | global batch size:    16 | lm loss: 5.381992E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4351/  128728 | consumed samples:        69616 | consumed tokens:    142573568 | elapsed time per iteration (s): 15.22 | learning rate: 2.281E-05 | global batch size:    16 | lm loss: 5.128081E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4352/  128728 | consumed samples:        69632 | consumed tokens:    142606336 | elapsed time per iteration (s): 15.22 | learning rate: 2.282E-05 | global batch size:    16 | lm loss: 5.119745E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4353/  128728 | consumed samples:        69648 | consumed tokens:    142639104 | elapsed time per iteration (s): 15.21 | learning rate: 2.282E-05 | global batch size:    16 | lm loss: 5.373334E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4354/  128728 | consumed samples:        69664 | consumed tokens:    142671872 | elapsed time per iteration (s): 15.24 | learning rate: 2.283E-05 | global batch size:    16 | lm loss: 5.252212E+00 | grad norm: 1.092 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4355/  128728 | consumed samples:        69680 | consumed tokens:    142704640 | elapsed time per iteration (s): 15.23 | learning rate: 2.283E-05 | global batch size:    16 | lm loss: 5.083073E+00 | grad norm: 1.281 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4356/  128728 | consumed samples:        69696 | consumed tokens:    142737408 | elapsed time per iteration (s): 15.23 | learning rate: 2.284E-05 | global batch size:    16 | lm loss: 5.302938E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4357/  128728 | consumed samples:        69712 | consumed tokens:    142770176 | elapsed time per iteration (s): 15.21 | learning rate: 2.284E-05 | global batch size:    16 | lm loss: 5.174490E+00 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4358/  128728 | consumed samples:        69728 | consumed tokens:    142802944 | elapsed time per iteration (s): 15.20 | learning rate: 2.285E-05 | global batch size:    16 | lm loss: 5.063147E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4359/  128728 | consumed samples:        69744 | consumed tokens:    142835712 | elapsed time per iteration (s): 15.22 | learning rate: 2.285E-05 | global batch size:    16 | lm loss: 5.283265E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4360/  128728 | consumed samples:        69760 | consumed tokens:    142868480 | elapsed time per iteration (s): 15.16 | learning rate: 2.286E-05 | global batch size:    16 | lm loss: 5.007393E+00 | grad norm: 1.027 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4361/  128728 | consumed samples:        69776 | consumed tokens:    142901248 | elapsed time per iteration (s): 15.22 | learning rate: 2.286E-05 | global batch size:    16 | lm loss: 5.309522E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4362/  128728 | consumed samples:        69792 | consumed tokens:    142934016 | elapsed time per iteration (s): 15.23 | learning rate: 2.287E-05 | global batch size:    16 | lm loss: 5.322911E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4363/  128728 | consumed samples:        69808 | consumed tokens:    142966784 | elapsed time per iteration (s): 15.22 | learning rate: 2.287E-05 | global batch size:    16 | lm loss: 4.989873E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4364/  128728 | consumed samples:        69824 | consumed tokens:    142999552 | elapsed time per iteration (s): 15.17 | learning rate: 2.288E-05 | global batch size:    16 | lm loss: 5.066822E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4365/  128728 | consumed samples:        69840 | consumed tokens:    143032320 | elapsed time per iteration (s): 15.26 | learning rate: 2.289E-05 | global batch size:    16 | lm loss: 5.063513E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     4366/  128728 | consumed samples:        69856 | consumed tokens:    143065088 | elapsed time per iteration (s): 15.24 | learning rate: 2.289E-05 | global batch size:    16 | lm loss: 5.041920E+00 | grad norm: 1.079 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4367/  128728 | consumed samples:        69872 | consumed tokens:    143097856 | elapsed time per iteration (s): 15.20 | learning rate: 2.290E-05 | global batch size:    16 | lm loss: 5.264553E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4368/  128728 | consumed samples:        69888 | consumed tokens:    143130624 | elapsed time per iteration (s): 15.25 | learning rate: 2.290E-05 | global batch size:    16 | lm loss: 5.502565E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4369/  128728 | consumed samples:        69904 | consumed tokens:    143163392 | elapsed time per iteration (s): 15.21 | learning rate: 2.291E-05 | global batch size:    16 | lm loss: 5.140605E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4370/  128728 | consumed samples:        69920 | consumed tokens:    143196160 | elapsed time per iteration (s): 15.20 | learning rate: 2.291E-05 | global batch size:    16 | lm loss: 5.729661E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4371/  128728 | consumed samples:        69936 | consumed tokens:    143228928 | elapsed time per iteration (s): 15.26 | learning rate: 2.292E-05 | global batch size:    16 | lm loss: 5.166599E+00 | grad norm: 1.097 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     4372/  128728 | consumed samples:        69952 | consumed tokens:    143261696 | elapsed time per iteration (s): 15.25 | learning rate: 2.292E-05 | global batch size:    16 | lm loss: 5.305621E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4373/  128728 | consumed samples:        69968 | consumed tokens:    143294464 | elapsed time per iteration (s): 15.14 | learning rate: 2.293E-05 | global batch size:    16 | lm loss: 5.239990E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     4374/  128728 | consumed samples:        69984 | consumed tokens:    143327232 | elapsed time per iteration (s): 15.22 | learning rate: 2.293E-05 | global batch size:    16 | lm loss: 5.457160E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4375/  128728 | consumed samples:        70000 | consumed tokens:    143360000 | elapsed time per iteration (s): 15.22 | learning rate: 2.294E-05 | global batch size:    16 | lm loss: 5.377467E+00 | grad norm: 1.225 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4376/  128728 | consumed samples:        70016 | consumed tokens:    143392768 | elapsed time per iteration (s): 15.20 | learning rate: 2.294E-05 | global batch size:    16 | lm loss: 5.137845E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4377/  128728 | consumed samples:        70032 | consumed tokens:    143425536 | elapsed time per iteration (s): 15.25 | learning rate: 2.295E-05 | global batch size:    16 | lm loss: 5.104263E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4378/  128728 | consumed samples:        70048 | consumed tokens:    143458304 | elapsed time per iteration (s): 15.20 | learning rate: 2.295E-05 | global batch size:    16 | lm loss: 5.039233E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4379/  128728 | consumed samples:        70064 | consumed tokens:    143491072 | elapsed time per iteration (s): 15.22 | learning rate: 2.296E-05 | global batch size:    16 | lm loss: 5.201389E+00 | grad norm: 1.449 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4380/  128728 | consumed samples:        70080 | consumed tokens:    143523840 | elapsed time per iteration (s): 15.27 | learning rate: 2.296E-05 | global batch size:    16 | lm loss: 5.206597E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     4381/  128728 | consumed samples:        70096 | consumed tokens:    143556608 | elapsed time per iteration (s): 15.20 | learning rate: 2.297E-05 | global batch size:    16 | lm loss: 5.222647E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4382/  128728 | consumed samples:        70112 | consumed tokens:    143589376 | elapsed time per iteration (s): 15.21 | learning rate: 2.297E-05 | global batch size:    16 | lm loss: 4.988690E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4383/  128728 | consumed samples:        70128 | consumed tokens:    143622144 | elapsed time per iteration (s): 15.19 | learning rate: 2.298E-05 | global batch size:    16 | lm loss: 5.317173E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4384/  128728 | consumed samples:        70144 | consumed tokens:    143654912 | elapsed time per iteration (s): 15.25 | learning rate: 2.298E-05 | global batch size:    16 | lm loss: 5.185295E+00 | grad norm: 0.669 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4385/  128728 | consumed samples:        70160 | consumed tokens:    143687680 | elapsed time per iteration (s): 15.17 | learning rate: 2.299E-05 | global batch size:    16 | lm loss: 5.370522E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4386/  128728 | consumed samples:        70176 | consumed tokens:    143720448 | elapsed time per iteration (s): 15.15 | learning rate: 2.300E-05 | global batch size:    16 | lm loss: 5.361063E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     4387/  128728 | consumed samples:        70192 | consumed tokens:    143753216 | elapsed time per iteration (s): 15.26 | learning rate: 2.300E-05 | global batch size:    16 | lm loss: 5.225990E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4388/  128728 | consumed samples:        70208 | consumed tokens:    143785984 | elapsed time per iteration (s): 15.26 | learning rate: 2.301E-05 | global batch size:    16 | lm loss: 5.465258E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     4389/  128728 | consumed samples:        70224 | consumed tokens:    143818752 | elapsed time per iteration (s): 15.20 | learning rate: 2.301E-05 | global batch size:    16 | lm loss: 5.258640E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4390/  128728 | consumed samples:        70240 | consumed tokens:    143851520 | elapsed time per iteration (s): 15.26 | learning rate: 2.302E-05 | global batch size:    16 | lm loss: 5.209820E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4391/  128728 | consumed samples:        70256 | consumed tokens:    143884288 | elapsed time per iteration (s): 15.21 | learning rate: 2.302E-05 | global batch size:    16 | lm loss: 4.884523E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4392/  128728 | consumed samples:        70272 | consumed tokens:    143917056 | elapsed time per iteration (s): 15.26 | learning rate: 2.303E-05 | global batch size:    16 | lm loss: 5.230456E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     4393/  128728 | consumed samples:        70288 | consumed tokens:    143949824 | elapsed time per iteration (s): 15.22 | learning rate: 2.303E-05 | global batch size:    16 | lm loss: 5.428142E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4394/  128728 | consumed samples:        70304 | consumed tokens:    143982592 | elapsed time per iteration (s): 15.19 | learning rate: 2.304E-05 | global batch size:    16 | lm loss: 5.217700E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4395/  128728 | consumed samples:        70320 | consumed tokens:    144015360 | elapsed time per iteration (s): 15.23 | learning rate: 2.304E-05 | global batch size:    16 | lm loss: 5.157529E+00 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4396/  128728 | consumed samples:        70336 | consumed tokens:    144048128 | elapsed time per iteration (s): 15.25 | learning rate: 2.305E-05 | global batch size:    16 | lm loss: 5.335325E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     4397/  128728 | consumed samples:        70352 | consumed tokens:    144080896 | elapsed time per iteration (s): 15.25 | learning rate: 2.305E-05 | global batch size:    16 | lm loss: 5.159653E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4398/  128728 | consumed samples:        70368 | consumed tokens:    144113664 | elapsed time per iteration (s): 15.24 | learning rate: 2.306E-05 | global batch size:    16 | lm loss: 5.342342E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4399/  128728 | consumed samples:        70384 | consumed tokens:    144146432 | elapsed time per iteration (s): 15.24 | learning rate: 2.306E-05 | global batch size:    16 | lm loss: 5.175276E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4400/  128728 | consumed samples:        70400 | consumed tokens:    144179200 | elapsed time per iteration (s): 15.23 | learning rate: 2.307E-05 | global batch size:    16 | lm loss: 5.433102E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4401/  128728 | consumed samples:        70416 | consumed tokens:    144211968 | elapsed time per iteration (s): 15.28 | learning rate: 2.307E-05 | global batch size:    16 | lm loss: 5.209073E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     4402/  128728 | consumed samples:        70432 | consumed tokens:    144244736 | elapsed time per iteration (s): 15.24 | learning rate: 2.308E-05 | global batch size:    16 | lm loss: 5.068762E+00 | grad norm: 0.881 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4403/  128728 | consumed samples:        70448 | consumed tokens:    144277504 | elapsed time per iteration (s): 15.17 | learning rate: 2.308E-05 | global batch size:    16 | lm loss: 5.463076E+00 | grad norm: 0.631 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4404/  128728 | consumed samples:        70464 | consumed tokens:    144310272 | elapsed time per iteration (s): 15.24 | learning rate: 2.309E-05 | global batch size:    16 | lm loss: 4.915584E+00 | grad norm: 1.283 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4405/  128728 | consumed samples:        70480 | consumed tokens:    144343040 | elapsed time per iteration (s): 15.21 | learning rate: 2.309E-05 | global batch size:    16 | lm loss: 5.192725E+00 | grad norm: 0.827 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4406/  128728 | consumed samples:        70496 | consumed tokens:    144375808 | elapsed time per iteration (s): 15.22 | learning rate: 2.310E-05 | global batch size:    16 | lm loss: 5.232658E+00 | grad norm: 0.647 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4407/  128728 | consumed samples:        70512 | consumed tokens:    144408576 | elapsed time per iteration (s): 15.23 | learning rate: 2.311E-05 | global batch size:    16 | lm loss: 4.972489E+00 | grad norm: 0.943 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4408/  128728 | consumed samples:        70528 | consumed tokens:    144441344 | elapsed time per iteration (s): 15.23 | learning rate: 2.311E-05 | global batch size:    16 | lm loss: 5.359754E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4409/  128728 | consumed samples:        70544 | consumed tokens:    144474112 | elapsed time per iteration (s): 15.23 | learning rate: 2.312E-05 | global batch size:    16 | lm loss: 5.230769E+00 | grad norm: 1.020 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4410/  128728 | consumed samples:        70560 | consumed tokens:    144506880 | elapsed time per iteration (s): 15.21 | learning rate: 2.312E-05 | global batch size:    16 | lm loss: 5.368015E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4411/  128728 | consumed samples:        70576 | consumed tokens:    144539648 | elapsed time per iteration (s): 15.23 | learning rate: 2.313E-05 | global batch size:    16 | lm loss: 5.025774E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4412/  128728 | consumed samples:        70592 | consumed tokens:    144572416 | elapsed time per iteration (s): 15.22 | learning rate: 2.313E-05 | global batch size:    16 | lm loss: 5.240927E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4413/  128728 | consumed samples:        70608 | consumed tokens:    144605184 | elapsed time per iteration (s): 15.25 | learning rate: 2.314E-05 | global batch size:    16 | lm loss: 5.289531E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4414/  128728 | consumed samples:        70624 | consumed tokens:    144637952 | elapsed time per iteration (s): 15.24 | learning rate: 2.314E-05 | global batch size:    16 | lm loss: 5.324119E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4415/  128728 | consumed samples:        70640 | consumed tokens:    144670720 | elapsed time per iteration (s): 15.18 | learning rate: 2.315E-05 | global batch size:    16 | lm loss: 5.208157E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4416/  128728 | consumed samples:        70656 | consumed tokens:    144703488 | elapsed time per iteration (s): 15.26 | learning rate: 2.315E-05 | global batch size:    16 | lm loss: 5.270568E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4417/  128728 | consumed samples:        70672 | consumed tokens:    144736256 | elapsed time per iteration (s): 15.24 | learning rate: 2.316E-05 | global batch size:    16 | lm loss: 4.967587E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4418/  128728 | consumed samples:        70688 | consumed tokens:    144769024 | elapsed time per iteration (s): 15.26 | learning rate: 2.316E-05 | global batch size:    16 | lm loss: 5.290422E+00 | grad norm: 1.096 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4419/  128728 | consumed samples:        70704 | consumed tokens:    144801792 | elapsed time per iteration (s): 15.24 | learning rate: 2.317E-05 | global batch size:    16 | lm loss: 5.287684E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4420/  128728 | consumed samples:        70720 | consumed tokens:    144834560 | elapsed time per iteration (s): 15.21 | learning rate: 2.317E-05 | global batch size:    16 | lm loss: 5.091879E+00 | grad norm: 1.074 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4421/  128728 | consumed samples:        70736 | consumed tokens:    144867328 | elapsed time per iteration (s): 15.20 | learning rate: 2.318E-05 | global batch size:    16 | lm loss: 5.381857E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4422/  128728 | consumed samples:        70752 | consumed tokens:    144900096 | elapsed time per iteration (s): 15.21 | learning rate: 2.318E-05 | global batch size:    16 | lm loss: 5.311369E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4423/  128728 | consumed samples:        70768 | consumed tokens:    144932864 | elapsed time per iteration (s): 15.24 | learning rate: 2.319E-05 | global batch size:    16 | lm loss: 4.966698E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4424/  128728 | consumed samples:        70784 | consumed tokens:    144965632 | elapsed time per iteration (s): 15.25 | learning rate: 2.319E-05 | global batch size:    16 | lm loss: 5.143426E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4425/  128728 | consumed samples:        70800 | consumed tokens:    144998400 | elapsed time per iteration (s): 15.25 | learning rate: 2.320E-05 | global batch size:    16 | lm loss: 4.974629E+00 | grad norm: 1.213 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4426/  128728 | consumed samples:        70816 | consumed tokens:    145031168 | elapsed time per iteration (s): 15.23 | learning rate: 2.321E-05 | global batch size:    16 | lm loss: 5.348171E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4427/  128728 | consumed samples:        70832 | consumed tokens:    145063936 | elapsed time per iteration (s): 15.19 | learning rate: 2.321E-05 | global batch size:    16 | lm loss: 5.526504E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4428/  128728 | consumed samples:        70848 | consumed tokens:    145096704 | elapsed time per iteration (s): 15.18 | learning rate: 2.322E-05 | global batch size:    16 | lm loss: 5.218261E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4429/  128728 | consumed samples:        70864 | consumed tokens:    145129472 | elapsed time per iteration (s): 15.26 | learning rate: 2.322E-05 | global batch size:    16 | lm loss: 5.136736E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4430/  128728 | consumed samples:        70880 | consumed tokens:    145162240 | elapsed time per iteration (s): 15.20 | learning rate: 2.323E-05 | global batch size:    16 | lm loss: 5.037167E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4431/  128728 | consumed samples:        70896 | consumed tokens:    145195008 | elapsed time per iteration (s): 15.24 | learning rate: 2.323E-05 | global batch size:    16 | lm loss: 5.275063E+00 | grad norm: 1.014 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4432/  128728 | consumed samples:        70912 | consumed tokens:    145227776 | elapsed time per iteration (s): 15.23 | learning rate: 2.324E-05 | global batch size:    16 | lm loss: 5.226987E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4433/  128728 | consumed samples:        70928 | consumed tokens:    145260544 | elapsed time per iteration (s): 15.22 | learning rate: 2.324E-05 | global batch size:    16 | lm loss: 5.212551E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4434/  128728 | consumed samples:        70944 | consumed tokens:    145293312 | elapsed time per iteration (s): 15.24 | learning rate: 2.325E-05 | global batch size:    16 | lm loss: 5.190849E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4435/  128728 | consumed samples:        70960 | consumed tokens:    145326080 | elapsed time per iteration (s): 15.25 | learning rate: 2.325E-05 | global batch size:    16 | lm loss: 5.322753E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4436/  128728 | consumed samples:        70976 | consumed tokens:    145358848 | elapsed time per iteration (s): 15.23 | learning rate: 2.326E-05 | global batch size:    16 | lm loss: 5.214334E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4437/  128728 | consumed samples:        70992 | consumed tokens:    145391616 | elapsed time per iteration (s): 15.20 | learning rate: 2.326E-05 | global batch size:    16 | lm loss: 5.383008E+00 | grad norm: 0.827 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4438/  128728 | consumed samples:        71008 | consumed tokens:    145424384 | elapsed time per iteration (s): 15.27 | learning rate: 2.327E-05 | global batch size:    16 | lm loss: 5.295764E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     4439/  128728 | consumed samples:        71024 | consumed tokens:    145457152 | elapsed time per iteration (s): 15.23 | learning rate: 2.327E-05 | global batch size:    16 | lm loss: 5.206472E+00 | grad norm: 1.348 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4440/  128728 | consumed samples:        71040 | consumed tokens:    145489920 | elapsed time per iteration (s): 15.25 | learning rate: 2.328E-05 | global batch size:    16 | lm loss: 5.287164E+00 | grad norm: 1.248 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4441/  128728 | consumed samples:        71056 | consumed tokens:    145522688 | elapsed time per iteration (s): 15.27 | learning rate: 2.328E-05 | global batch size:    16 | lm loss: 5.455941E+00 | grad norm: 2.619 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     4442/  128728 | consumed samples:        71072 | consumed tokens:    145555456 | elapsed time per iteration (s): 15.21 | learning rate: 2.329E-05 | global batch size:    16 | lm loss: 5.322211E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4443/  128728 | consumed samples:        71088 | consumed tokens:    145588224 | elapsed time per iteration (s): 15.27 | learning rate: 2.329E-05 | global batch size:    16 | lm loss: 4.969383E+00 | grad norm: 1.570 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     4444/  128728 | consumed samples:        71104 | consumed tokens:    145620992 | elapsed time per iteration (s): 15.18 | learning rate: 2.330E-05 | global batch size:    16 | lm loss: 5.289163E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4445/  128728 | consumed samples:        71120 | consumed tokens:    145653760 | elapsed time per iteration (s): 15.23 | learning rate: 2.330E-05 | global batch size:    16 | lm loss: 5.375591E+00 | grad norm: 1.066 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4446/  128728 | consumed samples:        71136 | consumed tokens:    145686528 | elapsed time per iteration (s): 15.22 | learning rate: 2.331E-05 | global batch size:    16 | lm loss: 5.250404E+00 | grad norm: 1.054 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4447/  128728 | consumed samples:        71152 | consumed tokens:    145719296 | elapsed time per iteration (s): 15.21 | learning rate: 2.332E-05 | global batch size:    16 | lm loss: 5.128370E+00 | grad norm: 1.126 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4448/  128728 | consumed samples:        71168 | consumed tokens:    145752064 | elapsed time per iteration (s): 15.27 | learning rate: 2.332E-05 | global batch size:    16 | lm loss: 5.044110E+00 | grad norm: 1.051 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     4449/  128728 | consumed samples:        71184 | consumed tokens:    145784832 | elapsed time per iteration (s): 15.20 | learning rate: 2.333E-05 | global batch size:    16 | lm loss: 5.220716E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4450/  128728 | consumed samples:        71200 | consumed tokens:    145817600 | elapsed time per iteration (s): 15.23 | learning rate: 2.333E-05 | global batch size:    16 | lm loss: 5.262385E+00 | grad norm: 1.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4451/  128728 | consumed samples:        71216 | consumed tokens:    145850368 | elapsed time per iteration (s): 15.30 | learning rate: 2.334E-05 | global batch size:    16 | lm loss: 5.220573E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.045 | TFLOPs: 8.00 |
[default7]: iteration     4452/  128728 | consumed samples:        71232 | consumed tokens:    145883136 | elapsed time per iteration (s): 15.24 | learning rate: 2.334E-05 | global batch size:    16 | lm loss: 5.203768E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4453/  128728 | consumed samples:        71248 | consumed tokens:    145915904 | elapsed time per iteration (s): 15.18 | learning rate: 2.335E-05 | global batch size:    16 | lm loss: 5.269801E+00 | grad norm: 0.649 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4454/  128728 | consumed samples:        71264 | consumed tokens:    145948672 | elapsed time per iteration (s): 15.20 | learning rate: 2.335E-05 | global batch size:    16 | lm loss: 5.050187E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4455/  128728 | consumed samples:        71280 | consumed tokens:    145981440 | elapsed time per iteration (s): 15.20 | learning rate: 2.336E-05 | global batch size:    16 | lm loss: 5.117810E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4456/  128728 | consumed samples:        71296 | consumed tokens:    146014208 | elapsed time per iteration (s): 15.23 | learning rate: 2.336E-05 | global batch size:    16 | lm loss: 5.348668E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4457/  128728 | consumed samples:        71312 | consumed tokens:    146046976 | elapsed time per iteration (s): 15.20 | learning rate: 2.337E-05 | global batch size:    16 | lm loss: 5.208596E+00 | grad norm: 0.874 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4458/  128728 | consumed samples:        71328 | consumed tokens:    146079744 | elapsed time per iteration (s): 15.21 | learning rate: 2.337E-05 | global batch size:    16 | lm loss: 5.182173E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4459/  128728 | consumed samples:        71344 | consumed tokens:    146112512 | elapsed time per iteration (s): 15.20 | learning rate: 2.338E-05 | global batch size:    16 | lm loss: 5.112963E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4460/  128728 | consumed samples:        71360 | consumed tokens:    146145280 | elapsed time per iteration (s): 15.18 | learning rate: 2.338E-05 | global batch size:    16 | lm loss: 5.317193E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4461/  128728 | consumed samples:        71376 | consumed tokens:    146178048 | elapsed time per iteration (s): 15.22 | learning rate: 2.339E-05 | global batch size:    16 | lm loss: 5.396627E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4462/  128728 | consumed samples:        71392 | consumed tokens:    146210816 | elapsed time per iteration (s): 15.24 | learning rate: 2.339E-05 | global batch size:    16 | lm loss: 5.153749E+00 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4463/  128728 | consumed samples:        71408 | consumed tokens:    146243584 | elapsed time per iteration (s): 15.20 | learning rate: 2.340E-05 | global batch size:    16 | lm loss: 4.955423E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4464/  128728 | consumed samples:        71424 | consumed tokens:    146276352 | elapsed time per iteration (s): 15.20 | learning rate: 2.340E-05 | global batch size:    16 | lm loss: 5.183180E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4465/  128728 | consumed samples:        71440 | consumed tokens:    146309120 | elapsed time per iteration (s): 15.23 | learning rate: 2.341E-05 | global batch size:    16 | lm loss: 5.210735E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4466/  128728 | consumed samples:        71456 | consumed tokens:    146341888 | elapsed time per iteration (s): 15.19 | learning rate: 2.341E-05 | global batch size:    16 | lm loss: 5.146707E+00 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4467/  128728 | consumed samples:        71472 | consumed tokens:    146374656 | elapsed time per iteration (s): 15.21 | learning rate: 2.342E-05 | global batch size:    16 | lm loss: 5.419012E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4468/  128728 | consumed samples:        71488 | consumed tokens:    146407424 | elapsed time per iteration (s): 15.23 | learning rate: 2.343E-05 | global batch size:    16 | lm loss: 4.935767E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4469/  128728 | consumed samples:        71504 | consumed tokens:    146440192 | elapsed time per iteration (s): 15.22 | learning rate: 2.343E-05 | global batch size:    16 | lm loss: 5.208894E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4470/  128728 | consumed samples:        71520 | consumed tokens:    146472960 | elapsed time per iteration (s): 15.31 | learning rate: 2.344E-05 | global batch size:    16 | lm loss: 5.157829E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.045 | TFLOPs: 8.00 |
[default7]: iteration     4471/  128728 | consumed samples:        71536 | consumed tokens:    146505728 | elapsed time per iteration (s): 15.20 | learning rate: 2.344E-05 | global batch size:    16 | lm loss: 5.193877E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4472/  128728 | consumed samples:        71552 | consumed tokens:    146538496 | elapsed time per iteration (s): 15.18 | learning rate: 2.345E-05 | global batch size:    16 | lm loss: 5.028915E+00 | grad norm: 1.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4473/  128728 | consumed samples:        71568 | consumed tokens:    146571264 | elapsed time per iteration (s): 15.20 | learning rate: 2.345E-05 | global batch size:    16 | lm loss: 5.195506E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4474/  128728 | consumed samples:        71584 | consumed tokens:    146604032 | elapsed time per iteration (s): 15.20 | learning rate: 2.346E-05 | global batch size:    16 | lm loss: 5.137490E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4475/  128728 | consumed samples:        71600 | consumed tokens:    146636800 | elapsed time per iteration (s): 15.21 | learning rate: 2.346E-05 | global batch size:    16 | lm loss: 5.193035E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4476/  128728 | consumed samples:        71616 | consumed tokens:    146669568 | elapsed time per iteration (s): 15.22 | learning rate: 2.347E-05 | global batch size:    16 | lm loss: 5.103290E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4477/  128728 | consumed samples:        71632 | consumed tokens:    146702336 | elapsed time per iteration (s): 15.28 | learning rate: 2.347E-05 | global batch size:    16 | lm loss: 5.285797E+00 | grad norm: 1.021 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     4478/  128728 | consumed samples:        71648 | consumed tokens:    146735104 | elapsed time per iteration (s): 15.21 | learning rate: 2.348E-05 | global batch size:    16 | lm loss: 5.268842E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4479/  128728 | consumed samples:        71664 | consumed tokens:    146767872 | elapsed time per iteration (s): 15.21 | learning rate: 2.348E-05 | global batch size:    16 | lm loss: 5.037316E+00 | grad norm: 0.663 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4480/  128728 | consumed samples:        71680 | consumed tokens:    146800640 | elapsed time per iteration (s): 15.23 | learning rate: 2.349E-05 | global batch size:    16 | lm loss: 5.297882E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4481/  128728 | consumed samples:        71696 | consumed tokens:    146833408 | elapsed time per iteration (s): 15.21 | learning rate: 2.349E-05 | global batch size:    16 | lm loss: 5.265193E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4482/  128728 | consumed samples:        71712 | consumed tokens:    146866176 | elapsed time per iteration (s): 15.23 | learning rate: 2.350E-05 | global batch size:    16 | lm loss: 5.264235E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4483/  128728 | consumed samples:        71728 | consumed tokens:    146898944 | elapsed time per iteration (s): 15.22 | learning rate: 2.350E-05 | global batch size:    16 | lm loss: 5.150328E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4484/  128728 | consumed samples:        71744 | consumed tokens:    146931712 | elapsed time per iteration (s): 15.21 | learning rate: 2.351E-05 | global batch size:    16 | lm loss: 5.206596E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4485/  128728 | consumed samples:        71760 | consumed tokens:    146964480 | elapsed time per iteration (s): 15.20 | learning rate: 2.351E-05 | global batch size:    16 | lm loss: 5.173460E+00 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4486/  128728 | consumed samples:        71776 | consumed tokens:    146997248 | elapsed time per iteration (s): 15.23 | learning rate: 2.352E-05 | global batch size:    16 | lm loss: 5.111909E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4487/  128728 | consumed samples:        71792 | consumed tokens:    147030016 | elapsed time per iteration (s): 15.21 | learning rate: 2.352E-05 | global batch size:    16 | lm loss: 5.359019E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4488/  128728 | consumed samples:        71808 | consumed tokens:    147062784 | elapsed time per iteration (s): 15.21 | learning rate: 2.353E-05 | global batch size:    16 | lm loss: 5.182050E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4489/  128728 | consumed samples:        71824 | consumed tokens:    147095552 | elapsed time per iteration (s): 15.20 | learning rate: 2.354E-05 | global batch size:    16 | lm loss: 5.222503E+00 | grad norm: 2.350 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4490/  128728 | consumed samples:        71840 | consumed tokens:    147128320 | elapsed time per iteration (s): 15.24 | learning rate: 2.354E-05 | global batch size:    16 | lm loss: 5.156126E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4491/  128728 | consumed samples:        71856 | consumed tokens:    147161088 | elapsed time per iteration (s): 15.27 | learning rate: 2.355E-05 | global batch size:    16 | lm loss: 5.105990E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     4492/  128728 | consumed samples:        71872 | consumed tokens:    147193856 | elapsed time per iteration (s): 15.24 | learning rate: 2.355E-05 | global batch size:    16 | lm loss: 5.223593E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4493/  128728 | consumed samples:        71888 | consumed tokens:    147226624 | elapsed time per iteration (s): 15.22 | learning rate: 2.356E-05 | global batch size:    16 | lm loss: 5.256061E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4494/  128728 | consumed samples:        71904 | consumed tokens:    147259392 | elapsed time per iteration (s): 15.23 | learning rate: 2.356E-05 | global batch size:    16 | lm loss: 5.453145E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4495/  128728 | consumed samples:        71920 | consumed tokens:    147292160 | elapsed time per iteration (s): 15.15 | learning rate: 2.357E-05 | global batch size:    16 | lm loss: 5.212368E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.09 |
[default7]: iteration     4496/  128728 | consumed samples:        71936 | consumed tokens:    147324928 | elapsed time per iteration (s): 15.21 | learning rate: 2.357E-05 | global batch size:    16 | lm loss: 4.998689E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4497/  128728 | consumed samples:        71952 | consumed tokens:    147357696 | elapsed time per iteration (s): 15.22 | learning rate: 2.358E-05 | global batch size:    16 | lm loss: 5.206701E+00 | grad norm: 1.081 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4498/  128728 | consumed samples:        71968 | consumed tokens:    147390464 | elapsed time per iteration (s): 15.25 | learning rate: 2.358E-05 | global batch size:    16 | lm loss: 5.252453E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4499/  128728 | consumed samples:        71984 | consumed tokens:    147423232 | elapsed time per iteration (s): 15.21 | learning rate: 2.359E-05 | global batch size:    16 | lm loss: 5.190476E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4500/  128728 | consumed samples:        72000 | consumed tokens:    147456000 | elapsed time per iteration (s): 15.22 | learning rate: 2.359E-05 | global batch size:    16 | lm loss: 5.268880E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default0]:saving checkpoint at iteration    4500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default1]:[2022-03-04 01:02:52,316] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/mp_rank_01_model_states.pt
[default0]:[2022-03-04 01:02:52,625] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/mp_rank_00_model_states.pt
[default1]:[2022-03-04 01:03:04,623] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default5]:[2022-03-04 01:03:04,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt
[default3]:[2022-03-04 01:03:05,017] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default1]:[2022-03-04 01:03:05,300] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt
[default7]:[2022-03-04 01:03:05,280] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt
[default0]:[2022-03-04 01:03:05,304] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt
[default2]:[2022-03-04 01:03:05,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt
[default6]:[2022-03-04 01:03:05,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt
[default7]:[2022-03-04 01:03:05,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt
[default6]:[2022-03-04 01:03:05,491] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt
[default0]:[2022-03-04 01:03:05,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt
[default2]:[2022-03-04 01:03:05,604] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default7]:[2022-03-04 01:03:05,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt
[default6]:[2022-03-04 01:03:05,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt
[default5]:[2022-03-04 01:03:05,778] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default4]:[2022-03-04 01:03:05,865] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt
[default2]:[2022-03-04 01:03:05,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt
[default3]:[2022-03-04 01:03:05,873] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt
[default3]:[2022-03-04 01:03:05,892] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default5]:[2022-03-04 01:03:05,952] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt
[default4]:[2022-03-04 01:03:06,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt
[default2]:[2022-03-04 01:03:06,093] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt
[default0]:[2022-03-04 01:03:06,121] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt
[default1]:[2022-03-04 01:03:06,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt
[default0]:[2022-03-04 01:03:06,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt
[default4]:[2022-03-04 01:03:06,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default0]:[2022-03-04 01:03:06,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt
[default5]:[2022-03-04 01:03:06,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt
[default3]:[2022-03-04 01:03:06,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt
[default1]:[2022-03-04 01:03:06,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt
[default4]:[2022-03-04 01:03:06,829] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt
[default6]:[2022-03-04 01:03:06,850] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt
[default4]:[2022-03-04 01:03:06,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt
[default7]:[2022-03-04 01:03:07,004] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt
[default5]:[2022-03-04 01:03:07,009] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt
[default4]:[2022-03-04 01:03:07,039] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt
[default6]:[2022-03-04 01:03:07,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt
[default7]:[2022-03-04 01:03:07,116] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt
[default0]:[2022-03-04 01:03:07,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt
[default5]:[2022-03-04 01:03:07,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default2]:[2022-03-04 01:03:07,338] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt
[default0]:[2022-03-04 01:03:07,412] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
[default4]:[2022-03-04 01:03:07,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt
[default6]:[2022-03-04 01:03:07,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt
[default1]:[2022-03-04 01:03:07,404] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt
[default7]:[2022-03-04 01:03:07,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt
[default2]:[2022-03-04 01:03:07,556] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default4]:[2022-03-04 01:03:07,552] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt
[default5]:[2022-03-04 01:03:07,552] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt
[default6]:[2022-03-04 01:03:07,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt
[default0]:[2022-03-04 01:03:07,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt
[default3]:[2022-03-04 01:03:07,757] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt
[default2]:[2022-03-04 01:03:07,717] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt
[default4]:[2022-03-04 01:03:07,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default5]:[2022-03-04 01:03:07,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt
[default2]:[2022-03-04 01:03:07,876] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt
[default0]:[2022-03-04 01:03:07,906] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default1]:[2022-03-04 01:03:07,921] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt
[default1]:[2022-03-04 01:03:07,954] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt
[default2]:[2022-03-04 01:03:07,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt
[default7]:[2022-03-04 01:03:07,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt
[default0]:[2022-03-04 01:03:07,980] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt
[default3]:[2022-03-04 01:03:08,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt
[default3]:[2022-03-04 01:03:08,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt
[default0]:[2022-03-04 01:03:08,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt
[default0]:[2022-03-04 01:03:08,122] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt
[default6]:[2022-03-04 01:03:08,221] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt
[default5]:[2022-03-04 01:03:08,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt
[default1]:[2022-03-04 01:03:08,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt
[default0]:[2022-03-04 01:03:08,176] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt
[default7]:[2022-03-04 01:03:08,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt
[default1]:[2022-03-04 01:03:08,267] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt
[default5]:[2022-03-04 01:03:08,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt
[default3]:[2022-03-04 01:03:08,284] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt
[default4]:[2022-03-04 01:03:08,322] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt
[default1]:[2022-03-04 01:03:08,354] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt
[default6]:[2022-03-04 01:03:08,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt
[default7]:[2022-03-04 01:03:08,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt
[default1]:[2022-03-04 01:03:08,441] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt
[default5]:[2022-03-04 01:03:08,444] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt
[default6]:[2022-03-04 01:03:08,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt
[default1]:[2022-03-04 01:03:08,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt
[default2]:[2022-03-04 01:03:08,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt
[default7]:[2022-03-04 01:03:08,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt
[default3]:[2022-03-04 01:03:08,513] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default6]:[2022-03-04 01:03:08,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt
[default6]:[2022-03-04 01:03:08,536] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt
[default3]:[2022-03-04 01:03:08,631] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt
[default1]:[2022-03-04 01:03:08,617] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default7]:[2022-03-04 01:03:08,630] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default2]:[2022-03-04 01:03:08,655] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt
[default6]:[2022-03-04 01:03:08,661] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default7]:[2022-03-04 01:03:08,659] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt
[default3]:[2022-03-04 01:03:08,660] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt
[default1]:[2022-03-04 01:03:08,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt
[default5]:[2022-03-04 01:03:08,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt
[default0]:[2022-03-04 01:03:08,768] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt
[default4]:[2022-03-04 01:03:08,837] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default3]:[2022-03-04 01:03:08,843] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt
[default3]:[2022-03-04 01:03:08,853] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default2]:[2022-03-04 01:03:08,883] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt
[default0]:[2022-03-04 01:03:08,959] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt
[default2]:[2022-03-04 01:03:08,979] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt
[default5]:[2022-03-04 01:03:08,933] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default5]:[2022-03-04 01:03:09,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt
[default2]:[2022-03-04 01:03:09,143] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt
[default4]:[2022-03-04 01:03:09,195] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt
[default7]:[2022-03-04 01:03:09,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt
[default4]:[2022-03-04 01:03:09,281] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default1]:[2022-03-04 01:03:09,363] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt
[default4]:[2022-03-04 01:03:09,347] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt
[default0]:[2022-03-04 01:03:09,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt
[default0]:[2022-03-04 01:03:09,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt
[default1]:[2022-03-04 01:03:09,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt
[default1]:[2022-03-04 01:03:09,542] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt
[default3]:[2022-03-04 01:03:09,567] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt
[default0]:[2022-03-04 01:03:09,860] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt
[default5]:[2022-03-04 01:03:09,891] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt
[default5]:[2022-03-04 01:03:09,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt
[default2]:[2022-03-04 01:03:09,927] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt
[default2]:[2022-03-04 01:03:10,000] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt
[default1]:[2022-03-04 01:03:10,133] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt
[default5]:[2022-03-04 01:03:10,126] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt
[default3]:[2022-03-04 01:03:10,297] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt
[default2]:[2022-03-04 01:03:10,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt
[default7]:[2022-03-04 01:03:10,319] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt
[default4]:[2022-03-04 01:03:10,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt
[default5]:[2022-03-04 01:03:10,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt
[default1]:[2022-03-04 01:03:10,371] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt
[default6]:[2022-03-04 01:03:10,476] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt
[default5]:[2022-03-04 01:03:10,449] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default2]:[2022-03-04 01:03:10,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt
[default6]:[2022-03-04 01:03:10,535] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt
[default3]:[2022-03-04 01:03:10,588] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt
[default4]:[2022-03-04 01:03:10,554] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt
[default7]:[2022-03-04 01:03:10,606] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt
[default3]:[2022-03-04 01:03:10,576] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default0]:[2022-03-04 01:03:10,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt
[default4]:[2022-03-04 01:03:10,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
[default5]:[2022-03-04 01:03:10,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
[default3]:[2022-03-04 01:03:10,777] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt
[default2]:[2022-03-04 01:03:10,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt
[default4]:[2022-03-04 01:03:10,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default0]:[2022-03-04 01:03:10,853] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt
[default1]:[2022-03-04 01:03:10,808] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt
[default3]:[2022-03-04 01:03:10,864] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt
[default4]:[2022-03-04 01:03:10,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
[default5]:[2022-03-04 01:03:10,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt
[default5]:[2022-03-04 01:03:11,001] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt
[default2]:[2022-03-04 01:03:11,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt
[default7]:[2022-03-04 01:03:11,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt
[default3]:[2022-03-04 01:03:11,147] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default4]:[2022-03-04 01:03:11,186] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt
[default7]:[2022-03-04 01:03:11,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default1]:[2022-03-04 01:03:11,199] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default6]:[2022-03-04 01:03:11,218] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt
[default6]:[2022-03-04 01:03:11,284] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt
[default0]:[2022-03-04 01:03:11,363] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default2]:[2022-03-04 01:03:11,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt
[default6]:[2022-03-04 01:03:11,411] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt
[default7]:[2022-03-04 01:03:11,438] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt
[default3]:[2022-03-04 01:03:11,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt
[default2]:[2022-03-04 01:03:11,514] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt
[default4]:[2022-03-04 01:03:11,527] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt
[default4]:[2022-03-04 01:03:11,577] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt
[default3]:[2022-03-04 01:03:11,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt
[default7]:[2022-03-04 01:03:11,753] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt
[default2]:[2022-03-04 01:03:11,764] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default6]:[2022-03-04 01:03:11,878] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt
[default3]:[2022-03-04 01:03:11,902] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt
[default5]:[2022-03-04 01:03:11,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt
[default0]:[2022-03-04 01:03:11,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default4]:[2022-03-04 01:03:11,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt
[default0]:[2022-03-04 01:03:11,990] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt
[default6]:[2022-03-04 01:03:12,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt
[default5]:[2022-03-04 01:03:12,017] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt
[default7]:[2022-03-04 01:03:12,108] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt
[default6]:[2022-03-04 01:03:12,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default0]:[2022-03-04 01:03:12,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt
[default1]:[2022-03-04 01:03:12,204] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt
[default5]:[2022-03-04 01:03:12,262] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default1]:[2022-03-04 01:03:12,186] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt
[default5]:[2022-03-04 01:03:12,269] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default4]:[2022-03-04 01:03:12,427] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default1]:[2022-03-04 01:03:12,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt
[default1]:[2022-03-04 01:03:12,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default6]:[2022-03-04 01:03:12,663] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt
[default4]:[2022-03-04 01:03:12,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default4]:[2022-03-04 01:03:12,734] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt
[default4]:[2022-03-04 01:03:12,847] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt
[default7]:[2022-03-04 01:03:12,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt
[default5]:[2022-03-04 01:03:13,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default6]:[2022-03-04 01:03:13,011] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt
[default4]:[2022-03-04 01:03:12,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt
[default3]:[2022-03-04 01:03:13,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt
[default5]:[2022-03-04 01:03:13,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt
[default1]:[2022-03-04 01:03:13,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt
[default7]:[2022-03-04 01:03:13,157] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt
[default2]:[2022-03-04 01:03:13,169] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt
[default6]:[2022-03-04 01:03:13,181] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt
[default3]:[2022-03-04 01:03:13,173] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default2]:[2022-03-04 01:03:13,168] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt
[default2]:[2022-03-04 01:03:13,238] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt
[default5]:[2022-03-04 01:03:13,407] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default4]:[2022-03-04 01:03:13,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default1]:[2022-03-04 01:03:13,388] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default7]:[2022-03-04 01:03:13,411] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt
[default6]:[2022-03-04 01:03:13,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default7]:[2022-03-04 01:03:13,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default0]:[2022-03-04 01:03:13,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default7]:[2022-03-04 01:03:13,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default5]:[2022-03-04 01:03:13,543] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt
[default7]:[2022-03-04 01:03:13,545] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default3]:[2022-03-04 01:03:13,542] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default2]:[2022-03-04 01:03:13,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default3]:[2022-03-04 01:03:13,669] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt
[default4]:[2022-03-04 01:03:13,701] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt
[default0]:[2022-03-04 01:03:13,785] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt
[default2]:[2022-03-04 01:03:13,925] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt
[default2]:[2022-03-04 01:03:13,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default3]:[2022-03-04 01:03:13,962] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt
[default0]:[2022-03-04 01:03:14,007] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt
[default4]:[2022-03-04 01:03:13,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt
[default7]:[2022-03-04 01:03:14,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt
[default6]:[2022-03-04 01:03:14,066] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default6]:[2022-03-04 01:03:14,113] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt
[default0]:[2022-03-04 01:03:14,120] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default7]:[2022-03-04 01:03:14,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt
[default1]:[2022-03-04 01:03:14,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default4]:[2022-03-04 01:03:14,140] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt
[default6]:[2022-03-04 01:03:14,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default0]:[2022-03-04 01:03:14,158] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default2]:[2022-03-04 01:03:14,147] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt
[default5]:[2022-03-04 01:03:14,226] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default3]:[2022-03-04 01:03:14,345] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default1]:[2022-03-04 01:03:14,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt
[default3]:[2022-03-04 01:03:14,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default2]:[2022-03-04 01:03:14,374] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt
[default6]:[2022-03-04 01:03:14,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default3]:[2022-03-04 01:03:14,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt
[default2]:[2022-03-04 01:03:14,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default2]:[2022-03-04 01:03:14,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default7]:[2022-03-04 01:03:14,482] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default3]:[2022-03-04 01:03:14,516] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt
[default2]:[2022-03-04 01:03:14,643] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt
[default7]:[2022-03-04 01:03:14,699] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default6]:[2022-03-04 01:03:14,751] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default0]:[2022-03-04 01:03:14,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt
[default5]:[2022-03-04 01:03:14,722] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default4]:[2022-03-04 01:03:14,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default3]:[2022-03-04 01:03:14,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt
[default7]:[2022-03-04 01:03:14,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt
[default3]:[2022-03-04 01:03:15,058] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt
[default3]:[2022-03-04 01:03:15,307] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default7]:[2022-03-04 01:03:15,430] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt
[default6]:[2022-03-04 01:03:15,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt
[default3]:[2022-03-04 01:03:15,436] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default6]:[2022-03-04 01:03:15,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default3]:[2022-03-04 01:03:15,557] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt
[default4]:[2022-03-04 01:03:15,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt
[default1]:[2022-03-04 01:03:15,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default7]:[2022-03-04 01:03:15,574] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default5]:[2022-03-04 01:03:15,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt
[default7]:[2022-03-04 01:03:15,672] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default5]:[2022-03-04 01:03:15,730] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default3]:[2022-03-04 01:03:15,691] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt
[default1]:[2022-03-04 01:03:15,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt
[default3]:[2022-03-04 01:03:15,875] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt
[default2]:[2022-03-04 01:03:15,919] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt
[default0]:[2022-03-04 01:03:15,945] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt
[default5]:[2022-03-04 01:03:15,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt
[default6]:[2022-03-04 01:03:16,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default3]:[2022-03-04 01:03:16,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default0]:[2022-03-04 01:03:16,181] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt
[default6]:[2022-03-04 01:03:16,211] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default0]:[2022-03-04 01:03:16,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
[default2]:[2022-03-04 01:03:16,253] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default2]:[2022-03-04 01:03:16,310] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt
[default6]:[2022-03-04 01:03:16,365] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt
[default1]:[2022-03-04 01:03:16,379] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt
[default0]:[2022-03-04 01:03:16,378] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default1]:[2022-03-04 01:03:16,389] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default7]:[2022-03-04 01:03:16,358] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default1]:[2022-03-04 01:03:16,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt
[default4]:[2022-03-04 01:03:16,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt
[default5]:[2022-03-04 01:03:16,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt
[default3]:[2022-03-04 01:03:16,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default1]:[2022-03-04 01:03:16,489] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt
[default2]:[2022-03-04 01:03:16,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt
[default0]:[2022-03-04 01:03:16,571] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt
[default6]:[2022-03-04 01:03:16,659] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default4]:[2022-03-04 01:03:16,646] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default2]:[2022-03-04 01:03:16,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default1]:[2022-03-04 01:03:16,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default1]:[2022-03-04 01:03:16,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default7]:[2022-03-04 01:03:16,851] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default6]:[2022-03-04 01:03:16,912] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default4]:[2022-03-04 01:03:16,941] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default6]:[2022-03-04 01:03:16,811] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default2]:[2022-03-04 01:03:16,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default7]:[2022-03-04 01:03:16,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default2]:[2022-03-04 01:03:17,026] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default1]:[2022-03-04 01:03:16,860] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default7]:[2022-03-04 01:03:17,134] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default0]:[2022-03-04 01:03:17,147] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default0]:[2022-03-04 01:03:17,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default5]:[2022-03-04 01:03:17,143] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default7]:[2022-03-04 01:03:17,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default7]:[2022-03-04 01:03:17,269] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default7]:[2022-03-04 01:03:17,414] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default1]:[2022-03-04 01:03:17,380] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default2]:[2022-03-04 01:03:17,481] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default6]:[2022-03-04 01:03:17,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default2]:[2022-03-04 01:03:17,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default0]:[2022-03-04 01:03:17,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default1]:[2022-03-04 01:03:17,633] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default3]:[2022-03-04 01:03:17,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default3]:[2022-03-04 01:03:17,662] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default6]:[2022-03-04 01:03:17,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default0]:[2022-03-04 01:03:17,963] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default0]:[2022-03-04 01:03:17,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default1]:[2022-03-04 01:03:18,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default6]:[2022-03-04 01:03:18,049] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default5]:[2022-03-04 01:03:18,201] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default5]:[2022-03-04 01:03:18,510] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default7]:[2022-03-04 01:03:18,539] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default1]:[2022-03-04 01:03:18,498] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default0]:[2022-03-04 01:03:18,572] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default1]:[2022-03-04 01:03:18,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default5]:[2022-03-04 01:03:18,632] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default7]:[2022-03-04 01:03:18,679] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default0]:[2022-03-04 01:03:18,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default6]:[2022-03-04 01:03:18,706] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default4]:[2022-03-04 01:03:18,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default1]:[2022-03-04 01:03:18,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default2]:[2022-03-04 01:03:18,807] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default6]:[2022-03-04 01:03:18,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default4]:[2022-03-04 01:03:18,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default3]:[2022-03-04 01:03:18,805] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default4]:[2022-03-04 01:03:18,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default7]:[2022-03-04 01:03:18,871] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default0]:[2022-03-04 01:03:18,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default4]:[2022-03-04 01:03:18,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default5]:[2022-03-04 01:03:18,944] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default5]:[2022-03-04 01:03:18,918] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default4]:[2022-03-04 01:03:18,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default0]:[2022-03-04 01:03:19,062] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default1]:[2022-03-04 01:03:19,113] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default2]:[2022-03-04 01:03:19,125] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default6]:[2022-03-04 01:03:19,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default4]:[2022-03-04 01:03:19,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default2]:[2022-03-04 01:03:19,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default7]:[2022-03-04 01:03:19,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default3]:[2022-03-04 01:03:19,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default5]:[2022-03-04 01:03:19,224] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default5]:[2022-03-04 01:03:19,274] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default0]:[2022-03-04 01:03:19,372] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default1]:[2022-03-04 01:03:19,381] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default1]:[2022-03-04 01:03:19,394] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default0]:[2022-03-04 01:03:19,631] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default3]:[2022-03-04 01:03:19,729] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default4]:[2022-03-04 01:03:19,792] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default4]:[2022-03-04 01:03:19,855] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default6]:[2022-03-04 01:03:19,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default0]:[2022-03-04 01:03:20,045] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default1]:[2022-03-04 01:03:19,971] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default6]:[2022-03-04 01:03:20,338] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default7]:[2022-03-04 01:03:20,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default6]:[2022-03-04 01:03:20,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default7]:[2022-03-04 01:03:20,376] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default5]:[2022-03-04 01:03:20,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default6]:[2022-03-04 01:03:20,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default0]:[2022-03-04 01:03:20,421] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default4]:[2022-03-04 01:03:20,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default7]:[2022-03-04 01:03:20,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default5]:[2022-03-04 01:03:20,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default3]:[2022-03-04 01:03:20,708] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default2]:[2022-03-04 01:03:20,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default5]:[2022-03-04 01:03:20,858] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default4]:[2022-03-04 01:03:20,881] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default3]:[2022-03-04 01:03:21,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default2]:[2022-03-04 01:03:21,812] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default5]:[2022-03-04 01:03:22,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default7]:time (ms) | save-checkpoint: 37771.40
[default0]:  successfully saved checkpoint at iteration    4500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default4]:[2022-03-04 01:03:22,298] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default7]: iteration     4501/  128728 | consumed samples:        72016 | consumed tokens:    147488768 | elapsed time per iteration (s): 52.98 | learning rate: 2.360E-05 | global batch size:    16 | lm loss: 5.325017E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.302 | TFLOPs: 2.31 |
[default7]: iteration     4502/  128728 | consumed samples:        72032 | consumed tokens:    147521536 | elapsed time per iteration (s): 15.24 | learning rate: 2.360E-05 | global batch size:    16 | lm loss: 5.206753E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4503/  128728 | consumed samples:        72048 | consumed tokens:    147554304 | elapsed time per iteration (s): 15.17 | learning rate: 2.361E-05 | global batch size:    16 | lm loss: 5.296180E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     4504/  128728 | consumed samples:        72064 | consumed tokens:    147587072 | elapsed time per iteration (s): 15.21 | learning rate: 2.361E-05 | global batch size:    16 | lm loss: 5.398469E+00 | grad norm: 1.169 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4505/  128728 | consumed samples:        72080 | consumed tokens:    147619840 | elapsed time per iteration (s): 15.23 | learning rate: 2.362E-05 | global batch size:    16 | lm loss: 5.553847E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4506/  128728 | consumed samples:        72096 | consumed tokens:    147652608 | elapsed time per iteration (s): 15.21 | learning rate: 2.362E-05 | global batch size:    16 | lm loss: 5.168607E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4507/  128728 | consumed samples:        72112 | consumed tokens:    147685376 | elapsed time per iteration (s): 15.21 | learning rate: 2.363E-05 | global batch size:    16 | lm loss: 5.327075E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4508/  128728 | consumed samples:        72128 | consumed tokens:    147718144 | elapsed time per iteration (s): 15.14 | learning rate: 2.363E-05 | global batch size:    16 | lm loss: 5.287652E+00 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     4509/  128728 | consumed samples:        72144 | consumed tokens:    147750912 | elapsed time per iteration (s): 15.25 | learning rate: 2.364E-05 | global batch size:    16 | lm loss: 5.220299E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4510/  128728 | consumed samples:        72160 | consumed tokens:    147783680 | elapsed time per iteration (s): 15.25 | learning rate: 2.365E-05 | global batch size:    16 | lm loss: 5.033144E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4511/  128728 | consumed samples:        72176 | consumed tokens:    147816448 | elapsed time per iteration (s): 15.22 | learning rate: 2.365E-05 | global batch size:    16 | lm loss: 5.558724E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4512/  128728 | consumed samples:        72192 | consumed tokens:    147849216 | elapsed time per iteration (s): 15.14 | learning rate: 2.366E-05 | global batch size:    16 | lm loss: 5.105259E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration     4513/  128728 | consumed samples:        72208 | consumed tokens:    147881984 | elapsed time per iteration (s): 15.21 | learning rate: 2.366E-05 | global batch size:    16 | lm loss: 5.260120E+00 | grad norm: 0.663 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4514/  128728 | consumed samples:        72224 | consumed tokens:    147914752 | elapsed time per iteration (s): 15.23 | learning rate: 2.367E-05 | global batch size:    16 | lm loss: 4.996598E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4515/  128728 | consumed samples:        72240 | consumed tokens:    147947520 | elapsed time per iteration (s): 15.19 | learning rate: 2.367E-05 | global batch size:    16 | lm loss: 5.050591E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4516/  128728 | consumed samples:        72256 | consumed tokens:    147980288 | elapsed time per iteration (s): 15.22 | learning rate: 2.368E-05 | global batch size:    16 | lm loss: 5.226483E+00 | grad norm: 1.428 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4517/  128728 | consumed samples:        72272 | consumed tokens:    148013056 | elapsed time per iteration (s): 15.22 | learning rate: 2.368E-05 | global batch size:    16 | lm loss: 4.994648E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4518/  128728 | consumed samples:        72288 | consumed tokens:    148045824 | elapsed time per iteration (s): 15.19 | learning rate: 2.369E-05 | global batch size:    16 | lm loss: 5.458125E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     4519/  128728 | consumed samples:        72304 | consumed tokens:    148078592 | elapsed time per iteration (s): 15.22 | learning rate: 2.369E-05 | global batch size:    16 | lm loss: 5.218241E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4520/  128728 | consumed samples:        72320 | consumed tokens:    148111360 | elapsed time per iteration (s): 15.21 | learning rate: 2.370E-05 | global batch size:    16 | lm loss: 5.292453E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4521/  128728 | consumed samples:        72336 | consumed tokens:    148144128 | elapsed time per iteration (s): 15.22 | learning rate: 2.370E-05 | global batch size:    16 | lm loss: 5.189533E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4522/  128728 | consumed samples:        72352 | consumed tokens:    148176896 | elapsed time per iteration (s): 15.21 | learning rate: 2.371E-05 | global batch size:    16 | lm loss: 5.006850E+00 | grad norm: 2.157 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4523/  128728 | consumed samples:        72368 | consumed tokens:    148209664 | elapsed time per iteration (s): 15.21 | learning rate: 2.371E-05 | global batch size:    16 | lm loss: 5.264925E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4524/  128728 | consumed samples:        72384 | consumed tokens:    148242432 | elapsed time per iteration (s): 15.22 | learning rate: 2.372E-05 | global batch size:    16 | lm loss: 5.519560E+00 | grad norm: 0.644 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4525/  128728 | consumed samples:        72400 | consumed tokens:    148275200 | elapsed time per iteration (s): 15.21 | learning rate: 2.372E-05 | global batch size:    16 | lm loss: 5.181821E+00 | grad norm: 1.153 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4526/  128728 | consumed samples:        72416 | consumed tokens:    148307968 | elapsed time per iteration (s): 15.22 | learning rate: 2.373E-05 | global batch size:    16 | lm loss: 5.311499E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4527/  128728 | consumed samples:        72432 | consumed tokens:    148340736 | elapsed time per iteration (s): 15.18 | learning rate: 2.373E-05 | global batch size:    16 | lm loss: 5.167645E+00 | grad norm: 0.852 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4528/  128728 | consumed samples:        72448 | consumed tokens:    148373504 | elapsed time per iteration (s): 15.25 | learning rate: 2.374E-05 | global batch size:    16 | lm loss: 5.202123E+00 | grad norm: 1.083 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     4529/  128728 | consumed samples:        72464 | consumed tokens:    148406272 | elapsed time per iteration (s): 15.22 | learning rate: 2.375E-05 | global batch size:    16 | lm loss: 5.369713E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4530/  128728 | consumed samples:        72480 | consumed tokens:    148439040 | elapsed time per iteration (s): 15.23 | learning rate: 2.375E-05 | global batch size:    16 | lm loss: 5.040470E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4531/  128728 | consumed samples:        72496 | consumed tokens:    148471808 | elapsed time per iteration (s): 15.26 | learning rate: 2.376E-05 | global batch size:    16 | lm loss: 5.086207E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4532/  128728 | consumed samples:        72512 | consumed tokens:    148504576 | elapsed time per iteration (s): 15.22 | learning rate: 2.376E-05 | global batch size:    16 | lm loss: 5.150359E+00 | grad norm: 1.260 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4533/  128728 | consumed samples:        72528 | consumed tokens:    148537344 | elapsed time per iteration (s): 15.24 | learning rate: 2.377E-05 | global batch size:    16 | lm loss: 5.247553E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4534/  128728 | consumed samples:        72544 | consumed tokens:    148570112 | elapsed time per iteration (s): 15.23 | learning rate: 2.377E-05 | global batch size:    16 | lm loss: 5.214560E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4535/  128728 | consumed samples:        72560 | consumed tokens:    148602880 | elapsed time per iteration (s): 15.21 | learning rate: 2.378E-05 | global batch size:    16 | lm loss: 5.090154E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4536/  128728 | consumed samples:        72576 | consumed tokens:    148635648 | elapsed time per iteration (s): 15.23 | learning rate: 2.378E-05 | global batch size:    16 | lm loss: 4.961235E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4537/  128728 | consumed samples:        72592 | consumed tokens:    148668416 | elapsed time per iteration (s): 15.20 | learning rate: 2.379E-05 | global batch size:    16 | lm loss: 5.200741E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4538/  128728 | consumed samples:        72608 | consumed tokens:    148701184 | elapsed time per iteration (s): 15.25 | learning rate: 2.379E-05 | global batch size:    16 | lm loss: 5.063721E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration     4539/  128728 | consumed samples:        72624 | consumed tokens:    148733952 | elapsed time per iteration (s): 15.21 | learning rate: 2.380E-05 | global batch size:    16 | lm loss: 5.377962E+00 | grad norm: 0.984 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4540/  128728 | consumed samples:        72640 | consumed tokens:    148766720 | elapsed time per iteration (s): 15.21 | learning rate: 2.380E-05 | global batch size:    16 | lm loss: 5.393027E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4541/  128728 | consumed samples:        72656 | consumed tokens:    148799488 | elapsed time per iteration (s): 15.19 | learning rate: 2.381E-05 | global batch size:    16 | lm loss: 5.115465E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     4542/  128728 | consumed samples:        72672 | consumed tokens:    148832256 | elapsed time per iteration (s): 15.23 | learning rate: 2.381E-05 | global batch size:    16 | lm loss: 5.172780E+00 | grad norm: 1.032 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4543/  128728 | consumed samples:        72688 | consumed tokens:    148865024 | elapsed time per iteration (s): 15.21 | learning rate: 2.382E-05 | global batch size:    16 | lm loss: 5.387748E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4544/  128728 | consumed samples:        72704 | consumed tokens:    148897792 | elapsed time per iteration (s): 15.21 | learning rate: 2.382E-05 | global batch size:    16 | lm loss: 5.250667E+00 | grad norm: 1.622 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4545/  128728 | consumed samples:        72720 | consumed tokens:    148930560 | elapsed time per iteration (s): 15.22 | learning rate: 2.383E-05 | global batch size:    16 | lm loss: 5.358253E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4546/  128728 | consumed samples:        72736 | consumed tokens:    148963328 | elapsed time per iteration (s): 15.21 | learning rate: 2.383E-05 | global batch size:    16 | lm loss: 5.096012E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4547/  128728 | consumed samples:        72752 | consumed tokens:    148996096 | elapsed time per iteration (s): 15.20 | learning rate: 2.384E-05 | global batch size:    16 | lm loss: 4.942961E+00 | grad norm: 1.755 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4548/  128728 | consumed samples:        72768 | consumed tokens:    149028864 | elapsed time per iteration (s): 15.21 | learning rate: 2.384E-05 | global batch size:    16 | lm loss: 5.277761E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4549/  128728 | consumed samples:        72784 | consumed tokens:    149061632 | elapsed time per iteration (s): 15.22 | learning rate: 2.385E-05 | global batch size:    16 | lm loss: 5.401462E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4550/  128728 | consumed samples:        72800 | consumed tokens:    149094400 | elapsed time per iteration (s): 15.25 | learning rate: 2.386E-05 | global batch size:    16 | lm loss: 5.125511E+00 | grad norm: 0.660 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4551/  128728 | consumed samples:        72816 | consumed tokens:    149127168 | elapsed time per iteration (s): 15.23 | learning rate: 2.386E-05 | global batch size:    16 | lm loss: 5.149467E+00 | grad norm: 1.018 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4552/  128728 | consumed samples:        72832 | consumed tokens:    149159936 | elapsed time per iteration (s): 15.22 | learning rate: 2.387E-05 | global batch size:    16 | lm loss: 5.229480E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4553/  128728 | consumed samples:        72848 | consumed tokens:    149192704 | elapsed time per iteration (s): 15.20 | learning rate: 2.387E-05 | global batch size:    16 | lm loss: 5.411103E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4554/  128728 | consumed samples:        72864 | consumed tokens:    149225472 | elapsed time per iteration (s): 15.20 | learning rate: 2.388E-05 | global batch size:    16 | lm loss: 5.420312E+00 | grad norm: 1.593 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4555/  128728 | consumed samples:        72880 | consumed tokens:    149258240 | elapsed time per iteration (s): 15.21 | learning rate: 2.388E-05 | global batch size:    16 | lm loss: 5.258182E+00 | grad norm: 0.650 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4556/  128728 | consumed samples:        72896 | consumed tokens:    149291008 | elapsed time per iteration (s): 15.20 | learning rate: 2.389E-05 | global batch size:    16 | lm loss: 5.368918E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4557/  128728 | consumed samples:        72912 | consumed tokens:    149323776 | elapsed time per iteration (s): 15.21 | learning rate: 2.389E-05 | global batch size:    16 | lm loss: 5.145999E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4558/  128728 | consumed samples:        72928 | consumed tokens:    149356544 | elapsed time per iteration (s): 15.19 | learning rate: 2.390E-05 | global batch size:    16 | lm loss: 5.343250E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4559/  128728 | consumed samples:        72944 | consumed tokens:    149389312 | elapsed time per iteration (s): 15.20 | learning rate: 2.390E-05 | global batch size:    16 | lm loss: 5.249984E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4560/  128728 | consumed samples:        72960 | consumed tokens:    149422080 | elapsed time per iteration (s): 15.23 | learning rate: 2.391E-05 | global batch size:    16 | lm loss: 5.127768E+00 | grad norm: 0.637 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4561/  128728 | consumed samples:        72976 | consumed tokens:    149454848 | elapsed time per iteration (s): 15.20 | learning rate: 2.391E-05 | global batch size:    16 | lm loss: 5.086662E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4562/  128728 | consumed samples:        72992 | consumed tokens:    149487616 | elapsed time per iteration (s): 15.20 | learning rate: 2.392E-05 | global batch size:    16 | lm loss: 5.438632E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4563/  128728 | consumed samples:        73008 | consumed tokens:    149520384 | elapsed time per iteration (s): 15.22 | learning rate: 2.392E-05 | global batch size:    16 | lm loss: 5.137195E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4564/  128728 | consumed samples:        73024 | consumed tokens:    149553152 | elapsed time per iteration (s): 15.17 | learning rate: 2.393E-05 | global batch size:    16 | lm loss: 5.080501E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4565/  128728 | consumed samples:        73040 | consumed tokens:    149585920 | elapsed time per iteration (s): 15.21 | learning rate: 2.393E-05 | global batch size:    16 | lm loss: 5.107949E+00 | grad norm: 1.129 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4566/  128728 | consumed samples:        73056 | consumed tokens:    149618688 | elapsed time per iteration (s): 15.22 | learning rate: 2.394E-05 | global batch size:    16 | lm loss: 5.110487E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4567/  128728 | consumed samples:        73072 | consumed tokens:    149651456 | elapsed time per iteration (s): 15.22 | learning rate: 2.394E-05 | global batch size:    16 | lm loss: 5.108166E+00 | grad norm: 1.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4568/  128728 | consumed samples:        73088 | consumed tokens:    149684224 | elapsed time per iteration (s): 15.25 | learning rate: 2.395E-05 | global batch size:    16 | lm loss: 5.194795E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4569/  128728 | consumed samples:        73104 | consumed tokens:    149716992 | elapsed time per iteration (s): 15.22 | learning rate: 2.395E-05 | global batch size:    16 | lm loss: 5.168123E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4570/  128728 | consumed samples:        73120 | consumed tokens:    149749760 | elapsed time per iteration (s): 15.20 | learning rate: 2.396E-05 | global batch size:    16 | lm loss: 5.355202E+00 | grad norm: 0.956 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4571/  128728 | consumed samples:        73136 | consumed tokens:    149782528 | elapsed time per iteration (s): 15.25 | learning rate: 2.397E-05 | global batch size:    16 | lm loss: 5.210347E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4572/  128728 | consumed samples:        73152 | consumed tokens:    149815296 | elapsed time per iteration (s): 15.17 | learning rate: 2.397E-05 | global batch size:    16 | lm loss: 5.141915E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration     4573/  128728 | consumed samples:        73168 | consumed tokens:    149848064 | elapsed time per iteration (s): 15.19 | learning rate: 2.398E-05 | global batch size:    16 | lm loss: 5.015357E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4574/  128728 | consumed samples:        73184 | consumed tokens:    149880832 | elapsed time per iteration (s): 15.24 | learning rate: 2.398E-05 | global batch size:    16 | lm loss: 5.284767E+00 | grad norm: 1.642 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4575/  128728 | consumed samples:        73200 | consumed tokens:    149913600 | elapsed time per iteration (s): 15.25 | learning rate: 2.399E-05 | global batch size:    16 | lm loss: 5.151593E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4576/  128728 | consumed samples:        73216 | consumed tokens:    149946368 | elapsed time per iteration (s): 15.24 | learning rate: 2.399E-05 | global batch size:    16 | lm loss: 5.201889E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4577/  128728 | consumed samples:        73232 | consumed tokens:    149979136 | elapsed time per iteration (s): 15.19 | learning rate: 2.400E-05 | global batch size:    16 | lm loss: 5.358136E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4578/  128728 | consumed samples:        73248 | consumed tokens:    150011904 | elapsed time per iteration (s): 15.21 | learning rate: 2.400E-05 | global batch size:    16 | lm loss: 5.094169E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4579/  128728 | consumed samples:        73264 | consumed tokens:    150044672 | elapsed time per iteration (s): 15.21 | learning rate: 2.401E-05 | global batch size:    16 | lm loss: 5.261844E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4580/  128728 | consumed samples:        73280 | consumed tokens:    150077440 | elapsed time per iteration (s): 15.24 | learning rate: 2.401E-05 | global batch size:    16 | lm loss: 5.281607E+00 | grad norm: 1.033 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4581/  128728 | consumed samples:        73296 | consumed tokens:    150110208 | elapsed time per iteration (s): 15.23 | learning rate: 2.402E-05 | global batch size:    16 | lm loss: 5.304956E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4582/  128728 | consumed samples:        73312 | consumed tokens:    150142976 | elapsed time per iteration (s): 15.23 | learning rate: 2.402E-05 | global batch size:    16 | lm loss: 4.882883E+00 | grad norm: 1.048 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4583/  128728 | consumed samples:        73328 | consumed tokens:    150175744 | elapsed time per iteration (s): 15.19 | learning rate: 2.403E-05 | global batch size:    16 | lm loss: 4.978672E+00 | grad norm: 0.938 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     4584/  128728 | consumed samples:        73344 | consumed tokens:    150208512 | elapsed time per iteration (s): 15.20 | learning rate: 2.403E-05 | global batch size:    16 | lm loss: 5.311226E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4585/  128728 | consumed samples:        73360 | consumed tokens:    150241280 | elapsed time per iteration (s): 15.18 | learning rate: 2.404E-05 | global batch size:    16 | lm loss: 5.109036E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4586/  128728 | consumed samples:        73376 | consumed tokens:    150274048 | elapsed time per iteration (s): 15.19 | learning rate: 2.404E-05 | global batch size:    16 | lm loss: 5.296421E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     4587/  128728 | consumed samples:        73392 | consumed tokens:    150306816 | elapsed time per iteration (s): 15.23 | learning rate: 2.405E-05 | global batch size:    16 | lm loss: 5.218729E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4588/  128728 | consumed samples:        73408 | consumed tokens:    150339584 | elapsed time per iteration (s): 15.19 | learning rate: 2.405E-05 | global batch size:    16 | lm loss: 5.307782E+00 | grad norm: 0.932 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     4589/  128728 | consumed samples:        73424 | consumed tokens:    150372352 | elapsed time per iteration (s): 15.20 | learning rate: 2.406E-05 | global batch size:    16 | lm loss: 5.305587E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4590/  128728 | consumed samples:        73440 | consumed tokens:    150405120 | elapsed time per iteration (s): 15.21 | learning rate: 2.406E-05 | global batch size:    16 | lm loss: 5.118801E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4591/  128728 | consumed samples:        73456 | consumed tokens:    150437888 | elapsed time per iteration (s): 15.25 | learning rate: 2.407E-05 | global batch size:    16 | lm loss: 5.188321E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4592/  128728 | consumed samples:        73472 | consumed tokens:    150470656 | elapsed time per iteration (s): 15.22 | learning rate: 2.408E-05 | global batch size:    16 | lm loss: 5.110636E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4593/  128728 | consumed samples:        73488 | consumed tokens:    150503424 | elapsed time per iteration (s): 15.23 | learning rate: 2.408E-05 | global batch size:    16 | lm loss: 5.186214E+00 | grad norm: 0.971 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4594/  128728 | consumed samples:        73504 | consumed tokens:    150536192 | elapsed time per iteration (s): 15.16 | learning rate: 2.409E-05 | global batch size:    16 | lm loss: 5.208476E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     4595/  128728 | consumed samples:        73520 | consumed tokens:    150568960 | elapsed time per iteration (s): 15.21 | learning rate: 2.409E-05 | global batch size:    16 | lm loss: 5.445783E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4596/  128728 | consumed samples:        73536 | consumed tokens:    150601728 | elapsed time per iteration (s): 15.24 | learning rate: 2.410E-05 | global batch size:    16 | lm loss: 5.119749E+00 | grad norm: 1.123 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4597/  128728 | consumed samples:        73552 | consumed tokens:    150634496 | elapsed time per iteration (s): 15.22 | learning rate: 2.410E-05 | global batch size:    16 | lm loss: 5.283437E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4598/  128728 | consumed samples:        73568 | consumed tokens:    150667264 | elapsed time per iteration (s): 15.20 | learning rate: 2.411E-05 | global batch size:    16 | lm loss: 5.217893E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4599/  128728 | consumed samples:        73584 | consumed tokens:    150700032 | elapsed time per iteration (s): 15.19 | learning rate: 2.411E-05 | global batch size:    16 | lm loss: 5.280108E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4600/  128728 | consumed samples:        73600 | consumed tokens:    150732800 | elapsed time per iteration (s): 15.21 | learning rate: 2.412E-05 | global batch size:    16 | lm loss: 5.047978E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4601/  128728 | consumed samples:        73616 | consumed tokens:    150765568 | elapsed time per iteration (s): 15.21 | learning rate: 2.412E-05 | global batch size:    16 | lm loss: 5.097135E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4602/  128728 | consumed samples:        73632 | consumed tokens:    150798336 | elapsed time per iteration (s): 15.23 | learning rate: 2.413E-05 | global batch size:    16 | lm loss: 5.246779E+00 | grad norm: 0.959 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4603/  128728 | consumed samples:        73648 | consumed tokens:    150831104 | elapsed time per iteration (s): 15.21 | learning rate: 2.413E-05 | global batch size:    16 | lm loss: 5.140010E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4604/  128728 | consumed samples:        73664 | consumed tokens:    150863872 | elapsed time per iteration (s): 15.22 | learning rate: 2.414E-05 | global batch size:    16 | lm loss: 5.305626E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4605/  128728 | consumed samples:        73680 | consumed tokens:    150896640 | elapsed time per iteration (s): 15.21 | learning rate: 2.414E-05 | global batch size:    16 | lm loss: 4.927595E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4606/  128728 | consumed samples:        73696 | consumed tokens:    150929408 | elapsed time per iteration (s): 15.21 | learning rate: 2.415E-05 | global batch size:    16 | lm loss: 5.303552E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4607/  128728 | consumed samples:        73712 | consumed tokens:    150962176 | elapsed time per iteration (s): 15.23 | learning rate: 2.415E-05 | global batch size:    16 | lm loss: 5.152580E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4608/  128728 | consumed samples:        73728 | consumed tokens:    150994944 | elapsed time per iteration (s): 15.21 | learning rate: 2.416E-05 | global batch size:    16 | lm loss: 5.410992E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4609/  128728 | consumed samples:        73744 | consumed tokens:    151027712 | elapsed time per iteration (s): 15.22 | learning rate: 2.416E-05 | global batch size:    16 | lm loss: 5.160129E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4610/  128728 | consumed samples:        73760 | consumed tokens:    151060480 | elapsed time per iteration (s): 15.22 | learning rate: 2.417E-05 | global batch size:    16 | lm loss: 5.234522E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4611/  128728 | consumed samples:        73776 | consumed tokens:    151093248 | elapsed time per iteration (s): 15.22 | learning rate: 2.417E-05 | global batch size:    16 | lm loss: 5.088044E+00 | grad norm: 0.638 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4612/  128728 | consumed samples:        73792 | consumed tokens:    151126016 | elapsed time per iteration (s): 15.22 | learning rate: 2.418E-05 | global batch size:    16 | lm loss: 5.261300E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4613/  128728 | consumed samples:        73808 | consumed tokens:    151158784 | elapsed time per iteration (s): 15.23 | learning rate: 2.419E-05 | global batch size:    16 | lm loss: 5.207508E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4614/  128728 | consumed samples:        73824 | consumed tokens:    151191552 | elapsed time per iteration (s): 15.23 | learning rate: 2.419E-05 | global batch size:    16 | lm loss: 5.234620E+00 | grad norm: 1.571 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4615/  128728 | consumed samples:        73840 | consumed tokens:    151224320 | elapsed time per iteration (s): 15.26 | learning rate: 2.420E-05 | global batch size:    16 | lm loss: 5.073845E+00 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     4616/  128728 | consumed samples:        73856 | consumed tokens:    151257088 | elapsed time per iteration (s): 15.24 | learning rate: 2.420E-05 | global batch size:    16 | lm loss: 4.991200E+00 | grad norm: 1.026 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4617/  128728 | consumed samples:        73872 | consumed tokens:    151289856 | elapsed time per iteration (s): 15.22 | learning rate: 2.421E-05 | global batch size:    16 | lm loss: 5.139315E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4618/  128728 | consumed samples:        73888 | consumed tokens:    151322624 | elapsed time per iteration (s): 15.21 | learning rate: 2.421E-05 | global batch size:    16 | lm loss: 5.159419E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4619/  128728 | consumed samples:        73904 | consumed tokens:    151355392 | elapsed time per iteration (s): 15.20 | learning rate: 2.422E-05 | global batch size:    16 | lm loss: 5.040611E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4620/  128728 | consumed samples:        73920 | consumed tokens:    151388160 | elapsed time per iteration (s): 15.23 | learning rate: 2.422E-05 | global batch size:    16 | lm loss: 5.300824E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4621/  128728 | consumed samples:        73936 | consumed tokens:    151420928 | elapsed time per iteration (s): 15.22 | learning rate: 2.423E-05 | global batch size:    16 | lm loss: 5.181660E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4622/  128728 | consumed samples:        73952 | consumed tokens:    151453696 | elapsed time per iteration (s): 15.20 | learning rate: 2.423E-05 | global batch size:    16 | lm loss: 5.045792E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4623/  128728 | consumed samples:        73968 | consumed tokens:    151486464 | elapsed time per iteration (s): 15.27 | learning rate: 2.424E-05 | global batch size:    16 | lm loss: 4.973166E+00 | grad norm: 0.913 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.02 |
[default7]: iteration     4624/  128728 | consumed samples:        73984 | consumed tokens:    151519232 | elapsed time per iteration (s): 15.28 | learning rate: 2.424E-05 | global batch size:    16 | lm loss: 5.020543E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.047 | TFLOPs: 8.02 |
[default7]: iteration     4625/  128728 | consumed samples:        74000 | consumed tokens:    151552000 | elapsed time per iteration (s): 15.23 | learning rate: 2.425E-05 | global batch size:    16 | lm loss: 5.428620E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4626/  128728 | consumed samples:        74016 | consumed tokens:    151584768 | elapsed time per iteration (s): 15.24 | learning rate: 2.425E-05 | global batch size:    16 | lm loss: 5.210262E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4627/  128728 | consumed samples:        74032 | consumed tokens:    151617536 | elapsed time per iteration (s): 15.23 | learning rate: 2.426E-05 | global batch size:    16 | lm loss: 5.440079E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4628/  128728 | consumed samples:        74048 | consumed tokens:    151650304 | elapsed time per iteration (s): 15.21 | learning rate: 2.426E-05 | global batch size:    16 | lm loss: 5.092575E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4629/  128728 | consumed samples:        74064 | consumed tokens:    151683072 | elapsed time per iteration (s): 15.19 | learning rate: 2.427E-05 | global batch size:    16 | lm loss: 5.193363E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration     4630/  128728 | consumed samples:        74080 | consumed tokens:    151715840 | elapsed time per iteration (s): 15.22 | learning rate: 2.427E-05 | global batch size:    16 | lm loss: 5.114410E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4631/  128728 | consumed samples:        74096 | consumed tokens:    151748608 | elapsed time per iteration (s): 15.24 | learning rate: 2.428E-05 | global batch size:    16 | lm loss: 5.234143E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4632/  128728 | consumed samples:        74112 | consumed tokens:    151781376 | elapsed time per iteration (s): 15.17 | learning rate: 2.429E-05 | global batch size:    16 | lm loss: 5.123685E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4633/  128728 | consumed samples:        74128 | consumed tokens:    151814144 | elapsed time per iteration (s): 15.21 | learning rate: 2.429E-05 | global batch size:    16 | lm loss: 5.050491E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4634/  128728 | consumed samples:        74144 | consumed tokens:    151846912 | elapsed time per iteration (s): 15.18 | learning rate: 2.430E-05 | global batch size:    16 | lm loss: 5.095937E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4635/  128728 | consumed samples:        74160 | consumed tokens:    151879680 | elapsed time per iteration (s): 15.21 | learning rate: 2.430E-05 | global batch size:    16 | lm loss: 5.212114E+00 | grad norm: 1.000 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4636/  128728 | consumed samples:        74176 | consumed tokens:    151912448 | elapsed time per iteration (s): 15.22 | learning rate: 2.431E-05 | global batch size:    16 | lm loss: 5.032464E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4637/  128728 | consumed samples:        74192 | consumed tokens:    151945216 | elapsed time per iteration (s): 15.21 | learning rate: 2.431E-05 | global batch size:    16 | lm loss: 5.050924E+00 | grad norm: 0.609 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4638/  128728 | consumed samples:        74208 | consumed tokens:    151977984 | elapsed time per iteration (s): 15.18 | learning rate: 2.432E-05 | global batch size:    16 | lm loss: 5.182425E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4639/  128728 | consumed samples:        74224 | consumed tokens:    152010752 | elapsed time per iteration (s): 15.22 | learning rate: 2.432E-05 | global batch size:    16 | lm loss: 4.925577E+00 | grad norm: 1.052 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4640/  128728 | consumed samples:        74240 | consumed tokens:    152043520 | elapsed time per iteration (s): 15.24 | learning rate: 2.433E-05 | global batch size:    16 | lm loss: 5.214342E+00 | grad norm: 1.465 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4641/  128728 | consumed samples:        74256 | consumed tokens:    152076288 | elapsed time per iteration (s): 15.22 | learning rate: 2.433E-05 | global batch size:    16 | lm loss: 5.029734E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4642/  128728 | consumed samples:        74272 | consumed tokens:    152109056 | elapsed time per iteration (s): 15.20 | learning rate: 2.434E-05 | global batch size:    16 | lm loss: 5.284323E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4643/  128728 | consumed samples:        74288 | consumed tokens:    152141824 | elapsed time per iteration (s): 15.23 | learning rate: 2.434E-05 | global batch size:    16 | lm loss: 5.124467E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4644/  128728 | consumed samples:        74304 | consumed tokens:    152174592 | elapsed time per iteration (s): 15.25 | learning rate: 2.435E-05 | global batch size:    16 | lm loss: 5.336272E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4645/  128728 | consumed samples:        74320 | consumed tokens:    152207360 | elapsed time per iteration (s): 15.23 | learning rate: 2.435E-05 | global batch size:    16 | lm loss: 5.227530E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4646/  128728 | consumed samples:        74336 | consumed tokens:    152240128 | elapsed time per iteration (s): 15.20 | learning rate: 2.436E-05 | global batch size:    16 | lm loss: 5.086015E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4647/  128728 | consumed samples:        74352 | consumed tokens:    152272896 | elapsed time per iteration (s): 15.22 | learning rate: 2.436E-05 | global batch size:    16 | lm loss: 5.259191E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4648/  128728 | consumed samples:        74368 | consumed tokens:    152305664 | elapsed time per iteration (s): 15.21 | learning rate: 2.437E-05 | global batch size:    16 | lm loss: 5.258114E+00 | grad norm: 1.670 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4649/  128728 | consumed samples:        74384 | consumed tokens:    152338432 | elapsed time per iteration (s): 15.24 | learning rate: 2.437E-05 | global batch size:    16 | lm loss: 4.993548E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4650/  128728 | consumed samples:        74400 | consumed tokens:    152371200 | elapsed time per iteration (s): 15.23 | learning rate: 2.438E-05 | global batch size:    16 | lm loss: 5.435277E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4651/  128728 | consumed samples:        74416 | consumed tokens:    152403968 | elapsed time per iteration (s): 15.17 | learning rate: 2.438E-05 | global batch size:    16 | lm loss: 5.278158E+00 | grad norm: 1.384 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration     4652/  128728 | consumed samples:        74432 | consumed tokens:    152436736 | elapsed time per iteration (s): 15.24 | learning rate: 2.439E-05 | global batch size:    16 | lm loss: 5.258286E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4653/  128728 | consumed samples:        74448 | consumed tokens:    152469504 | elapsed time per iteration (s): 15.22 | learning rate: 2.440E-05 | global batch size:    16 | lm loss: 5.106120E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4654/  128728 | consumed samples:        74464 | consumed tokens:    152502272 | elapsed time per iteration (s): 15.21 | learning rate: 2.440E-05 | global batch size:    16 | lm loss: 5.292189E+00 | grad norm: 1.181 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4655/  128728 | consumed samples:        74480 | consumed tokens:    152535040 | elapsed time per iteration (s): 15.21 | learning rate: 2.441E-05 | global batch size:    16 | lm loss: 5.328452E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4656/  128728 | consumed samples:        74496 | consumed tokens:    152567808 | elapsed time per iteration (s): 15.23 | learning rate: 2.441E-05 | global batch size:    16 | lm loss: 4.942339E+00 | grad norm: 0.959 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4657/  128728 | consumed samples:        74512 | consumed tokens:    152600576 | elapsed time per iteration (s): 15.19 | learning rate: 2.442E-05 | global batch size:    16 | lm loss: 5.283966E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4658/  128728 | consumed samples:        74528 | consumed tokens:    152633344 | elapsed time per iteration (s): 15.23 | learning rate: 2.442E-05 | global batch size:    16 | lm loss: 5.224200E+00 | grad norm: 1.609 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4659/  128728 | consumed samples:        74544 | consumed tokens:    152666112 | elapsed time per iteration (s): 15.21 | learning rate: 2.443E-05 | global batch size:    16 | lm loss: 5.286874E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4660/  128728 | consumed samples:        74560 | consumed tokens:    152698880 | elapsed time per iteration (s): 15.24 | learning rate: 2.443E-05 | global batch size:    16 | lm loss: 5.279421E+00 | grad norm: 1.003 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4661/  128728 | consumed samples:        74576 | consumed tokens:    152731648 | elapsed time per iteration (s): 15.22 | learning rate: 2.444E-05 | global batch size:    16 | lm loss: 5.168081E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4662/  128728 | consumed samples:        74592 | consumed tokens:    152764416 | elapsed time per iteration (s): 15.22 | learning rate: 2.444E-05 | global batch size:    16 | lm loss: 4.949018E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4663/  128728 | consumed samples:        74608 | consumed tokens:    152797184 | elapsed time per iteration (s): 15.25 | learning rate: 2.445E-05 | global batch size:    16 | lm loss: 5.186502E+00 | grad norm: 4.359 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4664/  128728 | consumed samples:        74624 | consumed tokens:    152829952 | elapsed time per iteration (s): 15.21 | learning rate: 2.445E-05 | global batch size:    16 | lm loss: 5.038185E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4665/  128728 | consumed samples:        74640 | consumed tokens:    152862720 | elapsed time per iteration (s): 15.20 | learning rate: 2.446E-05 | global batch size:    16 | lm loss: 5.148849E+00 | grad norm: 1.297 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4666/  128728 | consumed samples:        74656 | consumed tokens:    152895488 | elapsed time per iteration (s): 15.20 | learning rate: 2.446E-05 | global batch size:    16 | lm loss: 5.156718E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4667/  128728 | consumed samples:        74672 | consumed tokens:    152928256 | elapsed time per iteration (s): 15.23 | learning rate: 2.447E-05 | global batch size:    16 | lm loss: 5.175311E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4668/  128728 | consumed samples:        74688 | consumed tokens:    152961024 | elapsed time per iteration (s): 15.20 | learning rate: 2.447E-05 | global batch size:    16 | lm loss: 5.137317E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4669/  128728 | consumed samples:        74704 | consumed tokens:    152993792 | elapsed time per iteration (s): 15.22 | learning rate: 2.448E-05 | global batch size:    16 | lm loss: 5.099137E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4670/  128728 | consumed samples:        74720 | consumed tokens:    153026560 | elapsed time per iteration (s): 15.24 | learning rate: 2.448E-05 | global batch size:    16 | lm loss: 5.166493E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4671/  128728 | consumed samples:        74736 | consumed tokens:    153059328 | elapsed time per iteration (s): 15.19 | learning rate: 2.449E-05 | global batch size:    16 | lm loss: 5.057539E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4672/  128728 | consumed samples:        74752 | consumed tokens:    153092096 | elapsed time per iteration (s): 15.22 | learning rate: 2.449E-05 | global batch size:    16 | lm loss: 5.268323E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4673/  128728 | consumed samples:        74768 | consumed tokens:    153124864 | elapsed time per iteration (s): 15.25 | learning rate: 2.450E-05 | global batch size:    16 | lm loss: 4.986012E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4674/  128728 | consumed samples:        74784 | consumed tokens:    153157632 | elapsed time per iteration (s): 15.21 | learning rate: 2.451E-05 | global batch size:    16 | lm loss: 4.991606E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4675/  128728 | consumed samples:        74800 | consumed tokens:    153190400 | elapsed time per iteration (s): 15.22 | learning rate: 2.451E-05 | global batch size:    16 | lm loss: 5.181803E+00 | grad norm: 0.923 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4676/  128728 | consumed samples:        74816 | consumed tokens:    153223168 | elapsed time per iteration (s): 15.19 | learning rate: 2.452E-05 | global batch size:    16 | lm loss: 5.250779E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4677/  128728 | consumed samples:        74832 | consumed tokens:    153255936 | elapsed time per iteration (s): 15.22 | learning rate: 2.452E-05 | global batch size:    16 | lm loss: 5.169383E+00 | grad norm: 2.261 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4678/  128728 | consumed samples:        74848 | consumed tokens:    153288704 | elapsed time per iteration (s): 15.21 | learning rate: 2.453E-05 | global batch size:    16 | lm loss: 4.975980E+00 | grad norm: 1.189 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4679/  128728 | consumed samples:        74864 | consumed tokens:    153321472 | elapsed time per iteration (s): 15.15 | learning rate: 2.453E-05 | global batch size:    16 | lm loss: 5.177567E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.056 | TFLOPs: 8.08 |
[default7]: iteration     4680/  128728 | consumed samples:        74880 | consumed tokens:    153354240 | elapsed time per iteration (s): 15.24 | learning rate: 2.454E-05 | global batch size:    16 | lm loss: 5.231998E+00 | grad norm: 1.587 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4681/  128728 | consumed samples:        74896 | consumed tokens:    153387008 | elapsed time per iteration (s): 15.22 | learning rate: 2.454E-05 | global batch size:    16 | lm loss: 5.044895E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4682/  128728 | consumed samples:        74912 | consumed tokens:    153419776 | elapsed time per iteration (s): 15.22 | learning rate: 2.455E-05 | global batch size:    16 | lm loss: 5.186457E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4683/  128728 | consumed samples:        74928 | consumed tokens:    153452544 | elapsed time per iteration (s): 15.21 | learning rate: 2.455E-05 | global batch size:    16 | lm loss: 5.240637E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration     4684/  128728 | consumed samples:        74944 | consumed tokens:    153485312 | elapsed time per iteration (s): 15.19 | learning rate: 2.456E-05 | global batch size:    16 | lm loss: 5.068531E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4685/  128728 | consumed samples:        74960 | consumed tokens:    153518080 | elapsed time per iteration (s): 15.23 | learning rate: 2.456E-05 | global batch size:    16 | lm loss: 5.105819E+00 | grad norm: 1.225 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4686/  128728 | consumed samples:        74976 | consumed tokens:    153550848 | elapsed time per iteration (s): 15.22 | learning rate: 2.457E-05 | global batch size:    16 | lm loss: 5.010415E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4687/  128728 | consumed samples:        74992 | consumed tokens:    153583616 | elapsed time per iteration (s): 15.23 | learning rate: 2.457E-05 | global batch size:    16 | lm loss: 5.187891E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4688/  128728 | consumed samples:        75008 | consumed tokens:    153616384 | elapsed time per iteration (s): 15.22 | learning rate: 2.458E-05 | global batch size:    16 | lm loss: 5.193148E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4689/  128728 | consumed samples:        75024 | consumed tokens:    153649152 | elapsed time per iteration (s): 15.22 | learning rate: 2.458E-05 | global batch size:    16 | lm loss: 5.094107E+00 | grad norm: 1.136 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4690/  128728 | consumed samples:        75040 | consumed tokens:    153681920 | elapsed time per iteration (s): 15.23 | learning rate: 2.459E-05 | global batch size:    16 | lm loss: 5.274774E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4691/  128728 | consumed samples:        75056 | consumed tokens:    153714688 | elapsed time per iteration (s): 15.24 | learning rate: 2.459E-05 | global batch size:    16 | lm loss: 5.135740E+00 | grad norm: 4.086 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4692/  128728 | consumed samples:        75072 | consumed tokens:    153747456 | elapsed time per iteration (s): 15.23 | learning rate: 2.460E-05 | global batch size:    16 | lm loss: 5.176250E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration     4693/  128728 | consumed samples:        75088 | consumed tokens:    153780224 | elapsed time per iteration (s): 15.24 | learning rate: 2.460E-05 | global batch size:    16 | lm loss: 5.101857E+00 | grad norm: 1.251 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration     4694/  128728 | consumed samples:        75104 | consumed tokens:    153812992 | elapsed time per iteration (s): 15.26 | learning rate: 2.461E-05 | global batch size:    16 | lm loss: 5.162605E+00 | grad norm: 1.161 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.048 | TFLOPs: 8.03 |
[default7]: iteration     4695/  128728 | consumed samples:        75120 | consumed tokens:    153845760 | elapsed time per iteration (s): 15.18 | learning rate: 2.462E-05 | global batch size:    16 | lm loss: 5.055284E+00 | grad norm: 1.294 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration     4696/  128728 | consumed samples:        75136 | consumed tokens:    153878528 | elapsed time per iteration (s): 15.26 | learning rate: 2.462E-05 | global batch size:    16 | lm loss: 5.240335E+00 | grad norm: 1.086 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration     4697/  128728 | consumed samples:        75152 | consumed tokens:    153911296 | elapsed time per iteration (s): 15.19 | learning rate: 2.463E-05 | global batch size:    16 | lm loss: 5.174578E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4698/  128728 | consumed samples:        75168 | consumed tokens:    153944064 | elapsed time per iteration (s): 15.20 | learning rate: 2.463E-05 | global batch size:    16 | lm loss: 5.296353E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration     4699/  128728 | consumed samples:        75184 | consumed tokens:    153976832 | elapsed time per iteration (s): 15.22 | learning rate: 2.464E-05 | global batch size:    16 | lm loss: 5.088926E+00 | grad norm: 1.058 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4700/  128728 | consumed samples:        75200 | consumed tokens:    154009600 | elapsed time per iteration (s): 15.22 | learning rate: 2.464E-05 | global batch size:    16 | lm loss: 5.067006E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4701/  128728 | consumed samples:        75216 | consumed tokens:    154042368 | elapsed time per iteration (s): 15.23 | learning rate: 2.465E-05 | global batch size:    16 | lm loss: 5.168745E+00 | grad norm: 0.649 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4702/  128728 | consumed samples:        75232 | consumed tokens:    154075136 | elapsed time per iteration (s): 15.20 | learning rate: 2.465E-05 | global batch size:    16 | lm loss: 5.318768E+00 | grad norm: 2.551 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration     4703/  128728 | consumed samples:        75248 | consumed tokens:    154107904 | elapsed time per iteration (s): 15.23 | learning rate: 2.466E-05 | global batch size:    16 | lm loss: 5.285171E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration     4704/  128728 | consumed samples:        75264 | consumed tokens:    154140672 | elapsed time per iteration (s): 15.25 | learning rate: 2.466E-05 | global batch size:    16 | lm loss: 5.164697E+00 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default0]:saving checkpoint at iteration    4704 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default1]:[2022-03-04 01:55:13,827] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/mp_rank_01_model_states.pt
[default0]:[2022-03-04 01:55:14,066] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/mp_rank_00_model_states.pt
[default2]:[2022-03-04 01:55:28,793] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt
[default4]:[2022-03-04 01:55:28,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt
[default1]:[2022-03-04 01:55:29,016] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt
[default6]:[2022-03-04 01:55:29,059] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt
[default3]:[2022-03-04 01:55:29,184] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default5]:[2022-03-04 01:55:29,255] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt
[default7]:[2022-03-04 01:55:29,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default7]:[2022-03-04 01:55:29,306] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt
[default0]:[2022-03-04 01:55:29,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt
[default0]:[2022-03-04 01:55:29,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt
[default6]:[2022-03-04 01:55:29,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default4]:[2022-03-04 01:55:29,969] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt
[default5]:[2022-03-04 01:55:29,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt
[default1]:[2022-03-04 01:55:29,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt
[default2]:[2022-03-04 01:55:30,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt
[default5]:[2022-03-04 01:55:30,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt
[default6]:[2022-03-04 01:55:30,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt
[default4]:[2022-03-04 01:55:30,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt
[default1]:[2022-03-04 01:55:30,271] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt
[default7]:[2022-03-04 01:55:30,473] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt
[default0]:[2022-03-04 01:55:30,452] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt
[default1]:[2022-03-04 01:55:30,560] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt
[default3]:[2022-03-04 01:55:30,558] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default7]:[2022-03-04 01:55:30,639] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt
[default4]:[2022-03-04 01:55:30,666] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt
[default7]:[2022-03-04 01:55:30,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt
[default0]:[2022-03-04 01:55:30,722] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt
[default3]:[2022-03-04 01:55:30,722] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt
[default4]:[2022-03-04 01:55:30,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default6]:[2022-03-04 01:55:30,863] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt
[default1]:[2022-03-04 01:55:30,921] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default5]:[2022-03-04 01:55:30,896] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt
[default2]:[2022-03-04 01:55:31,008] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt
[default3]:[2022-03-04 01:55:30,949] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt
[default6]:[2022-03-04 01:55:31,005] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt
[default1]:[2022-03-04 01:55:31,094] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default0]:[2022-03-04 01:55:31,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default2]:[2022-03-04 01:55:31,167] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt
[default0]:[2022-03-04 01:55:31,164] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt
[default3]:[2022-03-04 01:55:31,221] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default3]:[2022-03-04 01:55:31,172] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt
[default5]:[2022-03-04 01:55:31,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default2]:[2022-03-04 01:55:31,322] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default5]:[2022-03-04 01:55:31,315] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default3]:[2022-03-04 01:55:31,324] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default6]:[2022-03-04 01:55:31,350] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt
[default7]:[2022-03-04 01:55:31,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt
[default1]:[2022-03-04 01:55:31,401] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default4]:[2022-03-04 01:55:31,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default4]:[2022-03-04 01:55:31,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt
[default2]:[2022-03-04 01:55:31,531] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default7]:[2022-03-04 01:55:31,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default0]:[2022-03-04 01:55:31,474] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default2]:[2022-03-04 01:55:31,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default6]:[2022-03-04 01:55:31,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt
[default1]:[2022-03-04 01:55:31,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default2]:[2022-03-04 01:55:31,568] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt
[default7]:[2022-03-04 01:55:31,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt
[default6]:[2022-03-04 01:55:31,630] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt
[default0]:[2022-03-04 01:55:31,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default6]:[2022-03-04 01:55:31,620] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default0]:[2022-03-04 01:55:31,678] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt
[default0]:[2022-03-04 01:55:31,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt
[default1]:[2022-03-04 01:55:31,669] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt
[default5]:[2022-03-04 01:55:31,753] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt
[default0]:[2022-03-04 01:55:31,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt
[default4]:[2022-03-04 01:55:31,845] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default5]:[2022-03-04 01:55:31,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default4]:[2022-03-04 01:55:31,944] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default5]:[2022-03-04 01:55:31,937] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default4]:[2022-03-04 01:55:31,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt
[default6]:[2022-03-04 01:55:32,006] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt
[default3]:[2022-03-04 01:55:31,946] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt
[default1]:[2022-03-04 01:55:32,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt
[default1]:[2022-03-04 01:55:32,093] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt
[default0]:[2022-03-04 01:55:32,090] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt
[default3]:[2022-03-04 01:55:32,109] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default7]:[2022-03-04 01:55:32,088] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt
[default1]:[2022-03-04 01:55:32,142] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt
[default6]:[2022-03-04 01:55:32,155] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt
[default4]:[2022-03-04 01:55:32,167] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt
[default5]:[2022-03-04 01:55:32,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt
[default7]:[2022-03-04 01:55:32,168] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt
[default3]:[2022-03-04 01:55:32,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt
[default0]:[2022-03-04 01:55:32,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt
[default4]:[2022-03-04 01:55:32,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default2]:[2022-03-04 01:55:32,444] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default1]:[2022-03-04 01:55:32,558] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt
[default5]:[2022-03-04 01:55:32,592] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt
[default4]:[2022-03-04 01:55:32,592] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt
[default7]:[2022-03-04 01:55:32,612] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt
[default0]:[2022-03-04 01:55:32,661] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt
[default2]:[2022-03-04 01:55:32,679] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt
[default4]:[2022-03-04 01:55:32,837] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
[default2]:[2022-03-04 01:55:33,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt
[default3]:[2022-03-04 01:55:33,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt
[default3]:[2022-03-04 01:55:33,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt
[default7]:[2022-03-04 01:55:33,242] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt
[default6]:[2022-03-04 01:55:33,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt
[default7]:[2022-03-04 01:55:33,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default0]:[2022-03-04 01:55:33,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default1]:[2022-03-04 01:55:33,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt
[default2]:[2022-03-04 01:55:33,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt
[default4]:[2022-03-04 01:55:33,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt
[default0]:[2022-03-04 01:55:33,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt
[default5]:[2022-03-04 01:55:33,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt
[default5]:[2022-03-04 01:55:33,482] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt
[default4]:[2022-03-04 01:55:33,451] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt
[default3]:[2022-03-04 01:55:33,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt
[default2]:[2022-03-04 01:55:33,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt
[default2]:[2022-03-04 01:55:33,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt
[default5]:[2022-03-04 01:55:33,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default7]:[2022-03-04 01:55:33,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default7]:[2022-03-04 01:55:33,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt
[default6]:[2022-03-04 01:55:33,697] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt
[default7]:[2022-03-04 01:55:33,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt
[default1]:[2022-03-04 01:55:33,752] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt
[default2]:[2022-03-04 01:55:33,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt
[default1]:[2022-03-04 01:55:33,802] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default3]:[2022-03-04 01:55:33,794] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt
[default6]:[2022-03-04 01:55:33,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt
[default6]:[2022-03-04 01:55:33,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default4]:[2022-03-04 01:55:33,814] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default3]:[2022-03-04 01:55:33,935] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default4]:[2022-03-04 01:55:33,922] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt
[default0]:[2022-03-04 01:55:33,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt
[default6]:[2022-03-04 01:55:33,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default1]:[2022-03-04 01:55:33,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt
[default1]:[2022-03-04 01:55:33,995] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default6]:[2022-03-04 01:55:34,077] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt
[default3]:[2022-03-04 01:55:34,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt
[default5]:[2022-03-04 01:55:34,178] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt
[default2]:[2022-03-04 01:55:34,199] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt
[default3]:[2022-03-04 01:55:34,171] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt
[default2]:[2022-03-04 01:55:34,171] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt
[default5]:[2022-03-04 01:55:34,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt
[default6]:[2022-03-04 01:55:34,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default2]:[2022-03-04 01:55:34,212] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt
[default2]:[2022-03-04 01:55:34,275] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt
[default5]:[2022-03-04 01:55:34,325] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default2]:[2022-03-04 01:55:34,348] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt
[default4]:[2022-03-04 01:55:34,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
[default6]:[2022-03-04 01:55:34,300] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt
[default4]:[2022-03-04 01:55:34,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt
[default3]:[2022-03-04 01:55:34,311] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt
[default7]:[2022-03-04 01:55:34,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt
[default7]:[2022-03-04 01:55:34,414] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt
[default5]:[2022-03-04 01:55:34,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt
[default2]:[2022-03-04 01:55:34,375] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt
[default3]:[2022-03-04 01:55:34,435] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt
[default5]:[2022-03-04 01:55:34,500] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt
[default0]:[2022-03-04 01:55:34,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt
[default1]:[2022-03-04 01:55:34,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default3]:[2022-03-04 01:55:34,508] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt
[default4]:[2022-03-04 01:55:34,508] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default3]:[2022-03-04 01:55:34,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default1]:[2022-03-04 01:55:34,534] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt
[default1]:[2022-03-04 01:55:34,623] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt
[default0]:[2022-03-04 01:55:34,571] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
[default2]:[2022-03-04 01:55:34,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default0]:[2022-03-04 01:55:34,615] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt
[default6]:[2022-03-04 01:55:34,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt
[default3]:[2022-03-04 01:55:34,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default2]:[2022-03-04 01:55:34,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt
[default3]:[2022-03-04 01:55:34,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default1]:[2022-03-04 01:55:34,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt
[default6]:[2022-03-04 01:55:34,729] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default1]:[2022-03-04 01:55:34,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt
[default5]:[2022-03-04 01:55:34,856] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default2]:[2022-03-04 01:55:34,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default7]:[2022-03-04 01:55:34,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt
[default0]:[2022-03-04 01:55:34,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default5]:[2022-03-04 01:55:34,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
[default4]:[2022-03-04 01:55:34,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt
[default5]:[2022-03-04 01:55:34,921] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default7]:[2022-03-04 01:55:34,848] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt
[default3]:[2022-03-04 01:55:34,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt
[default3]:[2022-03-04 01:55:34,864] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default1]:[2022-03-04 01:55:34,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default4]:[2022-03-04 01:55:34,965] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default7]:[2022-03-04 01:55:35,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default5]:[2022-03-04 01:55:35,027] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default7]:[2022-03-04 01:55:35,062] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default0]:[2022-03-04 01:55:35,034] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default3]:[2022-03-04 01:55:35,113] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt
[default3]:[2022-03-04 01:55:35,116] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default2]:[2022-03-04 01:55:35,044] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt
[default0]:[2022-03-04 01:55:35,118] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default0]:[2022-03-04 01:55:35,119] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt
[default5]:[2022-03-04 01:55:35,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default1]:[2022-03-04 01:55:35,213] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt
[default2]:[2022-03-04 01:55:35,216] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt
[default6]:[2022-03-04 01:55:35,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default1]:[2022-03-04 01:55:35,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default0]:[2022-03-04 01:55:35,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt
[default3]:[2022-03-04 01:55:35,278] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt
[default0]:[2022-03-04 01:55:35,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
[default2]:[2022-03-04 01:55:35,290] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default7]:[2022-03-04 01:55:35,342] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt
[default2]:[2022-03-04 01:55:35,363] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default2]:[2022-03-04 01:55:35,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt
[default4]:[2022-03-04 01:55:35,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default0]:[2022-03-04 01:55:35,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt
[default1]:[2022-03-04 01:55:35,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt
[default0]:[2022-03-04 01:55:35,508] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt
[default5]:[2022-03-04 01:55:35,504] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt
[default1]:[2022-03-04 01:55:35,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt
[default4]:[2022-03-04 01:55:35,615] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt
[default2]:[2022-03-04 01:55:35,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt
[default6]:[2022-03-04 01:55:35,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt
[default6]:[2022-03-04 01:55:35,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt
[default3]:[2022-03-04 01:55:35,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt
[default5]:[2022-03-04 01:55:35,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt
[default1]:[2022-03-04 01:55:35,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt
[default4]:[2022-03-04 01:55:35,797] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt
[default4]:[2022-03-04 01:55:35,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt
[default1]:[2022-03-04 01:55:35,831] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt
[default3]:[2022-03-04 01:55:35,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt
[default3]:[2022-03-04 01:55:35,966] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt
[default5]:[2022-03-04 01:55:35,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt
[default7]:[2022-03-04 01:55:35,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt
[default6]:[2022-03-04 01:55:35,964] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt
[default3]:[2022-03-04 01:55:36,006] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default0]:[2022-03-04 01:55:36,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt
[default4]:[2022-03-04 01:55:36,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt
[default5]:[2022-03-04 01:55:36,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt
[default7]:[2022-03-04 01:55:36,189] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt
[default0]:[2022-03-04 01:55:36,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt
[default5]:[2022-03-04 01:55:36,277] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default6]:[2022-03-04 01:55:36,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt
[default7]:[2022-03-04 01:55:36,298] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt
[default5]:[2022-03-04 01:55:36,285] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt
[default2]:[2022-03-04 01:55:36,321] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt
[default3]:[2022-03-04 01:55:36,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt
[default7]:[2022-03-04 01:55:36,360] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt
[default4]:[2022-03-04 01:55:36,363] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default7]:[2022-03-04 01:55:36,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt
[default6]:[2022-03-04 01:55:36,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt
[default4]:[2022-03-04 01:55:36,510] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default4]:[2022-03-04 01:55:36,623] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default2]:[2022-03-04 01:55:36,647] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt
[default2]:[2022-03-04 01:55:36,640] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default5]:[2022-03-04 01:55:36,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default2]:[2022-03-04 01:55:36,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt
[default4]:[2022-03-04 01:55:36,789] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default6]:[2022-03-04 01:55:36,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt
[default5]:[2022-03-04 01:55:36,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default5]:[2022-03-04 01:55:36,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default6]:[2022-03-04 01:55:36,850] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default5]:[2022-03-04 01:55:36,923] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default7]:[2022-03-04 01:55:36,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default2]:[2022-03-04 01:55:36,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default7]:[2022-03-04 01:55:37,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt
[default0]:[2022-03-04 01:55:37,018] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt
[default5]:[2022-03-04 01:55:37,043] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt
[default3]:[2022-03-04 01:55:37,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default2]:[2022-03-04 01:55:37,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt
[default4]:[2022-03-04 01:55:37,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt
[default7]:[2022-03-04 01:55:37,125] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default6]:[2022-03-04 01:55:37,225] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt
[default7]:[2022-03-04 01:55:37,213] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default2]:[2022-03-04 01:55:37,257] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default4]:[2022-03-04 01:55:37,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default6]:[2022-03-04 01:55:37,311] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default7]:[2022-03-04 01:55:37,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default1]:[2022-03-04 01:55:37,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt
[default7]:[2022-03-04 01:55:37,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt
[default2]:[2022-03-04 01:55:37,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt
[default1]:[2022-03-04 01:55:37,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default6]:[2022-03-04 01:55:37,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default1]:[2022-03-04 01:55:37,883] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default5]:[2022-03-04 01:55:38,074] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default6]:[2022-03-04 01:55:38,063] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt
[default4]:[2022-03-04 01:55:38,038] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default2]:[2022-03-04 01:55:38,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt
[default4]:[2022-03-04 01:55:38,175] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt
[default5]:[2022-03-04 01:55:38,205] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default5]:[2022-03-04 01:55:38,137] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt
[default3]:[2022-03-04 01:55:38,277] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt
[default7]:[2022-03-04 01:55:38,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default0]:[2022-03-04 01:55:38,258] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt
[default1]:[2022-03-04 01:55:38,313] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt
[default7]:[2022-03-04 01:55:38,346] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt
[default0]:[2022-03-04 01:55:38,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt
[default0]:[2022-03-04 01:55:38,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt
[default0]:[2022-03-04 01:55:38,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default4]:[2022-03-04 01:55:38,534] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt
[default7]:[2022-03-04 01:55:38,509] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default0]:[2022-03-04 01:55:38,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default5]:[2022-03-04 01:55:38,714] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt
[default0]:[2022-03-04 01:55:38,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default0]:[2022-03-04 01:55:38,723] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt
[default6]:[2022-03-04 01:55:38,774] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default1]:[2022-03-04 01:55:38,811] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt
[default0]:[2022-03-04 01:55:38,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default7]:[2022-03-04 01:55:38,767] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default6]:[2022-03-04 01:55:38,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default6]:[2022-03-04 01:55:39,028] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default6]:[2022-03-04 01:55:39,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt
[default7]:[2022-03-04 01:55:39,014] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default3]:[2022-03-04 01:55:39,083] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default7]:[2022-03-04 01:55:39,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default3]:[2022-03-04 01:55:39,189] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default1]:[2022-03-04 01:55:39,163] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default1]:[2022-03-04 01:55:39,258] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt
[default6]:[2022-03-04 01:55:39,203] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default3]:[2022-03-04 01:55:39,317] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default4]:[2022-03-04 01:55:39,312] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default6]:[2022-03-04 01:55:39,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default2]:[2022-03-04 01:55:39,462] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default2]:[2022-03-04 01:55:39,493] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default7]:[2022-03-04 01:55:39,455] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default3]:[2022-03-04 01:55:39,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt
[default6]:[2022-03-04 01:55:39,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default3]:[2022-03-04 01:55:39,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default1]:[2022-03-04 01:55:39,633] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt
[default2]:[2022-03-04 01:55:39,621] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default4]:[2022-03-04 01:55:39,628] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default3]:[2022-03-04 01:55:39,630] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default6]:[2022-03-04 01:55:39,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default2]:[2022-03-04 01:55:39,677] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt
[default4]:[2022-03-04 01:55:39,721] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt
[default3]:[2022-03-04 01:55:39,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default5]:[2022-03-04 01:55:39,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default7]:[2022-03-04 01:55:39,837] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default5]:[2022-03-04 01:55:39,808] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt
[default4]:[2022-03-04 01:55:39,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default1]:[2022-03-04 01:55:39,959] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default1]:[2022-03-04 01:55:39,912] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt
[default6]:[2022-03-04 01:55:40,193] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default6]:[2022-03-04 01:55:40,160] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt
[default7]:[2022-03-04 01:55:40,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default1]:[2022-03-04 01:55:40,230] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default0]:[2022-03-04 01:55:40,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default3]:[2022-03-04 01:55:40,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt
[default2]:[2022-03-04 01:55:40,360] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default3]:[2022-03-04 01:55:40,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt
[default3]:[2022-03-04 01:55:40,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default2]:[2022-03-04 01:55:40,577] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default2]:[2022-03-04 01:55:40,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt
[default4]:[2022-03-04 01:55:40,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt
[default3]:[2022-03-04 01:55:40,782] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt
[default0]:[2022-03-04 01:55:40,851] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default1]:[2022-03-04 01:55:40,941] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default6]:[2022-03-04 01:55:41,166] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default5]:[2022-03-04 01:55:41,216] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default5]:[2022-03-04 01:55:41,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt
[default3]:[2022-03-04 01:55:41,368] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt
[default4]:[2022-03-04 01:55:41,420] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default3]:[2022-03-04 01:55:41,488] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default5]:[2022-03-04 01:55:41,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default4]:[2022-03-04 01:55:41,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt
[default2]:[2022-03-04 01:55:41,552] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default6]:[2022-03-04 01:55:41,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default0]:[2022-03-04 01:55:41,701] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default1]:[2022-03-04 01:55:41,792] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default7]:[2022-03-04 01:55:41,746] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default0]:[2022-03-04 01:55:41,830] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default7]:[2022-03-04 01:55:41,937] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default1]:[2022-03-04 01:55:42,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default1]:[2022-03-04 01:55:42,069] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default2]:[2022-03-04 01:55:42,096] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default0]:[2022-03-04 01:55:42,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default0]:[2022-03-04 01:55:42,068] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default0]:[2022-03-04 01:55:42,286] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default1]:[2022-03-04 01:55:42,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default4]:[2022-03-04 01:55:42,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt
[default5]:[2022-03-04 01:55:42,591] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt
[default5]:[2022-03-04 01:55:42,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default0]:[2022-03-04 01:55:42,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default6]:[2022-03-04 01:55:42,948] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default1]:[2022-03-04 01:55:43,046] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default7]:[2022-03-04 01:55:43,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default4]:[2022-03-04 01:55:44,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default6]:[2022-03-04 01:55:44,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default7]:[2022-03-04 01:55:44,606] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default0]:[2022-03-04 01:55:44,809] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default1]:[2022-03-04 01:55:44,832] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default5]:[2022-03-04 01:55:45,596] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default4]:[2022-03-04 01:55:45,568] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default6]:[2022-03-04 01:55:45,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default7]:[2022-03-04 01:55:46,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default4]:[2022-03-04 01:55:51,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default5]:[2022-03-04 01:55:51,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default0]:  successfully saved checkpoint at iteration    4704 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:[exiting program after 1190.0250424226126 minutes] datetime: 2022-03-04 01:55:51 
[default7]:time (ms) | save-checkpoint: 44797.43
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[default7]:> setting tensorboard ...
[default0]:Offline mode: forcing local_files_only=True
[default0]:using world size: 384, data-parallel-size: 8, tensor-model-parallel size: 4, pipeline-model-parallel size: 12 
[default0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:PretrainedFromHF
[default0]:accumulate and all-reduce gradients in fp32 for bfloat16 data type.
[default0]:using torch.bfloat16 for parameters ...
[default0]:------------------------ arguments ------------------------
[default0]:  abort_on_unmet_fused_kernel_constraints ......... True
[default0]:  accumulate_allreduce_grads_in_fp32 .............. True
[default0]:  adam_beta1 ...................................... 0.9
[default0]:  adam_beta2 ...................................... 0.95
[default0]:  adam_eps ........................................ 1e-08
[default0]:  adlr_autoresume ................................. False
[default0]:  adlr_autoresume_interval ........................ 1000
[default0]:  apply_query_key_layer_scaling ................... True
[default0]:  apply_residual_connection_post_layernorm ........ False
[default0]:  attention_dropout ............................... 0.1
[default0]:  attention_softmax_in_fp32 ....................... False
[default0]:  bert_binary_head ................................ True
[default0]:  bert_load ....................................... None
[default0]:  bf16 ............................................ True
[default0]:  bias_dropout_fusion ............................. True
[default0]:  bias_gelu_fusion ................................ True
[default0]:  biencoder_projection_dim ........................ 0
[default0]:  biencoder_shared_query_context_model ............ False
[default0]:  block_data_path ................................. None
[default0]:  checkpoint_activations .......................... True
[default0]:  checkpoint_in_cpu ............................... False
[default0]:  checkpoint_num_layers ........................... 1
[default0]:  clip_grad ....................................... 1.0
[default0]:  codecarbon_dir .................................. None
[default0]:  consumed_train_samples .......................... 0
[default0]:  consumed_train_tokens ........................... 0
[default0]:  consumed_valid_samples .......................... 0
[default0]:  contigious_checkpointing ........................ False
[default0]:  cpu_optimizer ................................... False
[default0]:  cpu_torch_adam .................................. False
[default0]:  curriculum_learning ............................. False
[default0]:  data_impl ....................................... mmap
[default0]:  data_parallel_size .............................. 8
[default0]:  data_path ....................................... None
[default0]:  dataloader_type ................................. single
[default0]:  DDP_impl ........................................ local
[default0]:  decoder_seq_length .............................. None
[default0]:  deepscale ....................................... False
[default0]:  deepscale_config ................................ None
[default0]:  deepspeed ....................................... True
[default0]:  deepspeed_activation_checkpointing .............. True
[default0]:  deepspeed_config ................................ ./ds_config.202316.json
[default0]:  deepspeed_mpi ................................... False
[default0]:  distribute_checkpointed_activations ............. False
[default0]:  distributed_backend ............................. nccl
[default0]:  embed_layernorm ................................. True
[default0]:  embedding_path .................................. None
[default0]:  encoder_seq_length .............................. 2048
[default0]:  eod_mask_loss ................................... False
[default0]:  eval_interval ................................... 1000
[default0]:  eval_iters ...................................... 10
[default0]:  eval_only ....................................... None
[default0]:  evidence_data_path .............................. None
[default0]:  exit_duration_in_mins ........................... 5990
[default0]:  exit_interval ................................... None
[default0]:  ffn_hidden_size ................................. 57344
[default0]:  finetune ........................................ False
[default0]:  fp16 ............................................ False
[default0]:  fp16_lm_cross_entropy ........................... False
[default0]:  fp32_residual_connection ........................ False
[default0]:  gigaflos_no_embeds .............................. 0
[default0]:  global_batch_size ............................... 2048
[default0]:  glu_activation .................................. None
[default0]:  hidden_dropout .................................. 0.1
[default0]:  hidden_size ..................................... 14336
[default0]:  hysteresis ...................................... 2
[default0]:  ict_head_size ................................... None
[default0]:  ict_load ........................................ None
[default0]:  img_dim ......................................... 224
[default0]:  indexer_batch_size .............................. 128
[default0]:  indexer_log_interval ............................ 1000
[default0]:  init_method_std ................................. 0.0048
[default0]:  init_method_xavier_uniform ...................... False
[default0]:  initial_loss_scale .............................. 4294967296
[default0]:  kill_switch_path ................................ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/kill-switch-tr11-176B-exp1
[default0]:  kv_channels ..................................... 128
[default0]:  layernorm_epsilon ............................... 1e-05
[default0]:  lazy_mpu_init ................................... None
[default0]:  load ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:  local_rank ...................................... None
[default0]:  log_batch_size_to_tensorboard ................... True
[default0]:  log_interval .................................... 1
[default0]:  log_learning_rate_to_tensorboard ................ True
[default0]:  log_level ....................................... None
[default0]:  log_level_replica ............................... None
[default0]:  log_loss_scale_to_tensorboard ................... True
[default0]:  log_num_zeros_in_grad ........................... False
[default0]:  log_params_norm ................................. False
[default0]:  log_path ........................................ None
[default0]:  log_timers_to_tensorboard ....................... True
[default0]:  log_validation_ppl_to_tensorboard ............... True
[default0]:  loss_on_targets_only ............................ False
[default0]:  loss_scale ...................................... None
[default0]:  loss_scale_window ............................... 1000
[default0]:  lr .............................................. 6e-05
[default0]:  lr_decay_iters .................................. None
[default0]:  lr_decay_samples ................................ 200000000
[default0]:  lr_decay_style .................................. cosine
[default0]:  lr_decay_tokens ................................. None
[default0]:  lr_warmup_fraction .............................. None
[default0]:  lr_warmup_iters ................................. 0
[default0]:  lr_warmup_samples ............................... 183105
[default0]:  make_vocab_size_divisible_by .................... 128
[default0]:  mask_prob ....................................... 0.15
[default0]:  masked_softmax_fusion ........................... True
[default0]:  max_position_embeddings ......................... 2048
[default0]:  memory_centric_tiled_linear ..................... False
[default0]:  merge_file ...................................... None
[default0]:  micro_batch_size ................................ 2
[default0]:  min_loss_scale .................................. 1.0
[default0]:  min_lr .......................................... 6e-06
[default0]:  mmap_warmup ..................................... False
[default0]:  no_load_optim ................................... None
[default0]:  no_load_rng ..................................... None
[default0]:  no_save_optim ................................... None
[default0]:  no_save_rng ..................................... None
[default0]:  num_attention_heads ............................. 112
[default0]:  num_channels .................................... 3
[default0]:  num_classes ..................................... 1000
[default0]:  num_layers ...................................... 70
[default0]:  num_layers_per_virtual_pipeline_stage ........... None
[default0]:  num_workers ..................................... 2
[default0]:  onnx_safe ....................................... None
[default0]:  openai_gelu ..................................... False
[default0]:  optimizer ....................................... adam
[default0]:  override_lr_scheduler ........................... False
[default0]:  pad_vocab_size_to ............................... 250880
[default0]:  params_dtype .................................... torch.bfloat16
[default0]:  partition_activations ........................... False
[default0]:  patch_dim ....................................... 16
[default0]:  pipeline_model_parallel_size .................... 12
[default0]:  position_embedding_type ......................... PositionEmbeddingType.alibi
[default0]:  pp_partition_method ............................. type:transformer|embedding
[default0]:  profile_backward ................................ False
[default0]:  query_in_block_prob ............................. 0.1
[default0]:  rampup_batch_size ............................... ['16', '16', '9_765_625']
[default0]:  rank ............................................ 0
[default0]:  remote_device ................................... none
[default0]:  reset_attention_mask ............................ False
[default0]:  reset_position_ids .............................. False
[default0]:  retriever_report_topk_accuracies ................ []
[default0]:  retriever_score_scaling ......................... False
[default0]:  retriever_seq_length ............................ 256
[default0]:  reweight_loss_based_on_position_frequency ....... False
[default0]:  sample_rate ..................................... 1.0
[default0]:  save ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:  save_interval ................................... 500
[default0]:  scatter_gather_tensors_in_pipeline .............. True
[default0]:  scattered_embeddings ............................ False
[default0]:  seed ............................................ 42
[default0]:  seq_length ...................................... 2048
[default0]:  sgd_momentum .................................... 0.9
[default0]:  short_seq_prob .................................. 0.1
[default0]:  skip_train_iteration_range ...................... None
[default0]:  split ........................................... None
[default0]:  split_transformers .............................. False
[default0]:  synchronize_each_layer .......................... False
[default0]:  tensor_model_parallel_size ...................... 4
[default0]:  tensorboard_dir ................................. /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/tr11-176B-ml-logs/tensorboard
[default0]:  tensorboard_log_interval ........................ 1
[default0]:  tensorboard_queue_size .......................... 5
[default0]:  test_weighted_split_names ....................... ['test']
[default0]:  test_weighted_split_paths ....................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  test_weighted_split_paths_path .................. None
[default0]:  test_weighted_split_splits ...................... [['0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0']]
[default0]:  test_weighted_split_weights ..................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  tile_factor ..................................... 1
[default0]:  titles_data_path ................................ None
[default0]:  tokenizer_name_or_path .......................... bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k
[default0]:  tokenizer_type .................................. PretrainedFromHF
[default0]:  train_iters ..................................... None
[default0]:  train_samples ................................... 220000000
[default0]:  train_tokens .................................... None
[default0]:  train_weighted_split_names ...................... ['train']
[default0]:  train_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  train_weighted_split_paths_path ................. None
[default0]:  train_weighted_split_splits ..................... [['0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949']]
[default0]:  train_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  use_bnb_optimizer ............................... False
[default0]:  use_checkpoint_lr_scheduler ..................... False
[default0]:  use_contiguous_buffers_in_ddp ................... True
[default0]:  use_cpu_initialization .......................... None
[default0]:  use_one_sent_docs ............................... False
[default0]:  use_pin_memory .................................. False
[default0]:  valid_weighted_split_names ...................... ['valid']
[default0]:  valid_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  valid_weighted_split_paths_path ................. None
[default0]:  valid_weighted_split_splits ..................... [['0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999']]
[default0]:  valid_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  virtual_pipeline_model_parallel_size ............ None
[default0]:  vocab_extra_ids ................................. 0
[default0]:  vocab_file ...................................... None
[default0]:  weight_decay .................................... 0.1
[default0]:  world_size ...................................... 384
[default0]:  zero_allgather_bucket_size ...................... 0.0
[default0]:  zero_contigious_gradients ....................... False
[default0]:  zero_reduce_bucket_size ......................... 0.0
[default0]:  zero_reduce_scatter ............................. False
[default0]:  zero_stage ...................................... 0
[default0]:-------------------- end of arguments ---------------------
[default0]:will use batch size rampup starting from global batch size 16 to global batch size 2048 with batch size increments 16 over 9765625 samples.
[default0]:> building PretrainedFromHF tokenizer ...
[default0]: vocab file is un-used. loading tokenizer from pre-trained model
[default0]:Offline mode: forcing local_files_only=True
[default0]:Can't load following files from cache: ['added_tokens_file'] and cannot check if these files are necessary for the tokenizer to operate.
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/special_tokens_map.json from cache at /gpfswork/rech/six/commun/models/b0b3428eb9bea3ef62a6e9983742117e4860f4ec1af66eebce1702b8ec7cb364.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer_config.json from cache at /gpfswork/rech/six/commun/models/31fb66a88196017b3a12c4798e55bcf8a11b312b42dd9429c83f7237c0a8a807.e683c1a11fe6388761e34fd7cddbcd77f3552cefb70e9aca4a4cc72c027c8f40
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer.json from cache at /gpfswork/rech/six/commun/models/b28b4c1d8aed4c72b765cce6a9a7ce8c5460d05a5b4ea6fa5855dff6a721d171.397b0d7316cb89fa15f0bebce2bd6c5e71e92a14e95de167940173a60253b03e
[default0]: > padded vocab (size: 250680) with 200 dummy tokens (new size: 250880)
[default0]:DeepSpeed general environment info:
[default0]:torch install path ............... ['/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch']
[default0]:torch version .................... 1.11.0+cu115
[default0]:torch cuda version ............... 11.5
[default0]:nvcc version ..................... 11.4
[default0]:deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed']
[default0]:deepspeed info ................... 0.6.0+ed26ef4, ed26ef4, olruwase/bf16-updates
[default0]:deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5
[default0]:**** Git info for Megatron: git_hash=0415583 git_branch=sync-meg-lm ****
[default0]:> initializing torch distributed ...
[default0]:> initializing tensor model parallel with size 4
[default0]:> initializing pipeline model parallel with size 12
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252031 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252032 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289979 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251437 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263080 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252333 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251438 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251439 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252033 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251440 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289980 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254984 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289981 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251441 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254820 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289982 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88310 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254821 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251442 closing signal SIGTERM
slurmstepd: error: *** STEP 202316.0 ON jean-zay-iam01 CANCELLED AT 2022-03-04T03:56:57 ***
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252334 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253385 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263081 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251443 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252335 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226151 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263082 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251444 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227877 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252336 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254985 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263083 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227878 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252337 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256749 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254986 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263084 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252193 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230320 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254822 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254987 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263085 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229012 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253386 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88311 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252194 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254988 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263086 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254823 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289983 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254989 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226152 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268918 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252034 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253387 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263087 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289984 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254990 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88312 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252318 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227879 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253388 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226153 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289985 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252035 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254991 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252036 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226154 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289986 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253389 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285878 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88313 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252037 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253390 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256750 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226155 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76675 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254417 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252338 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226156 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252038 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230321 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244719 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253391 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226157 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253392 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252195 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76676 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89307 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226158 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230322 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256751 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229013 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76677 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268919 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252339 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252340 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254824 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252196 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246841 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244720 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254825 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128661 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230323 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252197 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88314 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254826 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268920 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88315 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244607 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230324 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285879 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254827 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252198 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249592 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88316 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252319 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209773 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230325 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285880 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88317 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252199 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268921 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246157 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268922 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285881 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76678 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254418 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230326 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256752 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252200 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89308 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285882 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230327 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244721 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227880 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285883 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229014 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246158 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76679 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285884 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258609 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254419 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89309 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244722 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256753 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246842 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76680 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128662 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250113 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244723 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256754 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285885 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229015 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258610 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76681 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254420 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230965 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244724 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232320 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249593 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256755 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229016 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249003 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247864 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254421 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247917 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249594 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256756 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246843 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244608 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227813 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268923 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247676 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227881 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247121 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209774 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252320 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249595 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246159 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246844 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268924 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252321 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227882 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244609 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268925 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227814 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252322 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246160 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128663 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227883 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209775 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244610 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252323 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258611 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242420 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250114 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247280 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227884 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227815 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89310 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229017 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246161 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128664 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107811 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220558 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76682 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252324 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253853 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89311 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250115 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247865 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254422 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244611 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247918 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249004 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128665 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252325 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227816 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89312 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232321 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229018 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258612 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246162 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247866 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230966 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250116 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89313 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258613 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249005 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246845 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247867 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247919 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232322 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247677 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227817 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89314 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247122 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247868 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246559 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258614 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230967 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227818 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247920 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128666 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247869 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249596 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258615 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247762 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246846 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255033 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242106 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209776 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227819 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247679 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247870 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232323 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247921 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258616 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247871 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230968 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227820 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247281 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209777 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244612 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247680 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232325 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253854 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242421 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229019 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107812 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247282 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220559 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244613 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246163 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247681 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209778 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254423 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253855 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250117 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244614 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246164 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247682 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247283 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220560 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107813 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209779 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254424 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128667 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220561 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247683 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253856 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209780 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128668 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250118 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246560 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107814 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242422 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247284 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220562 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247123 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247684 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247763 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242107 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220563 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246847 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250119 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247285 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253857 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246561 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242423 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242108 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247764 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107815 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220564 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244725 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255034 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247124 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247286 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249006 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242424 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250120 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220565 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242109 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247287 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244726 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247125 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246848 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232326 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247765 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107816 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242425 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242110 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242426 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247922 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255035 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247766 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107817 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242111 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242427 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232327 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242112 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107818 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230969 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242113 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230970 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232328 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246562 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230971 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249007 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230972 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246563 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249008 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247126 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253858 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246564 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249009 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255036 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246565 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249010 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247923 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246566 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247767 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247924 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255037 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247768 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247769 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255038 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255039 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255040 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247127 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247128 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253859 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253860 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296307 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296308 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296309 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296310 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296311 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296312 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296313 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296314 closing signal SIGTERM
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 253741 got signal: 15
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 285766 got signal: 15
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 246447 got signal: 15
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 209662 got signal: 15
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[default0]:using world size: 384, data-parallel-size: 8, tensor-model-parallel size: 4, pipeline-model-parallel size: 12 
[default0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:PretrainedFromHF
[default0]:accumulate and all-reduce gradients in fp32 for bfloat16 data type.
[default0]:using torch.bfloat16 for parameters ...
[default0]:------------------------ arguments ------------------------
[default0]:  abort_on_unmet_fused_kernel_constraints ......... True
[default0]:  accumulate_allreduce_grads_in_fp32 .............. True
[default0]:  adam_beta1 ...................................... 0.9
[default0]:  adam_beta2 ...................................... 0.95
[default0]:  adam_eps ........................................ 1e-08
[default0]:  adlr_autoresume ................................. False
[default0]:  adlr_autoresume_interval ........................ 1000
[default0]:  apply_query_key_layer_scaling ................... True
[default0]:  apply_residual_connection_post_layernorm ........ False
[default0]:  attention_dropout ............................... 0.1
[default0]:  attention_softmax_in_fp32 ....................... False
[default0]:  bert_binary_head ................................ True
[default0]:  bert_load ....................................... None
[default0]:  bf16 ............................................ True
[default0]:  bias_dropout_fusion ............................. True
[default0]:  bias_gelu_fusion ................................ True
[default0]:  biencoder_projection_dim ........................ 0
[default0]:  biencoder_shared_query_context_model ............ False
[default0]:  block_data_path ................................. None
[default0]:  checkpoint_activations .......................... True
[default0]:  checkpoint_in_cpu ............................... False
[default0]:  checkpoint_num_layers ........................... 1
[default0]:  clip_grad ....................................... 1.0
[default0]:  codecarbon_dir .................................. None
[default0]:  consumed_train_samples .......................... 0
[default0]:  consumed_train_tokens ........................... 0
[default0]:  consumed_valid_samples .......................... 0
[default0]:  contigious_checkpointing ........................ False
[default0]:  cpu_optimizer ................................... False
[default0]:  cpu_torch_adam .................................. False
[default0]:  curriculum_learning ............................. False
[default0]:  data_impl ....................................... mmap
[default0]:  data_parallel_size .............................. 8
[default0]:  data_path ....................................... None
[default0]:  dataloader_type ................................. single
[default0]:  DDP_impl ........................................ local
[default0]:  decoder_seq_length .............................. None
[default0]:  deepscale ....................................... False
[default0]:  deepscale_config ................................ None
[default0]:  deepspeed ....................................... True
[default0]:  deepspeed_activation_checkpointing .............. True
[default0]:  deepspeed_config ................................ ./ds_config.202322.json
[default0]:  deepspeed_mpi ................................... False
[default0]:  distribute_checkpointed_activations ............. False
[default0]:  distributed_backend ............................. nccl
[default0]:  embed_layernorm ................................. True
[default0]:  embedding_path .................................. None
[default0]:  encoder_seq_length .............................. 2048
[default0]:  eod_mask_loss ................................... False
[default0]:  eval_interval ................................... 1000
[default0]:  eval_iters ...................................... 10
[default0]:  eval_only ....................................... None
[default0]:  evidence_data_path .............................. None
[default0]:  exit_duration_in_mins ........................... 5990
[default0]:  exit_interval ................................... None
[default0]:  ffn_hidden_size ................................. 57344
[default0]:  finetune ........................................ False
[default0]:  fp16 ............................................ False
[default0]:  fp16_lm_cross_entropy ........................... False
[default0]:  fp32_residual_connection ........................ False
[default0]:  gigaflos_no_embeds .............................. 0
[default0]:  global_batch_size ............................... 2048
[default0]:  glu_activation .................................. None
[default0]:  hidden_dropout .................................. 0.1
[default0]:  hidden_size ..................................... 14336
[default0]:  hysteresis ...................................... 2
[default0]:  ict_head_size ................................... None
[default0]:  ict_load ........................................ None
[default0]:  img_dim ......................................... 224
[default0]:  indexer_batch_size .............................. 128
[default0]:  indexer_log_interval ............................ 1000
[default0]:  init_method_std ................................. 0.0048
[default0]:  init_method_xavier_uniform ...................... False
[default0]:  initial_loss_scale .............................. 4294967296
[default0]:  kill_switch_path ................................ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/kill-switch-tr11-176B-exp1
[default0]:  kv_channels ..................................... 128
[default0]:  layernorm_epsilon ............................... 1e-05
[default0]:  lazy_mpu_init ................................... None
[default0]:  load ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:  local_rank ...................................... None
[default0]:  log_batch_size_to_tensorboard ................... True
[default0]:  log_interval .................................... 1
[default0]:  log_learning_rate_to_tensorboard ................ True
[default0]:  log_level ....................................... None
[default0]:  log_level_replica ............................... None
[default0]:  log_loss_scale_to_tensorboard ................... True
[default0]:  log_num_zeros_in_grad ........................... False
[default0]:  log_params_norm ................................. False
[default0]:  log_path ........................................ None
[default0]:  log_timers_to_tensorboard ....................... True
[default0]:  log_validation_ppl_to_tensorboard ............... True
[default0]:  loss_on_targets_only ............................ False
[default0]:  loss_scale ...................................... None
[default0]:  loss_scale_window ............................... 1000
[default0]:  lr .............................................. 6e-05
[default0]:  lr_decay_iters .................................. None
[default0]:  lr_decay_samples ................................ 200000000
[default0]:  lr_decay_style .................................. cosine
[default0]:  lr_decay_tokens ................................. None
[default0]:  lr_warmup_fraction .............................. None
[default0]:  lr_warmup_iters ................................. 0
[default0]:  lr_warmup_samples ............................... 183105
[default0]:  make_vocab_size_divisible_by .................... 128
[default0]:  mask_prob ....................................... 0.15
[default0]:  masked_softmax_fusion ........................... True
[default0]:  max_position_embeddings ......................... 2048
[default0]:  memory_centric_tiled_linear ..................... False
[default0]:  merge_file ...................................... None
[default0]:  micro_batch_size ................................ 2
[default0]:  min_loss_scale .................................. 1.0
[default0]:  min_lr .......................................... 6e-06
[default0]:  mmap_warmup ..................................... False
[default0]:  no_load_optim ................................... None
[default0]:  no_load_rng ..................................... None
[default0]:  no_save_optim ................................... None
[default0]:  no_save_rng ..................................... None
[default0]:  num_attention_heads ............................. 112
[default0]:  num_channels .................................... 3
[default0]:  num_classes ..................................... 1000
[default0]:  num_layers ...................................... 70
[default0]:  num_layers_per_virtual_pipeline_stage ........... None
[default0]:  num_workers ..................................... 2
[default0]:  onnx_safe ....................................... None
[default0]:  openai_gelu ..................................... False
[default0]:  optimizer ....................................... adam
[default0]:  override_lr_scheduler ........................... False
[default0]:  pad_vocab_size_to ............................... 250880
[default0]:  params_dtype .................................... torch.bfloat16
[default0]:  partition_activations ........................... False
[default0]:  patch_dim ....................................... 16
[default0]:  pipeline_model_parallel_size .................... 12
[default0]:  position_embedding_type ......................... PositionEmbeddingType.alibi
[default0]:  pp_partition_method ............................. type:transformer|embedding
[default0]:  profile_backward ................................ False
[default0]:  query_in_block_prob ............................. 0.1
[default0]:  rampup_batch_size ............................... ['16', '16', '9_765_625']
[default0]:  rank ............................................ 0
[default0]:  remote_device ................................... none
[default0]:  reset_attention_mask ............................ False
[default0]:  reset_position_ids .............................. False
[default0]:  retriever_report_topk_accuracies ................ []
[default0]:  retriever_score_scaling ......................... False
[default0]:  retriever_seq_length ............................ 256
[default0]:  reweight_loss_based_on_position_frequency ....... False
[default0]:  sample_rate ..................................... 1.0
[default0]:  save ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:  save_interval ................................... 500
[default0]:  scatter_gather_tensors_in_pipeline .............. True
[default0]:  scattered_embeddings ............................ False
[default0]:  seed ............................................ 42
[default0]:  seq_length ...................................... 2048
[default0]:  sgd_momentum .................................... 0.9
[default0]:  short_seq_prob .................................. 0.1
[default0]:  skip_train_iteration_range ...................... None
[default0]:  split ........................................... None
[default0]:  split_transformers .............................. False
[default0]:  synchronize_each_layer .......................... False
[default0]:  tensor_model_parallel_size ...................... 4
[default0]:  tensorboard_dir ................................. /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/tr11-176B-ml-logs/tensorboard
[default0]:  tensorboard_log_interval ........................ 1
[default0]:  tensorboard_queue_size .......................... 5
[default0]:  test_weighted_split_names ....................... ['test']
[default0]:  test_weighted_split_paths ....................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  test_weighted_split_paths_path .................. None
[default0]:  test_weighted_split_splits ...................... [['0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0']]
[default0]:  test_weighted_split_weights ..................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  tile_factor ..................................... 1
[default0]:  titles_data_path ................................ None
[default0]:  tokenizer_name_or_path .......................... bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k
[default0]:  tokenizer_type .................................. PretrainedFromHF
[default0]:  train_iters ..................................... None
[default0]:  train_samples ................................... 220000000
[default0]:  train_tokens .................................... None
[default0]:  train_weighted_split_names ...................... ['train']
[default0]:  train_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  train_weighted_split_paths_path ................. None
[default0]:  train_weighted_split_splits ..................... [['0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949']]
[default0]:  train_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  use_bnb_optimizer ............................... False
[default0]:  use_checkpoint_lr_scheduler ..................... False
[default0]:  use_contiguous_buffers_in_ddp ................... True
[default0]:  use_cpu_initialization .......................... None
[default0]:  use_one_sent_docs ............................... False
[default0]:  use_pin_memory .................................. False
[default0]:  valid_weighted_split_names ...................... ['valid']
[default0]:  valid_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  valid_weighted_split_paths_path ................. None
[default0]:  valid_weighted_split_splits ..................... [['0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999']]
[default0]:  valid_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  virtual_pipeline_model_parallel_size ............ None
[default0]:  vocab_extra_ids ................................. 0
[default0]:  vocab_file ...................................... None
[default0]:  weight_decay .................................... 0.1
[default0]:  world_size ...................................... 384
[default0]:  zero_allgather_bucket_size ...................... 0.0
[default0]:  zero_contigious_gradients ....................... False
[default0]:  zero_reduce_bucket_size ......................... 0.0
[default0]:  zero_reduce_scatter ............................. False
[default0]:  zero_stage ...................................... 0
[default0]:-------------------- end of arguments ---------------------
[default0]:will use batch size rampup starting from global batch size 16 to global batch size 2048 with batch size increments 16 over 9765625 samples.
[default0]:> building PretrainedFromHF tokenizer ...
[default0]: vocab file is un-used. loading tokenizer from pre-trained model
[default0]:Offline mode: forcing local_files_only=True
[default0]:Offline mode: forcing local_files_only=True
[default0]:Can't load following files from cache: ['added_tokens_file'] and cannot check if these files are necessary for the tokenizer to operate.
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/special_tokens_map.json from cache at /gpfswork/rech/six/commun/models/b0b3428eb9bea3ef62a6e9983742117e4860f4ec1af66eebce1702b8ec7cb364.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer_config.json from cache at /gpfswork/rech/six/commun/models/31fb66a88196017b3a12c4798e55bcf8a11b312b42dd9429c83f7237c0a8a807.e683c1a11fe6388761e34fd7cddbcd77f3552cefb70e9aca4a4cc72c027c8f40
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer.json from cache at /gpfswork/rech/six/commun/models/b28b4c1d8aed4c72b765cce6a9a7ce8c5460d05a5b4ea6fa5855dff6a721d171.397b0d7316cb89fa15f0bebce2bd6c5e71e92a14e95de167940173a60253b03e
[default7]:> setting tensorboard ...
[default0]: > padded vocab (size: 250680) with 200 dummy tokens (new size: 250880)
[default0]:DeepSpeed general environment info:
[default0]:torch install path ............... ['/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch']
[default0]:torch version .................... 1.11.0+cu115
[default0]:torch cuda version ............... 11.5
[default0]:nvcc version ..................... 11.4
[default0]:deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed']
[default0]:deepspeed info ................... 0.6.0+ed26ef4, ed26ef4, olruwase/bf16-updates
[default0]:deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5
[default0]:**** Git info for Megatron: git_hash=0415583 git_branch=sync-meg-lm ****
[default0]:> initializing torch distributed ...
[default0]:> initializing tensor model parallel with size 4
[default0]:> initializing pipeline model parallel with size 12
[default0]:> setting random seeds to 42 ...
[default0]:[2022-03-04 04:02:43,890] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2760 and data parallel seed: 42
[default0]:> compiling dataset index builder ...
[default0]:make: Entering directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data'
[default0]:make: Nothing to be done for 'default'.
[default0]:make: Leaving directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data'
[default0]:>>> done with dataset index builder. Compilation time: 0.103 seconds
[default0]:> compiling and loading fused kernels ...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module scaled_upper_triang_masked_softmax_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module scaled_upper_triang_masked_softmax_cuda...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module scaled_masked_softmax_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module scaled_masked_softmax_cuda...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module fused_mix_prec_layer_norm_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module fused_mix_prec_layer_norm_cuda...
[default0]:>>> done with compiling and loading fused kernels. Compilation time: 9.563 seconds
[default0]:time to initialize megatron (seconds): 93.559
[default0]:[after megatron is initialized] datetime: 2022-03-04 04:02:53 
[default0]:building GPT model ...
[default0]:[2022-03-04 04:02:53,586] [INFO] [utils.py:828:see_memory_usage] Before Building Model
[default0]:[2022-03-04 04:02:53,587] [INFO] [utils.py:829:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[default0]:[2022-03-04 04:02:53,587] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.25 GB, percent = 8.6%
[default0]:SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
[default0]:Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=0, model=1): 1, ProcessCoord(pipe=0, data=0, model=2): 2, ProcessCoord(pipe=0, data=0, model=3): 3, ProcessCoord(pipe=0, data=1, model=0): 4, ProcessCoord(pipe=0, data=1, model=1): 5, ProcessCoord(pipe=0, data=1, model=2): 6, ProcessCoord(pipe=0, data=1, model=3): 7, ProcessCoord(pipe=0, data=2, model=0): 8, ProcessCoord(pipe=0, data=2, model=1): 9, ProcessCoord(pipe=0, data=2, model=2): 10, ProcessCoord(pipe=0, data=2, model=3): 11, ProcessCoord(pipe=0, data=3, model=0): 12, ProcessCoord(pipe=0, data=3, model=1): 13, ProcessCoord(pipe=0, data=3, model=2): 14, ProcessCoord(pipe=0, data=3, model=3): 15, ProcessCoord(pipe=0, data=4, model=0): 16, ProcessCoord(pipe=0, data=4, model=1): 17, ProcessCoord(pipe=0, data=4, model=2): 18, ProcessCoord(pipe=0, data=4, model=3): 19, ProcessCoord(pipe=0, data=5, model=0): 20, ProcessCoord(pipe=0, data=5, model=1): 21, ProcessCoord(pipe=0, data=5, model=2): 22, ProcessCoord(pipe=0, data=5, model=3): 23, ProcessCoord(pipe=0, data=6, model=0): 24, ProcessCoord(pipe=0, data=6, model=1): 25, ProcessCoord(pipe=0, data=6, model=2): 26, ProcessCoord(pipe=0, data=6, model=3): 27, ProcessCoord(pipe=0, data=7, model=0): 28, ProcessCoord(pipe=0, data=7, model=1): 29, ProcessCoord(pipe=0, data=7, model=2): 30, ProcessCoord(pipe=0, data=7, model=3): 31, ProcessCoord(pipe=1, data=0, model=0): 32, ProcessCoord(pipe=1, data=0, model=1): 33, ProcessCoord(pipe=1, data=0, model=2): 34, ProcessCoord(pipe=1, data=0, model=3): 35, ProcessCoord(pipe=1, data=1, model=0): 36, ProcessCoord(pipe=1, data=1, model=1): 37, ProcessCoord(pipe=1, data=1, model=2): 38, ProcessCoord(pipe=1, data=1, model=3): 39, ProcessCoord(pipe=1, data=2, model=0): 40, ProcessCoord(pipe=1, data=2, model=1): 41, ProcessCoord(pipe=1, data=2, model=2): 42, ProcessCoord(pipe=1, data=2, model=3): 43, ProcessCoord(pipe=1, data=3, model=0): 44, ProcessCoord(pipe=1, data=3, model=1): 45, 
ProcessCoord(pipe=1, data=3, model=2): 46, ProcessCoord(pipe=1, data=3, model=3): 47, ProcessCoord(pipe=1, data=4, model=0): 48, ProcessCoord(pipe=1, data=4, model=1): 49, ProcessCoord(pipe=1, data=4, model=2): 50, ProcessCoord(pipe=1, data=4, model=3): 51, ProcessCoord(pipe=1, data=5, model=0): 52, ProcessCoord(pipe=1, data=5, model=1): 53, ProcessCoord(pipe=1, data=5, model=2): 54, ProcessCoord(pipe=1, data=5, model=3): 55, ProcessCoord(pipe=1, data=6, model=0): 56, ProcessCoord(pipe=1, data=6, model=1): 57, ProcessCoord(pipe=1, data=6, model=2): 58, ProcessCoord(pipe=1, data=6, model=3): 59, ProcessCoord(pipe=1, data=7, model=0): 60, ProcessCoord(pipe=1, data=7, model=1): 61, ProcessCoord(pipe=1, data=7, model=2): 62, ProcessCoord(pipe=1, data=7, model=3): 63, ProcessCoord(pipe=2, data=0, model=0): 64, ProcessCoord(pipe=2, data=0, model=1): 65, ProcessCoord(pipe=2, data=0, model=2): 66, ProcessCoord(pipe=2, data=0, model=3): 67, ProcessCoord(pipe=2, data=1, model=0): 68, ProcessCoord(pipe=2, data=1, model=1): 69, ProcessCoord(pipe=2, data=1, model=2): 70, ProcessCoord(pipe=2, data=1, model=3): 71, ProcessCoord(pipe=2, data=2, model=0): 72, ProcessCoord(pipe=2, data=2, model=1): 73, ProcessCoord(pipe=2, data=2, model=2): 74, ProcessCoord(pipe=2, data=2, model=3): 75, ProcessCoord(pipe=2, data=3, model=0): 76, ProcessCoord(pipe=2, data=3, model=1): 77, ProcessCoord(pipe=2, data=3, model=2): 78, ProcessCoord(pipe=2, data=3, model=3): 79, ProcessCoord(pipe=2, data=4, model=0): 80, ProcessCoord(pipe=2, data=4, model=1): 81, ProcessCoord(pipe=2, data=4, model=2): 82, ProcessCoord(pipe=2, data=4, model=3): 83, ProcessCoord(pipe=2, data=5, model=0): 84, ProcessCoord(pipe=2, data=5, model=1): 85, ProcessCoord(pipe=2, data=5, model=2): 86, ProcessCoord(pipe=2, data=5, model=3): 87, ProcessCoord(pipe=2, data=6, model=0): 88, ProcessCoord(pipe=2, data=6, model=1): 89, ProcessCoord(pipe=2, data=6, model=2): 90, ProcessCoord(pipe=2, data=6, model=3): 91, ProcessCoord(pipe=2, 
data=7, model=0): 92, ProcessCoord(pipe=2, data=7, model=1): 93, ProcessCoord(pipe=2, data=7, model=2): 94, ProcessCoord(pipe=2, data=7, model=3): 95, ProcessCoord(pipe=3, data=0, model=0): 96, ProcessCoord(pipe=3, data=0, model=1): 97, ProcessCoord(pipe=3, data=0, model=2): 98, ProcessCoord(pipe=3, data=0, model=3): 99, ProcessCoord(pipe=3, data=1, model=0): 100, ProcessCoord(pipe=3, data=1, model=1): 101, ProcessCoord(pipe=3, data=1, model=2): 102, ProcessCoord(pipe=3, data=1, model=3): 103, ProcessCoord(pipe=3, data=2, model=0): 104, ProcessCoord(pipe=3, data=2, model=1): 105, ProcessCoord(pipe=3, data=2, model=2): 106, ProcessCoord(pipe=3, data=2, model=3): 107, ProcessCoord(pipe=3, data=3, model=0): 108, ProcessCoord(pipe=3, data=3, model=1): 109, ProcessCoord(pipe=3, data=3, model=2): 110, ProcessCoord(pipe=3, data=3, model=3): 111, ProcessCoord(pipe=3, data=4, model=0): 112, ProcessCoord(pipe=3, data=4, model=1): 113, ProcessCoord(pipe=3, data=4, model=2): 114, ProcessCoord(pipe=3, data=4, model=3): 115, ProcessCoord(pipe=3, data=5, model=0): 116, ProcessCoord(pipe=3, data=5, model=1): 117, ProcessCoord(pipe=3, data=5, model=2): 118, ProcessCoord(pipe=3, data=5, model=3): 119, ProcessCoord(pipe=3, data=6, model=0): 120, ProcessCoord(pipe=3, data=6, model=1): 121, ProcessCoord(pipe=3, data=6, model=2): 122, ProcessCoord(pipe=3, data=6, model=3): 123, ProcessCoord(pipe=3, data=7, model=0): 124, ProcessCoord(pipe=3, data=7, model=1): 125, ProcessCoord(pipe=3, data=7, model=2): 126, ProcessCoord(pipe=3, data=7, model=3): 127, ProcessCoord(pipe=4, data=0, model=0): 128, ProcessCoord(pipe=4, data=0, model=1): 129, ProcessCoord(pipe=4, data=0, model=2): 130, ProcessCoord(pipe=4, data=0, model=3): 131, ProcessCoord(pipe=4, data=1, model=0): 132, ProcessCoord(pipe=4, data=1, model=1): 133, ProcessCoord(pipe=4, data=1, model=2): 134, ProcessCoord(pipe=4, data=1, model=3): 135, ProcessCoord(pipe=4, data=2, model=0): 136, ProcessCoord(pipe=4, data=2, model=1): 137, 
ProcessCoord(pipe=4, data=2, model=2): 138, ProcessCoord(pipe=4, data=2, model=3): 139, ProcessCoord(pipe=4, data=3, model=0): 140, ProcessCoord(pipe=4, data=3, model=1): 141, ProcessCoord(pipe=4, data=3, model=2): 142, ProcessCoord(pipe=4, data=3, model=3): 143, ProcessCoord(pipe=4, data=4, model=0): 144, ProcessCoord(pipe=4, data=4, model=1): 145, ProcessCoord(pipe=4, data=4, model=2): 146, ProcessCoord(pipe=4, data=4, model=3): 147, ProcessCoord(pipe=4, data=5, model=0): 148, ProcessCoord(pipe=4, data=5, model=1): 149, ProcessCoord(pipe=4, data=5, model=2): 150, ProcessCoord(pipe=4, data=5, model=3): 151, ProcessCoord(pipe=4, data=6, model=0): 152, ProcessCoord(pipe=4, data=6, model=1): 153, ProcessCoord(pipe=4, data=6, model=2): 154, ProcessCoord(pipe=4, data=6, model=3): 155, ProcessCoord(pipe=4, data=7, model=0): 156, ProcessCoord(pipe=4, data=7, model=1): 157, ProcessCoord(pipe=4, data=7, model=2): 158, ProcessCoord(pipe=4, data=7, model=3): 159, ProcessCoord(pipe=5, data=0, model=0): 160, ProcessCoord(pipe=5, data=0, model=1): 161, ProcessCoord(pipe=5, data=0, model=2): 162, ProcessCoord(pipe=5, data=0, model=3): 163, ProcessCoord(pipe=5, data=1, model=0): 164, ProcessCoord(pipe=5, data=1, model=1): 165, ProcessCoord(pipe=5, data=1, model=2): 166, ProcessCoord(pipe=5, data=1, model=3): 167, ProcessCoord(pipe=5, data=2, model=0): 168, ProcessCoord(pipe=5, data=2, model=1): 169, ProcessCoord(pipe=5, data=2, model=2): 170, ProcessCoord(pipe=5, data=2, model=3): 171, ProcessCoord(pipe=5, data=3, model=0): 172, ProcessCoord(pipe=5, data=3, model=1): 173, ProcessCoord(pipe=5, data=3, model=2): 174, ProcessCoord(pipe=5, data=3, model=3): 175, ProcessCoord(pipe=5, data=4, model=0): 176, ProcessCoord(pipe=5, data=4, model=1): 177, ProcessCoord(pipe=5, data=4, model=2): 178, ProcessCoord(pipe=5, data=4, model=3): 179, ProcessCoord(pipe=5, data=5, model=0): 180, ProcessCoord(pipe=5, data=5, model=1): 181, ProcessCoord(pipe=5, data=5, model=2): 182, 
ProcessCoord(pipe=5, data=5, model=3): 183, ProcessCoord(pipe=5, data=6, model=0): 184, ProcessCoord(pipe=5, data=6, model=1): 185, ProcessCoord(pipe=5, data=6, model=2): 186, ProcessCoord(pipe=5, data=6, model=3): 187, ProcessCoord(pipe=5, data=7, model=0): 188, ProcessCoord(pipe=5, data=7, model=1): 189, ProcessCoord(pipe=5, data=7, model=2): 190, ProcessCoord(pipe=5, data=7, model=3): 191, ProcessCoord(pipe=6, data=0, model=0): 192, ProcessCoord(pipe=6, data=0, model=1): 193, ProcessCoord(pipe=6, data=0, model=2): 194, ProcessCoord(pipe=6, data=0, model=3): 195, ProcessCoord(pipe=6, data=1, model=0): 196, ProcessCoord(pipe=6, data=1, model=1): 197, ProcessCoord(pipe=6, data=1, model=2): 198, ProcessCoord(pipe=6, data=1, model=3): 199, ProcessCoord(pipe=6, data=2, model=0): 200, ProcessCoord(pipe=6, data=2, model=1): 201, ProcessCoord(pipe=6, data=2, model=2): 202, ProcessCoord(pipe=6, data=2, model=3): 203, ProcessCoord(pipe=6, data=3, model=0): 204, ProcessCoord(pipe=6, data=3, model=1): 205, ProcessCoord(pipe=6, data=3, model=2): 206, ProcessCoord(pipe=6, data=3, model=3): 207, ProcessCoord(pipe=6, data=4, model=0): 208, ProcessCoord(pipe=6, data=4, model=1): 209, ProcessCoord(pipe=6, data=4, model=2): 210, ProcessCoord(pipe=6, data=4, model=3): 211, ProcessCoord(pipe=6, data=5, model=0): 212, ProcessCoord(pipe=6, data=5, model=1): 213, ProcessCoord(pipe=6, data=5, model=2): 214, ProcessCoord(pipe=6, data=5, model=3): 215, ProcessCoord(pipe=6, data=6, model=0): 216, ProcessCoord(pipe=6, data=6, model=1): 217, ProcessCoord(pipe=6, data=6, model=2): 218, ProcessCoord(pipe=6, data=6, model=3): 219, ProcessCoord(pipe=6, data=7, model=0): 220, ProcessCoord(pipe=6, data=7, model=1): 221, ProcessCoord(pipe=6, data=7, model=2): 222, ProcessCoord(pipe=6, data=7, model=3): 223, ProcessCoord(pipe=7, data=0, model=0): 224, ProcessCoord(pipe=7, data=0, model=1): 225, ProcessCoord(pipe=7, data=0, model=2): 226, ProcessCoord(pipe=7, data=0, model=3): 227, 
ProcessCoord(pipe=7, data=1, model=0): 228, ProcessCoord(pipe=7, data=1, model=1): 229, ProcessCoord(pipe=7, data=1, model=2): 230, ProcessCoord(pipe=7, data=1, model=3): 231, ProcessCoord(pipe=7, data=2, model=0): 232, ProcessCoord(pipe=7, data=2, model=1): 233, ProcessCoord(pipe=7, data=2, model=2): 234, ProcessCoord(pipe=7, data=2, model=3): 235, ProcessCoord(pipe=7, data=3, model=0): 236, ProcessCoord(pipe=7, data=3, model=1): 237, ProcessCoord(pipe=7, data=3, model=2): 238, ProcessCoord(pipe=7, data=3, model=3): 239, ProcessCoord(pipe=7, data=4, model=0): 240, ProcessCoord(pipe=7, data=4, model=1): 241, ProcessCoord(pipe=7, data=4, model=2): 242, ProcessCoord(pipe=7, data=4, model=3): 243, ProcessCoord(pipe=7, data=5, model=0): 244, ProcessCoord(pipe=7, data=5, model=1): 245, ProcessCoord(pipe=7, data=5, model=2): 246, ProcessCoord(pipe=7, data=5, model=3): 247, ProcessCoord(pipe=7, data=6, model=0): 248, ProcessCoord(pipe=7, data=6, model=1): 249, ProcessCoord(pipe=7, data=6, model=2): 250, ProcessCoord(pipe=7, data=6, model=3): 251, ProcessCoord(pipe=7, data=7, model=0): 252, ProcessCoord(pipe=7, data=7, model=1): 253, ProcessCoord(pipe=7, data=7, model=2): 254, ProcessCoord(pipe=7, data=7, model=3): 255, ProcessCoord(pipe=8, data=0, model=0): 256, ProcessCoord(pipe=8, data=0, model=1): 257, ProcessCoord(pipe=8, data=0, model=2): 258, ProcessCoord(pipe=8, data=0, model=3): 259, ProcessCoord(pipe=8, data=1, model=0): 260, ProcessCoord(pipe=8, data=1, model=1): 261, ProcessCoord(pipe=8, data=1, model=2): 262, ProcessCoord(pipe=8, data=1, model=3): 263, ProcessCoord(pipe=8, data=2, model=0): 264, ProcessCoord(pipe=8, data=2, model=1): 265, ProcessCoord(pipe=8, data=2, model=2): 266, ProcessCoord(pipe=8, data=2, model=3): 267, ProcessCoord(pipe=8, data=3, model=0): 268, ProcessCoord(pipe=8, data=3, model=1): 269, ProcessCoord(pipe=8, data=3, model=2): 270, ProcessCoord(pipe=8, data=3, model=3): 271, ProcessCoord(pipe=8, data=4, model=0): 272, 
ProcessCoord(pipe=8, data=4, model=1): 273, ProcessCoord(pipe=8, data=4, model=2): 274, ProcessCoord(pipe=8, data=4, model=3): 275, ProcessCoord(pipe=8, data=5, model=0): 276, ProcessCoord(pipe=8, data=5, model=1): 277, ProcessCoord(pipe=8, data=5, model=2): 278, ProcessCoord(pipe=8, data=5, model=3): 279, ProcessCoord(pipe=8, data=6, model=0): 280, ProcessCoord(pipe=8, data=6, model=1): 281, ProcessCoord(pipe=8, data=6, model=2): 282, ProcessCoord(pipe=8, data=6, model=3): 283, ProcessCoord(pipe=8, data=7, model=0): 284, ProcessCoord(pipe=8, data=7, model=1): 285, ProcessCoord(pipe=8, data=7, model=2): 286, ProcessCoord(pipe=8, data=7, model=3): 287, ProcessCoord(pipe=9, data=0, model=0): 288, ProcessCoord(pipe=9, data=0, model=1): 289, ProcessCoord(pipe=9, data=0, model=2): 290, ProcessCoord(pipe=9, data=0, model=3): 291, ProcessCoord(pipe=9, data=1, model=0): 292, ProcessCoord(pipe=9, data=1, model=1): 293, ProcessCoord(pipe=9, data=1, model=2): 294, ProcessCoord(pipe=9, data=1, model=3): 295, ProcessCoord(pipe=9, data=2, model=0): 296, ProcessCoord(pipe=9, data=2, model=1): 297, ProcessCoord(pipe=9, data=2, model=2): 298, ProcessCoord(pipe=9, data=2, model=3): 299, ProcessCoord(pipe=9, data=3, model=0): 300, ProcessCoord(pipe=9, data=3, model=1): 301, ProcessCoord(pipe=9, data=3, model=2): 302, ProcessCoord(pipe=9, data=3, model=3): 303, ProcessCoord(pipe=9, data=4, model=0): 304, ProcessCoord(pipe=9, data=4, model=1): 305, ProcessCoord(pipe=9, data=4, model=2): 306, ProcessCoord(pipe=9, data=4, model=3): 307, ProcessCoord(pipe=9, data=5, model=0): 308, ProcessCoord(pipe=9, data=5, model=1): 309, ProcessCoord(pipe=9, data=5, model=2): 310, ProcessCoord(pipe=9, data=5, model=3): 311, ProcessCoord(pipe=9, data=6, model=0): 312, ProcessCoord(pipe=9, data=6, model=1): 313, ProcessCoord(pipe=9, data=6, model=2): 314, ProcessCoord(pipe=9, data=6, model=3): 315, ProcessCoord(pipe=9, data=7, model=0): 316, ProcessCoord(pipe=9, data=7, model=1): 317, 
ProcessCoord(pipe=9, data=7, model=2): 318, ProcessCoord(pipe=9, data=7, model=3): 319, ProcessCoord(pipe=10, data=0, model=0): 320, ProcessCoord(pipe=10, data=0, model=1): 321, ProcessCoord(pipe=10, data=0, model=2): 322, ProcessCoord(pipe=10, data=0, model=3): 323, ProcessCoord(pipe=10, data=1, model=0): 324, ProcessCoord(pipe=10, data=1, model=1): 325, ProcessCoord(pipe=10, data=1, model=2): 326, ProcessCoord(pipe=10, data=1, model=3): 327, ProcessCoord(pipe=10, data=2, model=0): 328, ProcessCoord(pipe=10, data=2, model=1): 329, ProcessCoord(pipe=10, data=2, model=2): 330, ProcessCoord(pipe=10, data=2, model=3): 331, ProcessCoord(pipe=10, data=3, model=0): 332, ProcessCoord(pipe=10, data=3, model=1): 333, ProcessCoord(pipe=10, data=3, model=2): 334, ProcessCoord(pipe=10, data=3, model=3): 335, ProcessCoord(pipe=10, data=4, model=0): 336, ProcessCoord(pipe=10, data=4, model=1): 337, ProcessCoord(pipe=10, data=4, model=2): 338, ProcessCoord(pipe=10, data=4, model=3): 339, ProcessCoord(pipe=10, data=5, model=0): 340, ProcessCoord(pipe=10, data=5, model=1): 341, ProcessCoord(pipe=10, data=5, model=2): 342, ProcessCoord(pipe=10, data=5, model=3): 343, ProcessCoord(pipe=10, data=6, model=0): 344, ProcessCoord(pipe=10, data=6, model=1): 345, ProcessCoord(pipe=10, data=6, model=2): 346, ProcessCoord(pipe=10, data=6, model=3): 347, ProcessCoord(pipe=10, data=7, model=0): 348, ProcessCoord(pipe=10, data=7, model=1): 349, ProcessCoord(pipe=10, data=7, model=2): 350, ProcessCoord(pipe=10, data=7, model=3): 351, ProcessCoord(pipe=11, data=0, model=0): 352, ProcessCoord(pipe=11, data=0, model=1): 353, ProcessCoord(pipe=11, data=0, model=2): 354, ProcessCoord(pipe=11, data=0, model=3): 355, ProcessCoord(pipe=11, data=1, model=0): 356, ProcessCoord(pipe=11, data=1, model=1): 357, ProcessCoord(pipe=11, data=1, model=2): 358, ProcessCoord(pipe=11, data=1, model=3): 359, ProcessCoord(pipe=11, data=2, model=0): 360, ProcessCoord(pipe=11, data=2, model=1): 361, ProcessCoord(pipe=11, 
data=2, model=2): 362, ProcessCoord(pipe=11, data=2, model=3): 363, ProcessCoord(pipe=11, data=3, model=0): 364, ProcessCoord(pipe=11, data=3, model=1): 365, ProcessCoord(pipe=11, data=3, model=2): 366, ProcessCoord(pipe=11, data=3, model=3): 367, ProcessCoord(pipe=11, data=4, model=0): 368, ProcessCoord(pipe=11, data=4, model=1): 369, ProcessCoord(pipe=11, data=4, model=2): 370, ProcessCoord(pipe=11, data=4, model=3): 371, ProcessCoord(pipe=11, data=5, model=0): 372, ProcessCoord(pipe=11, data=5, model=1): 373, ProcessCoord(pipe=11, data=5, model=2): 374, ProcessCoord(pipe=11, data=5, model=3): 375, ProcessCoord(pipe=11, data=6, model=0): 376, ProcessCoord(pipe=11, data=6, model=1): 377, ProcessCoord(pipe=11, data=6, model=2): 378, ProcessCoord(pipe=11, data=6, model=3): 379, ProcessCoord(pipe=11, data=7, model=0): 380, ProcessCoord(pipe=11, data=7, model=1): 381, ProcessCoord(pipe=11, data=7, model=2): 382, ProcessCoord(pipe=11, data=7, model=3): 383}
[default0]:[2022-03-04 04:02:55,582] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method type:transformer|embedding
[default0]:stage=0 layers=8
[default0]:     0: _to_float16
[default0]:     1: EmbeddingPipe
[default0]:     2: <lambda>
[default0]:     3: ParallelTransformerLayerPipe
[default0]:     4: ParallelTransformerLayerPipe
[default0]:     5: ParallelTransformerLayerPipe
[default0]:     6: ParallelTransformerLayerPipe
[default0]:     7: ParallelTransformerLayerPipe
[default0]:stage=1 layers=6
[default0]:     8: ParallelTransformerLayerPipe
[default0]:     9: ParallelTransformerLayerPipe
[default0]:    10: ParallelTransformerLayerPipe
[default0]:    11: ParallelTransformerLayerPipe
[default0]:    12: ParallelTransformerLayerPipe
[default0]:    13: ParallelTransformerLayerPipe
[default0]:stage=2 layers=6
[default0]:    14: ParallelTransformerLayerPipe
[default0]:    15: ParallelTransformerLayerPipe
[default0]:    16: ParallelTransformerLayerPipe
[default0]:    17: ParallelTransformerLayerPipe
[default0]:    18: ParallelTransformerLayerPipe
[default0]:    19: ParallelTransformerLayerPipe
[default0]:stage=3 layers=6
[default0]:    20: ParallelTransformerLayerPipe
[default0]:    21: ParallelTransformerLayerPipe
[default0]:    22: ParallelTransformerLayerPipe
[default0]:    23: ParallelTransformerLayerPipe
[default0]:    24: ParallelTransformerLayerPipe
[default0]:    25: ParallelTransformerLayerPipe
[default0]:stage=4 layers=6
[default0]:    26: ParallelTransformerLayerPipe
[default0]:    27: ParallelTransformerLayerPipe
[default0]:    28: ParallelTransformerLayerPipe
[default0]:    29: ParallelTransformerLayerPipe
[default0]:    30: ParallelTransformerLayerPipe
[default0]:    31: ParallelTransformerLayerPipe
[default0]:stage=5 layers=6
[default0]:    32: ParallelTransformerLayerPipe
[default0]:    33: ParallelTransformerLayerPipe
[default0]:    34: ParallelTransformerLayerPipe
[default0]:    35: ParallelTransformerLayerPipe
[default0]:    36: ParallelTransformerLayerPipe
[default0]:    37: ParallelTransformerLayerPipe
[default0]:stage=6 layers=6
[default0]:    38: ParallelTransformerLayerPipe
[default0]:    39: ParallelTransformerLayerPipe
[default0]:    40: ParallelTransformerLayerPipe
[default0]:    41: ParallelTransformerLayerPipe
[default0]:    42: ParallelTransformerLayerPipe
[default0]:    43: ParallelTransformerLayerPipe
[default0]:stage=7 layers=6
[default0]:    44: ParallelTransformerLayerPipe
[default0]:    45: ParallelTransformerLayerPipe
[default0]:    46: ParallelTransformerLayerPipe
[default0]:    47: ParallelTransformerLayerPipe
[default0]:    48: ParallelTransformerLayerPipe
[default0]:    49: ParallelTransformerLayerPipe
[default0]:stage=8 layers=6
[default0]:    50: ParallelTransformerLayerPipe
[default0]:    51: ParallelTransformerLayerPipe
[default0]:    52: ParallelTransformerLayerPipe
[default0]:    53: ParallelTransformerLayerPipe
[default0]:    54: ParallelTransformerLayerPipe
[default0]:    55: ParallelTransformerLayerPipe
[default0]:stage=9 layers=6
[default0]:    56: ParallelTransformerLayerPipe
[default0]:    57: ParallelTransformerLayerPipe
[default0]:    58: ParallelTransformerLayerPipe
[default0]:    59: ParallelTransformerLayerPipe
[default0]:    60: ParallelTransformerLayerPipe
[default0]:    61: ParallelTransformerLayerPipe
[default0]:stage=10 layers=6
[default0]:    62: ParallelTransformerLayerPipe
[default0]:    63: ParallelTransformerLayerPipe
[default0]:    64: ParallelTransformerLayerPipe
[default0]:    65: ParallelTransformerLayerPipe
[default0]:    66: ParallelTransformerLayerPipe
[default0]:    67: ParallelTransformerLayerPipe
[default0]:stage=11 layers=9
[default0]:    68: ParallelTransformerLayerPipe
[default0]:    69: ParallelTransformerLayerPipe
[default0]:    70: ParallelTransformerLayerPipe
[default0]:    71: ParallelTransformerLayerPipe
[default0]:    72: ParallelTransformerLayerPipe
[default0]:    73: <lambda>
[default0]:    74: MixedFusedLayerNorm
[default0]:    75: EmbeddingPipe
[default0]:    76: float16_to_fp32
[default0]:  loss: CrossEntropy
[default0]:[2022-03-04 04:02:56,733] [INFO] [utils.py:828:see_memory_usage] After Building Model
[default0]:[2022-03-04 04:02:56,734] [INFO] [utils.py:829:see_memory_usage] MA 7.43 GB         Max_MA 7.43 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-04 04:02:56,734] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.65 GB, percent = 8.7%
[default0]:setting training iterations to 128728
[default0]:> learning rate decay style: cosine
[default0]:DeepSpeed is enabled.
[default0]:[2022-03-04 04:02:56,755] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.0+ed26ef4, git-hash=ed26ef4, git-branch=olruwase/bf16-updates
[default0]:[2022-03-04 04:02:58,559] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[default0]:[2022-03-04 04:02:58,559] [INFO] [engine.py:1092:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[default0]:[2022-03-04 04:02:58,559] [INFO] [engine.py:1098:_configure_optimizer] Using client Optimizer as basic optimizer
[default0]:[2022-03-04 04:02:58,560] [INFO] [engine.py:1114:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
[default0]:[2022-03-04 04:02:58,560] [INFO] [engine.py:1328:_configure_bf16_optimizer] Creating unfused BF16 optimizer
[default0]:[2022-03-04 04:02:58,619] [INFO] [utils.py:828:see_memory_usage] begin bf16_optimizer
[default0]:[2022-03-04 04:02:58,620] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB         Max_MA 7.43 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-04 04:02:58,620] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 44.0 GB, percent = 8.7%
[default0]:[2022-03-04 04:02:58,646] [INFO] [utils.py:828:see_memory_usage] before initializing group 0
[default0]:[2022-03-04 04:02:58,646] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB         Max_MA 7.42 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-04 04:02:58,646] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 44.0 GB, percent = 8.7%
[default0]:[2022-03-04 04:02:58,701] [INFO] [utils.py:828:see_memory_usage] after initializing group 0
[default0]:[2022-03-04 04:02:58,702] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB         Max_MA 17.01 GB         CA 20.23 GB         Max_CA 20 GB 
[default0]:[2022-03-04 04:02:58,702] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 44.0 GB, percent = 8.7%
[default0]:[2022-03-04 04:02:58,728] [INFO] [utils.py:828:see_memory_usage] before initializing group 1
[default0]:[2022-03-04 04:02:58,728] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB         Max_MA 17.01 GB         CA 20.23 GB         Max_CA 20 GB 
[default0]:[2022-03-04 04:02:58,729] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 44.0 GB, percent = 8.7%
[default0]:[2022-03-04 04:02:58,775] [INFO] [utils.py:828:see_memory_usage] after initializing group 1
[default0]:[2022-03-04 04:02:58,775] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB         Max_MA 24.11 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-04 04:02:58,775] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 44.0 GB, percent = 8.7%
[default0]:[2022-03-04 04:02:58,799] [INFO] [utils.py:828:see_memory_usage] before initializing group 2
[default0]:[2022-03-04 04:02:58,799] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB         Max_MA 24.11 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-04 04:02:58,799] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 44.0 GB, percent = 8.7%
[default0]:[2022-03-04 04:02:58,824] [INFO] [utils.py:828:see_memory_usage] after initializing group 2
[default0]:[2022-03-04 04:02:58,824] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB         Max_MA 24.12 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-04 04:02:58,824] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 44.0 GB, percent = 8.7%
[default0]:[2022-03-04 04:02:58,847] [INFO] [utils.py:828:see_memory_usage] before initialize_optimizer
[default0]:[2022-03-04 04:02:58,848] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB         Max_MA 24.12 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-04 04:02:58,848] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 44.0 GB, percent = 8.7%
[default0]:[2022-03-04 04:02:58,898] [INFO] [utils.py:828:see_memory_usage] end initialize_optimizer
[default0]:[2022-03-04 04:02:58,898] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB         Max_MA 27.82 GB         CA 34.21 GB         Max_CA 34 GB 
[default0]:[2022-03-04 04:02:58,898] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 44.0 GB, percent = 8.7%
[default0]:[2022-03-04 04:02:58,920] [INFO] [utils.py:828:see_memory_usage] end bf16_optimizer
[default0]:[2022-03-04 04:02:58,920] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB         Max_MA 27.82 GB         CA 34.21 GB         Max_CA 34 GB 
[default0]:[2022-03-04 04:02:58,920] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 44.0 GB, percent = 8.7%
[default0]:[2022-03-04 04:02:58,920] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[default0]:[2022-03-04 04:02:58,921] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[default0]:[2022-03-04 04:02:58,921] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x14b8b4aa15b0>
[default0]:[2022-03-04 04:02:58,921] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1057:print] DeepSpeedEngine configuration:
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   activation_checkpointing_config  {
[default0]:    "partition_activations": false, 
[default0]:    "contiguous_memory_optimization": false, 
[default0]:    "cpu_checkpointing": false, 
[default0]:    "number_checkpoints": null, 
[default0]:    "synchronize_checkpoint_boundary": false, 
[default0]:    "profile": false
[default0]:}
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   amp_enabled .................. False
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   amp_params ................... False
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   autotuning_config ............ {
[default0]:    "enabled": false, 
[default0]:    "start_step": null, 
[default0]:    "end_step": null, 
[default0]:    "metric_path": null, 
[default0]:    "arg_mappings": null, 
[default0]:    "metric": "throughput", 
[default0]:    "model_info": null, 
[default0]:    "results_dir": null, 
[default0]:    "exps_dir": null, 
[default0]:    "overwrite": true, 
[default0]:    "fast": true, 
[default0]:    "start_profile_step": 3, 
[default0]:    "end_profile_step": 5, 
[default0]:    "tuner_type": "gridsearch", 
[default0]:    "tuner_early_stopping": 5, 
[default0]:    "tuner_num_trials": 50, 
[default0]:    "model_info_path": null, 
[default0]:    "mp_size": 1, 
[default0]:    "max_train_batch_size": null, 
[default0]:    "min_train_batch_size": 1, 
[default0]:    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
[default0]:    "min_train_micro_batch_size_per_gpu": 1, 
[default0]:    "num_tuning_micro_batch_sizes": 3
[default0]:}
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   bfloat16_enabled ............. True
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   checkpoint_tag_validation_enabled  True
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   checkpoint_tag_validation_fail  False
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   communication_data_type ...... None
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   curriculum_enabled ........... False
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   curriculum_params ............ False
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   dataloader_drop_last ......... False
[default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print]   disable_allgather ............ False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   dump_state ................... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   dynamic_loss_scale_args ...... None
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   eigenvalue_enabled ........... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   eigenvalue_gas_boundary_resolution  1
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   eigenvalue_layer_name ........ bert.encoder.layer
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   eigenvalue_layer_num ......... 0
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   eigenvalue_max_iter .......... 100
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   eigenvalue_stability ......... 1e-06
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   eigenvalue_tol ............... 0.01
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   eigenvalue_verbose ........... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   elasticity_enabled ........... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   flops_profiler_config ........ {
[default0]:    "enabled": false, 
[default0]:    "profile_step": 1, 
[default0]:    "module_depth": -1, 
[default0]:    "top_modules": 1, 
[default0]:    "detailed": true, 
[default0]:    "output_file": null
[default0]:}
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   fp16_enabled ................. False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   fp16_master_weights_and_gradients  False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   fp16_mixed_quantize .......... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   global_rank .................. 0
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   gradient_accumulation_steps .. 128
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   gradient_clipping ............ 1.0
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   gradient_predivide_factor .... 1.0
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   initial_dynamic_scale ........ 1
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   loss_scale ................... 1.0
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   memory_breakdown ............. False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   optimizer_legacy_fusion ...... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   optimizer_name ............... None
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   optimizer_params ............. None
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   pld_enabled .................. False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   pld_params ................... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   prescale_gradients ........... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   quantize_change_rate ......... 0.001
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   quantize_groups .............. 1
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   quantize_offset .............. 1000
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   quantize_period .............. 1000
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   quantize_rounding ............ 0
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   quantize_start_bits .......... 16
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   quantize_target_bits ......... 8
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   quantize_training_enabled .... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   quantize_type ................ 0
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   quantize_verbose ............. False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   scheduler_name ............... None
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   scheduler_params ............. None
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   sparse_attention ............. None
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   sparse_gradients_enabled ..... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   steps_per_print .............. 2000
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   tensorboard_enabled .......... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   tensorboard_job_name ......... DeepSpeedJobName
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   tensorboard_output_path ...... 
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   train_batch_size ............. 2048
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   train_micro_batch_size_per_gpu  2
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   use_quantizer_kernel ......... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   wall_clock_breakdown ......... False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   world_size ................... 8
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   zero_allow_untested_optimizer  False
[default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print]   zero_config .................. {
[default0]:    "stage": 0, 
[default0]:    "contiguous_gradients": true, 
[default0]:    "reduce_scatter": true, 
[default0]:    "reduce_bucket_size": 5.000000e+08, 
[default0]:    "allgather_partitions": true, 
[default0]:    "allgather_bucket_size": 5.000000e+08, 
[default0]:    "overlap_comm": false, 
[default0]:    "load_from_fp32_weights": true, 
[default0]:    "elastic_checkpoint": false, 
[default0]:    "offload_param": null, 
[default0]:    "offload_optimizer": null, 
[default0]:    "sub_group_size": 1.000000e+09, 
[default0]:    "prefetch_bucket_size": 5.000000e+07, 
[default0]:    "param_persistence_threshold": 1.000000e+05, 
[default0]:    "max_live_parameters": 1.000000e+09, 
[default0]:    "max_reuse_distance": 1.000000e+09, 
[default0]:    "gather_16bit_weights_on_model_save": false, 
[default0]:    "ignore_unused_parameters": true, 
[default0]:    "round_robin_gradients": false, 
[default0]:    "legacy_stage1": false
[default0]:}
[default0]:[2022-03-04 04:02:58,923] [INFO] [config.py:1061:print]   zero_enabled ................. False
[default0]:[2022-03-04 04:02:58,923] [INFO] [config.py:1061:print]   zero_optimization_stage ...... 0
[default0]:[2022-03-04 04:02:58,923] [INFO] [config.py:1063:print]   json = {
[default0]:    "train_micro_batch_size_per_gpu": 2, 
[default0]:    "train_batch_size": 2.048000e+03, 
[default0]:    "gradient_clipping": 1.0, 
[default0]:    "zero_optimization": {
[default0]:        "stage": 0
[default0]:    }, 
[default0]:    "bf16": {
[default0]:        "enabled": true
[default0]:    }, 
[default0]:    "steps_per_print": 2.000000e+03, 
[default0]:    "wall_clock_breakdown": false
[default0]:}
[default0]:[2022-03-04 04:02:58,923] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=128 micro_batch_size=2
[default1]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=97 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=193 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=98 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=99 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=35 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=32 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=33 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=225 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=34 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=64 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=131 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=130 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=129 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=128 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=290 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=289 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=291 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=288 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=160 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=2 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=1 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=3 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=257 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=256 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=259 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=258 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=162 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=161 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=66 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=163 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=320 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=322 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=323 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=321 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=192 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=194 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=195 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=65 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=226 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=224 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=227 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=353 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=355 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=354 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=352 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=67 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=96 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]: > using checkpoint value 6e-05 for learning rate
[default0]: > using checkpoint value 6e-06 for minimum learning rate
[default0]: > using checkpoint value 183105 for warmup iterations
[default0]: > using checkpoint value 200000000 for total number of iterations
[default0]: > using checkpoint value cosine for decay style
[default4]:[2022-03-04 04:03:13,219] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 276
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:13,438] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 272
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:13,878] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 277
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:14,053] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 356
[default0]:[2022-03-04 04:03:14,141] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 352
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:14,487] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 279
[default1]:[2022-03-04 04:03:14,986] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 273
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:14,955] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 274
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:15,080] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 136
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:15,221] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 336
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:15,287] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 140
[default3]:[2022-03-04 04:03:15,253] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 275
[default0]:[2022-03-04 04:03:15,499] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 328
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:15,563] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 278
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247065 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247066 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247067 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247068 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247070 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246692 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246693 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246694 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246695 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247071 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246698 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246699 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247072 closing signal SIGTERM
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:15,830] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 340
[default2]:[2022-03-04 04:03:16,518] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 138
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:16,438] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 252
[default4]:[2022-03-04 04:03:16,479] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 4
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:16,688] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 200
[default2]:[2022-03-04 04:03:16,668] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 314
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:16,638] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 308
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:16,828] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 348
[default0]:[2022-03-04 04:03:16,969] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 312
[default0]:[2022-03-04 04:03:16,951] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 120
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:17,091] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 196
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:17,045] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 344
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:17,078] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 305
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:17,186] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 248
[default4]:[2022-03-04 04:03:17,207] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 156
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:17,298] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 184
[default4]:[2022-03-04 04:03:17,279] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 124
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:17,247] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 122
[default7]:[2022-03-04 04:03:17,291] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 343
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:17,323] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 339
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 246696) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default2]:[2022-03-04 04:03:17,384] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 250
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:17,445] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 36
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:17,573] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 141
[default7]:[2022-03-04 04:03:17,539] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 47
[default2]:[2022-03-04 04:03:17,531] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 170
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:17,530] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 152
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:17,627] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 176
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:17,701] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 168
[default5]:[2022-03-04 04:03:17,705] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 125
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:17,781] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 249
[default2]:[2022-03-04 04:03:17,748] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 306
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:17,877] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 331
[default0]:[2022-03-04 04:03:17,874] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 280
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:[2022-03-04 04:03:17,913] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 351
[default3]:[2022-03-04 04:03:17,922] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 315
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:17,872] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 253
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:17,849] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 121
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:17,986] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 324
[default4]:[2022-03-04 04:03:17,945] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 372
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:17,933] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 173
[default4]:[2022-03-04 04:03:18,048] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 300
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:18,119] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 345
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:18,115] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 304
[default0]:[2022-03-04 04:03:18,174] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 296
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 247069) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:18,202] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 160
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:18,150] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 164
[default0]:[2022-03-04 04:03:18,230] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 320
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:18,258] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 137
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:18,322] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 334
[default4]:[2022-03-04 04:03:18,389] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 236
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:18,375] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 244
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:18,327] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 332
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:18,364] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 319
[default5]:[2022-03-04 04:03:18,426] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 349
[default1]:[2022-03-04 04:03:18,465] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 329
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:18,475] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 347
[default7]:[2022-03-04 04:03:18,437] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 335
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:18,466] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 310
[default3]:[2022-03-04 04:03:18,460] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 307
[default3]:[2022-03-04 04:03:18,485] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 251
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:18,493] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 188
[default4]:[2022-03-04 04:03:18,522] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 172
[default3]:[2022-03-04 04:03:18,525] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 171
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:18,462] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 161
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:18,502] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 0
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:18,519] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 143
[default1]:[2022-03-04 04:03:18,582] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 241
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:18,596] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 142
[default4]:[2022-03-04 04:03:18,551] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 76
[default1]:[2022-03-04 04:03:18,568] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 313
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:18,567] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 255
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:18,630] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 338
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:18,626] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 322
[default1]:[2022-03-04 04:03:18,704] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 321
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:18,636] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 175
[default4]:[2022-03-04 04:03:18,708] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 108
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:18,629] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 201
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:18,669] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 368
[default3]:[2022-03-04 04:03:18,665] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 203
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:18,789] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 194
[default2]:[2022-03-04 04:03:18,784] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 330
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:18,794] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 285
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:18,759] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 174
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:18,902] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 32
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:18,891] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 72
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:18,867] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 205
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:18,860] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 204
[default0]:[2022-03-04 04:03:18,830] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 104
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:18,910] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 264
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:18,922] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 326
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:19,016] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 254
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:18,949] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 127
[default6]:[2022-03-04 04:03:19,054] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 318
[default0]:[2022-03-04 04:03:19,072] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 144
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:19,061] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 284
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:19,064] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 317
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:19,118] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 342
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:19,185] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 37
[default3]:[2022-03-04 04:03:19,131] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 139
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:19,148] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 191
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:[2022-03-04 04:03:19,186] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 341
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:19,317] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 178
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:[2022-03-04 04:03:19,305] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 208
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:[2022-03-04 04:03:19,263] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 40
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:19,234] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 24
[default1]:[2022-03-04 04:03:19,323] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 281
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:19,331] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 28
[default4]:[2022-03-04 04:03:19,425] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 44
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:19,408] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 202
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:19,401] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 126
[default5]:[2022-03-04 04:03:19,412] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 53
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:19,435] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 325
[default3]:[2022-03-04 04:03:19,418] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 323
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:19,466] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 192
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:19,427] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 198
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:19,445] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 350
[default1]:[2022-03-04 04:03:19,484] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 169
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:19,524] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 333
[default6]:[2022-03-04 04:03:19,507] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 158
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:19,435] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 206
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:19,593] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 327
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:19,538] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 346
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:19,611] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 207
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:19,571] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 189
[default5]:[2022-03-04 04:03:19,594] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 309
[default2]:[2022-03-04 04:03:19,591] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 154
[default4]:[2022-03-04 04:03:19,560] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 316
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:19,606] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 167
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:19,592] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 337
[default0]:[2022-03-04 04:03:19,714] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 360
[default3]:[2022-03-04 04:03:19,655] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 179
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:19,653] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 33
[default3]:[2022-03-04 04:03:19,702] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 283
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:19,678] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 187
[default7]:[2022-03-04 04:03:19,635] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 119
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:19,712] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 50
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:19,775] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 16
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:19,789] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 232
[default7]:[2022-03-04 04:03:19,757] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 303
[default0]:[2022-03-04 04:03:19,786] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 376
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:19,769] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 268
[default3]:[2022-03-04 04:03:19,798] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 123
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:19,891] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 214
[default7]:[2022-03-04 04:03:19,881] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 199
[default5]:[2022-03-04 04:03:19,875] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 245
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default3]:[2022-03-04 04:03:19,836] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 43
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:19,869] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 117
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:[2022-03-04 04:03:19,838] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 287
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:[2022-03-04 04:03:19,838] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 159
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:19,848] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 157
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:[2022-03-04 04:03:19,930] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 49
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:[2022-03-04 04:03:19,989] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 180
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:19,929] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 302
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:[2022-03-04 04:03:19,972] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 193
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default1]:[2022-03-04 04:03:19,965] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 41
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:[2022-03-04 04:03:19,939] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 38
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:19,966] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 155
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:[2022-03-04 04:03:19,967] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 373
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default2]:[2022-03-04 04:03:20,064] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 18
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:20,069] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 197
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default0]:[2022-03-04 04:03:20,082] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 240
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:[2022-03-04 04:03:20,085] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 20
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default5]:[2022-03-04 04:03:20,074] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 45
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:20,083] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 292
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:[2022-03-04 04:03:20,114] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 34
[default1]:    return f(*args, **kwargs)
[default5]:[2022-03-04 04:03:20,045] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 165
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:[2022-03-04 04:03:20,116] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 12
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:20,092] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 162
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:[2022-03-04 04:03:20,216] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 182
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:[2022-03-04 04:03:20,216] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 183
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default0]:[2022-03-04 04:03:20,309] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 80
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:[2022-03-04 04:03:20,307] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 21
[default6]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:20,326] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 114
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:[2022-03-04 04:03:20,314] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 380
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:[2022-03-04 04:03:20,264] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 186
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:[2022-03-04 04:03:20,252] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 10
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default7]:[2022-03-04 04:03:20,254] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 311
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:20,396] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 235
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:[2022-03-04 04:03:20,409] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 195
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:[2022-03-04 04:03:20,341] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 46
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:20,378] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 60
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:[2022-03-04 04:03:20,388] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 286
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default5]:[2022-03-04 04:03:20,346] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 109
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:20,375] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 105
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:[2022-03-04 04:03:20,372] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 260
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:[2022-03-04 04:03:20,341] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 163
[default1]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:20,418] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 54
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default1]:[2022-03-04 04:03:20,464] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 297
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:[2022-03-04 04:03:20,429] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 299
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default1]:[2022-03-04 04:03:20,450] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 177
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:20,471] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 84
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:[2022-03-04 04:03:20,472] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 116
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:[2022-03-04 04:03:20,488] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 98
[default5]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:20,531] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 153
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:[2022-03-04 04:03:20,504] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 223
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:[2022-03-04 04:03:20,525] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 211
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:20,574] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 39
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:[2022-03-04 04:03:20,562] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 42
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default1]:[2022-03-04 04:03:20,550] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 113
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:20,597] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 26
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:[2022-03-04 04:03:20,572] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 30
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:[2022-03-04 04:03:20,605] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 282
[default2]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:20,582] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 106
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:[2022-03-04 04:03:20,588] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 190
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default5]:[2022-03-04 04:03:20,585] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 301
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:[2022-03-04 04:03:20,559] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 166
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:20,628] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 55
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:20,636] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 181
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:20,674] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 73
[default3]:[2022-03-04 04:03:20,656] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 35
[default7]:[2022-03-04 04:03:20,703] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 111
[default1]:[2022-03-04 04:03:20,679] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 185
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:20,684] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 52
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254451 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254454 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245945 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245946 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287078 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287079 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287080 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245948 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287081 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287082 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245950 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287083 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287084 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245951 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245952 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245863 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245867 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245868 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253186 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253187 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253188 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253189 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253191 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253192 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253193 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248196 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248197 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248198 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248199 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248200 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248202 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248203 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230217 closing signal SIGTERM
[default2]:[2022-03-04 04:03:20,776] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 298
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259744 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249069 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259745 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249070 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259746 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242641 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259747 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249071 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242642 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259748 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259749 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242643 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249072 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259750 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242644 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242646 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249074 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249075 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249076 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242647 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242648 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231484 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231485 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231486 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231487 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231489 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231490 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231491 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 210933 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 210934 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229013 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229014 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229015 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229017 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229018 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229019 closing signal SIGTERM
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:20,729] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 375
[default3]:[2022-03-04 04:03:20,794] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 371
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:20,820] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 107
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 297992 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 297993 closing signal SIGTERM
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:20,805] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 110
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 297994 closing signal SIGTERM
[default6]:[2022-03-04 04:03:20,787] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 118
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 297996 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 297997 closing signal SIGTERM
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 297998 closing signal SIGTERM
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 257949 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 257952 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248427 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248431 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 108419 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255609 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255610 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255612 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255614 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255616 closing signal SIGTERM
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:20,858] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 364
[default5]:[2022-03-04 04:03:20,878] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 77
[default3]:[2022-03-04 04:03:20,878] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 19
[default7]:[2022-03-04 04:03:20,840] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 79
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 250780) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:20,899] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 270
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:20,883] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 263
[default0]:[2022-03-04 04:03:20,920] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 8
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:20,905] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 271
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 228356 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 228359 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 228361 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 228363 closing signal SIGTERM
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248343 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248347 closing signal SIGTERM
[default7]:[2022-03-04 04:03:20,959] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 239
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:20,977] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 233
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:21,002] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 213
[default2]:[2022-03-04 04:03:20,947] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 210
[default4]:[2022-03-04 04:03:20,943] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 148
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:20,961] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 31
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:20,974] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 25
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:20,959] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 17
[default3]:[2022-03-04 04:03:20,988] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 267
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:21,001] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 294
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:20,998] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 9
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:21,014] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 13
[default6]:[2022-03-04 04:03:20,982] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 382
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:20,971] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 220
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:20,999] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 48
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:21,087] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 237
[default1]:[2022-03-04 04:03:21,081] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 89
[default6]:[2022-03-04 04:03:21,090] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 22
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:21,084] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 62
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:21,103] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 215
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:21,070] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 81
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,109] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 75
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:21,100] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 112
[default4]:[2022-03-04 04:03:21,079] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 212
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,036] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 115
[default4]:[2022-03-04 04:03:21,081] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 92
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:21,079] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 15
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:21,116] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 295
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,060] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 51
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:21,179] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 102
[default1]:[2022-03-04 04:03:21,195] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 361
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:21,212] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 58
[default2]:[2022-03-04 04:03:21,141] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 82
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,209] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 27
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,139] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 219
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:21,203] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 383
[default2]:[2022-03-04 04:03:21,148] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 74
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,175] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 259
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:21,165] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 377
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,254] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 363
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:21,243] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 150
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 230214) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,264] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 11
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:21,317] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 269
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:21,322] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 93
[default0]:[2022-03-04 04:03:21,371] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 56
[default6]:[2022-03-04 04:03:21,324] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 366
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:21,347] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 23
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:21,359] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 63
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:21,431] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 265
[default0]:[2022-03-04 04:03:21,391] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 288
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:21,403] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 262
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,432] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 379
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:21,376] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 216
[default0]:[2022-03-04 04:03:21,479] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 88
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:21,492] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 57
[default2]:[2022-03-04 04:03:21,432] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 362
[default5]:[2022-03-04 04:03:21,437] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 365
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 254449) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default1]:[2022-03-04 04:03:21,519] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 209
[default2]:[2022-03-04 04:03:21,514] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 266
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 108415) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:21,511] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 293
[default6]:[2022-03-04 04:03:21,512] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 14
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:21,497] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 258
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:21,460] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 378
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:21,519] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 218
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:[2022-03-04 04:03:21,549] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 91
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:21,562] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 367
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:21,536] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 103
[default5]:[2022-03-04 04:03:21,546] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 61
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:21,562] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 85
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,542] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 147
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default4]:[2022-03-04 04:03:21,539] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 100
[default6]:[2022-03-04 04:03:21,559] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 78
[default6]:[2022-03-04 04:03:21,538] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 222
[default1]:[2022-03-04 04:03:21,633] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 217
[default5]:[2022-03-04 04:03:21,544] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 221
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 248426) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:21,642] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 238
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:21,644] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 87
[default7]:[2022-03-04 04:03:21,644] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 95
[default2]:[2022-03-04 04:03:21,708] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 234
[default3]:[2022-03-04 04:03:21,686] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 83
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:21,711] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 29
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 248345) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default6]:[2022-03-04 04:03:21,652] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 94
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:21,640] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 289
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,696] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 291
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:21,645] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 256
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:21,662] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 257
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:21,691] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 261
[default5]:[2022-03-04 04:03:21,755] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 381
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:21,782] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 96
[default1]:[2022-03-04 04:03:21,801] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 97
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:21,766] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 59
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:[2022-03-04 04:03:21,790] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 90
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:21,791] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 86
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:[2022-03-04 04:03:21,789] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 146
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:21,823] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 145
[default5]:[2022-03-04 04:03:21,778] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 149
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:21,760] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 290
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 210935) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 257946) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:21,888] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 101
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:21,907] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 64
[default3]:[2022-03-04 04:03:21,981] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 99
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:21,970] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 151
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 245861) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default4]:[2022-03-04 04:03:21,970] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 132
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
[default0]:[2022-03-04 04:03:22,033] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 224
[default2]:[2022-03-04 04:03:22,108] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 226
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 255613) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default4]:[2022-03-04 04:03:22,186] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 68
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default2]:[2022-03-04 04:03:22,220] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 66
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 228360) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:22,249] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 229
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:22,306] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 69
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
[default7]:[2022-03-04 04:03:22,256] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 231
[default4]:[2022-03-04 04:03:22,372] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 228
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:22,370] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 67
[default4]:Traceback (most recent call last):
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default4]:    main()
[default4]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default4]:    return f(*args, **kwargs)
[default4]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default4]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default4]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default4]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default4]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default4]:    success = self._load_zero_checkpoint(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default4]:    self.optimizer.load_state_dict(
[default4]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default4]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default4]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:22,380] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 230
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
[default6]:[2022-03-04 04:03:22,425] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 134
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default3]:[2022-03-04 04:03:22,424] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 227
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 229012) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default0]:[2022-03-04 04:03:22,474] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 128
[default0]:Traceback (most recent call last):
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default0]:    main()
[default0]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default0]:    return f(*args, **kwargs)
[default0]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default0]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default0]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default0]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default0]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default0]:    success = self._load_zero_checkpoint(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default0]:    self.optimizer.load_state_dict(
[default0]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default0]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default0]:KeyError: 'clip_grad'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 231488) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default6]:Traceback (most recent call last):
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default6]:    main()
[default6]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default6]:    return f(*args, **kwargs)
[default6]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default6]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default6]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default6]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default6]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default6]:    success = self._load_zero_checkpoint(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default6]:    self.optimizer.load_state_dict(
[default6]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default6]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default6]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:22,584] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 65
[default7]:[2022-03-04 04:03:22,576] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 71
[default6]:[2022-03-04 04:03:22,562] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 70
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 245949) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default7]:Traceback (most recent call last):
[default1]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:[2022-03-04 04:03:22,545] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 225
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 5 (pid: 248201) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default3]:[2022-03-04 04:03:22,559] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 131
[default3]:Traceback (most recent call last):
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default3]:    main()
[default3]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default3]:    return f(*args, **kwargs)
[default3]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default3]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default3]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default3]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default3]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default3]:    success = self._load_zero_checkpoint(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default3]:    self.optimizer.load_state_dict(
[default3]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default3]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default3]:KeyError: 'clip_grad'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 253190) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 259743) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default1]:Traceback (most recent call last):
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default1]:    main()
[default1]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default1]:    return f(*args, **kwargs)
[default1]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default1]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default1]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default1]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default1]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default1]:    success = self._load_zero_checkpoint(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default1]:    self.optimizer.load_state_dict(
[default1]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default1]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default1]:KeyError: 'clip_grad'
[default1]:[2022-03-04 04:03:22,696] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 129
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 297995) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default5]:Traceback (most recent call last):
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default5]:    main()
[default7]:[2022-03-04 04:03:22,820] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 135
[default5]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default5]:    return f(*args, **kwargs)
[default5]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default5]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default5]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default5]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default5]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default5]:    success = self._load_zero_checkpoint(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default5]:    self.optimizer.load_state_dict(
[default5]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default5]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default5]:KeyError: 'clip_grad'
[default5]:[2022-03-04 04:03:22,772] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 133
[default7]:Traceback (most recent call last):
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default7]:    main()
[default7]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default7]:    return f(*args, **kwargs)
[default7]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default7]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default7]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default7]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default7]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default7]:    success = self._load_zero_checkpoint(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default7]:    self.optimizer.load_state_dict(
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default7]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default7]:KeyError: 'clip_grad'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 7 (pid: 287085) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
[default2]:[2022-03-04 04:03:22,932] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 130
[default2]:Traceback (most recent call last):
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module>
[default2]:    main()
[default2]:  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[default2]:    return f(*args, **kwargs)
[default2]:  File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
[default2]:    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
[default2]:    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
[default2]:    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
[default2]:    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
[default2]:    success = self._load_zero_checkpoint(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
[default2]:    self.optimizer.load_state_dict(
[default2]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
[default2]:    self.clip_grad = current_rank_sd[CLIP_GRAD]
[default2]:KeyError: 'clip_grad'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 242645) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 249073) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253346 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253347 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253349 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253350 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251307 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251308 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251311 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254537 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253463 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 221874 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256020 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253488 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 129885 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89534 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89535 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89537 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89538 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 291330 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231555 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 247801) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 90537) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 250175) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 264226) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 252588) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 233472) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227344 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 270064) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 77365) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 247683) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 255667) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 242925) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 291331) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 227343) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 256186) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'jean-zay-iam45-ib0_246954_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousTimeoutError.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'jean-zay-iam35-ib0_246581_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousTimeoutError.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 129889) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 5 (pid: 221879) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 231559) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 253467) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 254538) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 247377) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 253492) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 256022) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 251309) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 253348) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
Fatal Python error: Segmentation fault

Current thread 0x0000145ce13c5700 (most recent call first):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/linecache.py", line 74 in checkcache
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 783 in findsource
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1477 in getframeinfo
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1503 in getouterframes
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1526 in stack
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 42 in get_method_name
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 585 in _record
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 619 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1143 in _keep_alive
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1133 in _keep_alive_weak
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 255 in _run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 870 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x0000145dea52e600 (most recent call first):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 877 in _invoke_run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131 in __call__
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724 in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728 in <module>
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87 in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194 in _run_module_as_main
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 89536) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
Fatal Python error: Segmentation fault

Current thread 0x000014630de05700 (most recent call first):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/genericpath.py", line 19 in exists
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 705 in getsourcefile
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1473 in getframeinfo
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1503 in getouterframes
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1526 in stack
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 42 in get_method_name
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 585 in _record
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 667 in _keep_alive
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 645 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1143 in _keep_alive
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1133 in _keep_alive_weak
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 255 in _run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 870 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x0000146416f6e600 (most recent call first):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier
Fatal Python error: Segmentation fault

Current thread 0x0000151b9b765700 (most recent call first):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/genericpath.py", line 19 in exists
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 705 in getsourcefile
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1473 in getframeinfo
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1503 in getouterframes
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1526 in stack
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 42 in get_method_name
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 585 in _record
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 667 in _keep_alive
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 645 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1143 in _keep_alive
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1133 in _keep_alive_weak
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 255 in _run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 870 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x0000151ca48ce600 (most recent call first):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    exec(code, run_globals)
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    exec(code, run_globals)
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    return _run_code(code, main_globals, None,
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    exec(code, run_globals)
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    return _run_code(code, main_globals, None,
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    exec(code, run_globals)
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    main()
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    main()
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    elastic_launch(
    main()
    exec(code, run_globals)
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    exec(code, run_globals)
    exec(code, run_globals)
    return f(*args, **kwargs)
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    return f(*args, **kwargs)
    return _run_code(code, main_globals, None,
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    exec(code, run_globals)
    return f(*args, **kwargs)
    return f(*args, **kwargs)
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    run(args)
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    raise ChildFailedError(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    run(args)
    main()
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam10-ib0
  rank      : 73 (local_rank: 1)
  exitcode  : 1 (pid: 255668)
  error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
    main()
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    return _run_code(code, main_globals, None,
    run(args)
    main()
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    run(args)
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    return f(*args, **kwargs)
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam10-ib0
  rank      : 74 (local_rank: 2)
  exitcode  : 1 (pid: 255669)
  error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam10-ib0
  rank      : 75 (local_rank: 3)
  exitcode  : 1 (pid: 255670)
  error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
    run(args)
    exec(code, run_globals)
    return f(*args, **kwargs)
    run(args)
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    return launch_agent(self._config, self._entrypoint, list(args))
    return f(*args, **kwargs)
    return _run_code(code, main_globals, None,
[4]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam10-ib0
  rank      : 76 (local_rank: 4)
  exitcode  : 1 (pid: 255671)
  error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam10-ib0
  rank      : 77 (local_rank: 5)
  exitcode  : 1 (pid: 255672)
  error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    elastic_launch(
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam10-ib0
  rank      : 78 (local_rank: 6)
  exitcode  : 1 (pid: 255673)
    elastic_launch(
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
  error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    elastic_launch(
    return _run_code(code, main_globals, None,
    exec(code, run_globals)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam10-ib0
  rank      : 79 (local_rank: 7)
  exitcode  : 1 (pid: 255674)
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    raise ChildFailedError(
    run(args)
    raise ChildFailedError(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    run(args)
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
  error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
    main()
    return launch_agent(self._config, self._entrypoint, list(args))
    run(args)
    return launch_agent(self._config, self._entrypoint, list(args))
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:13
  host      : jean-zay-iam35-ib0
  rank      : 277 (local_rank: 5)
  exitcode  : 1 (pid: 246697)
  error_file: /tmp/torchelastic_73ef0in3/none_rjpdp1c0/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    elastic_launch(
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam29-ib0
  rank      : 227 (local_rank: 3)
  exitcode  : 1 (pid: 251310)
  error_file: /tmp/torchelastic_pbe9bxkf/none_i9gj4lo6/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    raise ChildFailedError(
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam10-ib0
  rank      : 72 (local_rank: 0)
  exitcode  : 1 (pid: 255667)
    return launch_agent(self._config, self._entrypoint, list(args))
    return launch_agent(self._config, self._entrypoint, list(args))
    raise ChildFailedError(
    return f(*args, **kwargs)
    raise ChildFailedError(
    raise ChildFailedError(
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:13
  host      : jean-zay-iam35-ib0
  rank      : 276 (local_rank: 4)
  exitcode  : 1 (pid: 246696)
  error_file: /tmp/torchelastic_73ef0in3/none_rjpdp1c0/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam29-ib0
  rank      : 229 (local_rank: 5)
  exitcode  : 1 (pid: 251312)
  error_file: /tmp/torchelastic_pbe9bxkf/none_i9gj4lo6/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    return _run_code(code, main_globals, None,
    run(args)
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam12-ib0
  rank      : 90 (local_rank: 2)
  exitcode  : 1 (pid: 254539)
  error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    run(args)
  error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam04-ib0
  rank      : 25 (local_rank: 1)
  exitcode  : 1 (pid: 252589)
  error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam19-ib0
  rank      : 146 (local_rank: 2)
  exitcode  : 1 (pid: 227345)
  error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam13-ib0
  rank      : 97 (local_rank: 1)
  exitcode  : 1 (pid: 256021)
  error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    raise ChildFailedError(
    raise ChildFailedError(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    elastic_launch(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    raise ChildFailedError(
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    elastic_launch(
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    exec(code, run_globals)
    main()
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    elastic_launch(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam12-ib0
  rank      : 91 (local_rank: 3)
  exitcode  : 1 (pid: 254540)
  error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam23-ib0
  rank      : 177 (local_rank: 1)
  exitcode  : 1 (pid: 90538)
  error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam24-ib0
  rank      : 184 (local_rank: 0)
  exitcode  : 1 (pid: 259743)
  error_file: /tmp/torchelastic_iy55snta/none_9rp61zrm/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam04-ib0
  rank      : 26 (local_rank: 2)
  exitcode  : 1 (pid: 252590)
  error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam19-ib0
  rank      : 147 (local_rank: 3)
  exitcode  : 1 (pid: 227346)
  error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
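    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(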
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam13-ib0
  rank      : 99 (local_rank: 3)
  exitcode  : 1 (pid: 256023)
  error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
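    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  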
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam27-ib0
  rank      : 209 (local_rank: 1)
  exitcode  : 1 (pid: 233473)
  error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
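    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict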
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam29-ib0
  rank      : 230 (local_rank: 6)
  exitcode  : 1 (pid: 251313)
  error_file: /tmp/torchelastic_pbe9bxkf/none_i9gj4lo6/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam23-ib0
  rank      : 178 (local_rank: 2)
  exitcode  : 1 (pid: 90539)
  error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
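    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(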
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
[2]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam27-ib0
  rank      : 210 (local_rank: 2)
  exitcode  : 1 (pid: 233474)
  error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
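      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict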
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam12-ib0
  rank      : 92 (local_rank: 4)
  exitcode  : 1 (pid: 254541)
  error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam04-ib0
  rank      : 27 (local_rank: 3)
  exitcode  : 1 (pid: 252591)
  error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
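      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict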
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam19-ib0
  rank      : 148 (local_rank: 4)
  exitcode  : 1 (pid: 227347)
  error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
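    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict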
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam13-ib0
  rank      : 100 (local_rank: 4)
  exitcode  : 1 (pid: 256024)
  error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam09-ib0
  rank      : 69 (local_rank: 5)
  exitcode  : 1 (pid: 253351)
  error_file: /tmp/torchelastic_jxwxxlos/none_0y5r9d70/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
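      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)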
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam03-ib0
  rank      : 17 (local_rank: 1)
  exitcode  : 1 (pid: 264227)
  error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
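    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict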
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam12-ib0
  rank      : 93 (local_rank: 5)
  exitcode  : 1 (pid: 254542)
  error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
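      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict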
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam23-ib0
  rank      : 179 (local_rank: 3)
  exitcode  : 1 (pid: 90540)
  error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam04-ib0
  rank      : 28 (local_rank: 4)
  exitcode  : 1 (pid: 252592)
  error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam19-ib0
  rank      : 149 (local_rank: 5)
  exitcode  : 1 (pid: 227348)
  error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam13-ib0
  rank      : 101 (local_rank: 5)
  exitcode  : 1 (pid: 256025)
  error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam27-ib0
  rank      : 211 (local_rank: 3)
  exitcode  : 1 (pid: 233475)
  error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam37-ib0
  rank      : 289 (local_rank: 1)
  exitcode  : 1 (pid: 231556)
  error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    raise ChildFailedError(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam28-ib0
  rank      : 217 (local_rank: 1)
  exitcode  : 1 (pid: 129886)
  error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam09-ib0
  rank      : 70 (local_rank: 6)
  exitcode  : 1 (pid: 253352)
  error_file: /tmp/torchelastic_jxwxxlos/none_0y5r9d70/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam03-ib0
  rank      : 18 (local_rank: 2)
  exitcode  : 1 (pid: 264228)
  error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam26-ib0
  rank      : 201 (local_rank: 1)
  exitcode  : 1 (pid: 245862)
  error_file: /tmp/torchelastic_btjb0pc4/none_te9ioczi/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam12-ib0
  rank      : 94 (local_rank: 6)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam23-ib0
  rank      : 180 (local_rank: 4)
  exitcode  : 1 (pid: 90541)
  error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam04-ib0
  rank      : 29 (local_rank: 5)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam19-ib0
  rank      : 150 (local_rank: 6)
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam13-ib0
  rank      : 102 (local_rank: 6)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam27-ib0
  rank      : 212 (local_rank: 4)
  exitcode  : 1 (pid: 233476)
  error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[1]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam37-ib0
  rank      : 290 (local_rank: 2)
  exitcode  : 1 (pid: 231557)
  error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[1]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam28-ib0
  rank      : 218 (local_rank: 2)
  exitcode  : 1 (pid: 129887)
  error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
  exitcode  : 1 (pid: 254543)
  error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    run(args)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  exitcode  : 1 (pid: 252593)
  error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    elastic_launch(
  exitcode  : 1 (pid: 227349)
  error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  exitcode  : 1 (pid: 256026)
  error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam26-ib0
  rank      : 203 (local_rank: 3)
  exitcode  : 1 (pid: 245864)
  error_file: /tmp/torchelastic_btjb0pc4/none_te9ioczi/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    run(args)
    raise ChildFailedError(
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam12-ib0
  rank      : 95 (local_rank: 7)
  exitcode  : 1 (pid: 254544)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam23-ib0
  rank      : 181 (local_rank: 5)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam04-ib0
  rank      : 30 (local_rank: 6)
  exitcode  : 1 (pid: 252594)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam19-ib0
  rank      : 151 (local_rank: 7)
  exitcode  : 1 (pid: 227350)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam09-ib0
  rank      : 71 (local_rank: 7)
  exitcode  : 1 (pid: 253353)
  error_file: /tmp/torchelastic_jxwxxlos/none_0y5r9d70/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam13-ib0
  rank      : 103 (local_rank: 7)
  exitcode  : 1 (pid: 256027)
    return f(*args, **kwargs)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam27-ib0
  rank      : 213 (local_rank: 5)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam03-ib0
  rank      : 19 (local_rank: 3)
  exitcode  : 1 (pid: 264229)
  error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  exitcode  : 1 (pid: 90542)
  error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  exitcode  : 1 (pid: 233477)
  error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam37-ib0
  rank      : 291 (local_rank: 3)
  exitcode  : 1 (pid: 231558)
  error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam28-ib0
  rank      : 219 (local_rank: 3)
  exitcode  : 1 (pid: 129888)
  error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam23-ib0
  rank      : 182 (local_rank: 6)
  exitcode  : 1 (pid: 90543)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam04-ib0
  rank      : 31 (local_rank: 7)
  exitcode  : 1 (pid: 252595)
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:19
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam09-ib0
  rank      : 66 (local_rank: 2)
  exitcode  : 1 (pid: 253348)
  error_file: /tmp/torchelastic_jxwxxlos/none_0y5r9d70/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[1]:
  time      : 2022-03-04_04:03:20
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam27-ib0
  rank      : 214 (local_rank: 6)
  exitcode  : 1 (pid: 233478)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam03-ib0
  rank      : 20 (local_rank: 4)
  exitcode  : 1 (pid: 264230)
  error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam26-ib0
  rank      : 204 (local_rank: 4)
  exitcode  : 1 (pid: 245865)
  error_file: /tmp/torchelastic_btjb0pc4/none_te9ioczi/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
  host      : jean-zay-iam19-ib0
  rank      : 144 (local_rank: 0)
  exitcode  : 1 (pid: 227343)
  error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/0/error.json
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
  host      : jean-zay-iam13-ib0
  rank      : 98 (local_rank: 2)
  exitcode  : 1 (pid: 256022)
  error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/2/error.json
  error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam37-ib0
  rank      : 293 (local_rank: 5)
  exitcode  : 1 (pid: 231560)
  error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam30-ib0
  rank      : 232 (local_rank: 0)
  exitcode  : 1 (pid: 250171)
  error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam28-ib0
  rank      : 221 (local_rank: 5)
  exitcode  : 1 (pid: 129890)
  error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam23-ib0
  rank      : 183 (local_rank: 7)
  exitcode  : 1 (pid: 90544)
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam27-ib0
  rank      : 215 (local_rank: 7)
  exitcode  : 1 (pid: 233479)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam03-ib0
  rank      : 21 (local_rank: 5)
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam26-ib0
  rank      : 205 (local_rank: 5)
  exitcode  : 1 (pid: 245866)
  error_file: /tmp/torchelastic_btjb0pc4/none_te9ioczi/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam04-ib0
  rank      : 24 (local_rank: 0)
  exitcode  : 1 (pid: 252588)
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
============================================================
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam37-ib0
  rank      : 294 (local_rank: 6)
    run(args)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[1]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam30-ib0
  rank      : 233 (local_rank: 1)
  exitcode  : 1 (pid: 250172)
  error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    return f(*args, **kwargs)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam28-ib0
  rank      : 222 (local_rank: 6)
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
  error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    raise ChildFailedError(
  exitcode  : 1 (pid: 231561)
  error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  exitcode  : 1 (pid: 129891)
  error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam23-ib0
  rank      : 176 (local_rank: 0)
  exitcode  : 1 (pid: 90537)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam37-ib0
  rank      : 295 (local_rank: 7)
  exitcode  : 1 (pid: 231562)
  time      : 2022-03-04_04:03:16
  host      : jean-zay-iam26-ib0
  rank      : 200 (local_rank: 0)
  exitcode  : 1 (pid: 245861)
  error_file: /tmp/torchelastic_btjb0pc4/none_te9ioczi/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam28-ib0
  rank      : 223 (local_rank: 7)
  exitcode  : 1 (pid: 129892)
  error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam30-ib0
  rank      : 234 (local_rank: 2)
  exitcode  : 1 (pid: 250173)
  error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    elastic_launch(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam34-ib0
  rank      : 265 (local_rank: 1)
  exitcode  : 1 (pid: 247684)
  error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    main()
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[3]:
  time      : 2022-03-04_04:03:20
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    elastic_launch(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[3]:
  time      : 2022-03-04_04:03:20
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
  host      : jean-zay-iam37-ib0
  rank      : 292 (local_rank: 4)
  exitcode  : 1 (pid: 231559)
  error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/4/error.json
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam30-ib0
  rank      : 235 (local_rank: 3)
  exitcode  : 1 (pid: 250174)
  error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  host      : jean-zay-iam28-ib0
  rank      : 220 (local_rank: 4)
  exitcode  : 1 (pid: 129889)
  error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/4/error.json
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam34-ib0
  rank      : 266 (local_rank: 2)
  exitcode  : 1 (pid: 247685)
  error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam30-ib0
  rank      : 237 (local_rank: 5)
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  exitcode  : 1 (pid: 250176)
  error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam34-ib0
  rank      : 267 (local_rank: 3)
  exitcode  : 1 (pid: 247686)
  error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    raise ChildFailedError(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    elastic_launch(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam34-ib0
  rank      : 268 (local_rank: 4)
  exitcode  : 1 (pid: 247687)
  error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    run(args)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam34-ib0
  rank      : 269 (local_rank: 5)
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam33-ib0
  rank      : 256 (local_rank: 0)
  exitcode  : 1 (pid: 247373)
  error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  exitcode  : 1 (pid: 247688)
  error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return f(*args, **kwargs)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    exec(code, run_globals)
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam34-ib0
  rank      : 270 (local_rank: 6)
  exitcode  : 1 (pid: 247689)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[1]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam33-ib0
  rank      : 257 (local_rank: 1)
  exitcode  : 1 (pid: 247374)
  error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    raise ChildFailedError(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
  error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    elastic_launch(
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam02-ib0
  rank      : 9 (local_rank: 1)
  exitcode  : 1 (pid: 256187)
  error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam34-ib0
  rank      : 271 (local_rank: 7)
  exitcode  : 1 (pid: 247690)
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam33-ib0
  rank      : 258 (local_rank: 2)
  exitcode  : 1 (pid: 247375)
  error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam02-ib0
  rank      : 10 (local_rank: 2)
  exitcode  : 1 (pid: 256188)
  error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    raise ChildFailedError(
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam34-ib0
  rank      : 264 (local_rank: 0)
  exitcode  : 1 (pid: 247683)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam33-ib0
  rank      : 259 (local_rank: 3)
  exitcode  : 1 (pid: 247376)
  error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam02-ib0
  rank      : 11 (local_rank: 3)
  exitcode  : 1 (pid: 256189)
  error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam33-ib0
  rank      : 261 (local_rank: 5)
    elastic_launch(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    exec(code, run_globals)
  exitcode  : 1 (pid: 247378)
  error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam02-ib0
  rank      : 12 (local_rank: 4)
  exitcode  : 1 (pid: 256190)
  error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam11-ib0
  rank      : 81 (local_rank: 1)
  exitcode  : 1 (pid: 253489)
  error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam33-ib0
  rank      : 262 (local_rank: 6)
  exitcode  : 1 (pid: 247379)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    main()
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam02-ib0
  rank      : 13 (local_rank: 5)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[1]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam11-ib0
  rank      : 82 (local_rank: 2)
  exitcode  : 1 (pid: 253490)
  error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam33-ib0
  rank      : 263 (local_rank: 7)
  exitcode  : 1 (pid: 247380)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
  exitcode  : 1 (pid: 256191)
  error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam02-ib0
  rank      : 14 (local_rank: 6)
  exitcode  : 1 (pid: 256192)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
  error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam11-ib0
  rank      : 83 (local_rank: 3)
  exitcode  : 1 (pid: 253491)
  error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[4]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam33-ib0
  rank      : 260 (local_rank: 4)
  exitcode  : 1 (pid: 247377)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam02-ib0
  rank      : 15 (local_rank: 7)
  exitcode  : 1 (pid: 256193)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam11-ib0
  rank      : 85 (local_rank: 5)
  exitcode  : 1 (pid: 253493)
  error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    return f(*args, **kwargs)
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam02-ib0
  rank      : 8 (local_rank: 0)
  exitcode  : 1 (pid: 256186)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam11-ib0
  rank      : 86 (local_rank: 6)
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  exitcode  : 1 (pid: 253494)
  error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam11-ib0
  rank      : 87 (local_rank: 7)
  exitcode  : 1 (pid: 253495)
  error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[3]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam11-ib0
  rank      : 84 (local_rank: 4)
  exitcode  : 1 (pid: 253492)
  error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    elastic_launch(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam46-ib0
  rank      : 361 (local_rank: 1)
  exitcode  : 1 (pid: 247802)
  error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam46-ib0
  rank      : 362 (local_rank: 2)
  exitcode  : 1 (pid: 247803)
  error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    return launch_agent(self._config, self._entrypoint, list(args))
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
    return launch_agent(self._config, self._entrypoint, list(args))
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam46-ib0
  rank      : 363 (local_rank: 3)
  exitcode  : 1 (pid: 247804)
  error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    raise ChildFailedError(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam46-ib0
  rank      : 364 (local_rank: 4)
  exitcode  : 1 (pid: 247805)
  error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam15-ib0
  rank      : 113 (local_rank: 1)
  exitcode  : 1 (pid: 221875)
  error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    raise ChildFailedError(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    raise ChildFailedError(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam46-ib0
  rank      : 365 (local_rank: 5)
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam17-ib0
  rank      : 133 (local_rank: 5)
  exitcode  : 1 (pid: 89539)
  error_file: /tmp/torchelastic_9wudqz0i/none_6bors0rf/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[1]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam15-ib0
  rank      : 114 (local_rank: 2)
  exitcode  : 1 (pid: 221876)
  error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
  exitcode  : 1 (pid: 247806)
  error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    elastic_launch(
    return launch_agent(self._config, self._entrypoint, list(args))
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam42-ib0
  rank      : 329 (local_rank: 1)
  exitcode  : 1 (pid: 254450)
  error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam08-ib0
  rank      : 57 (local_rank: 1)
  exitcode  : 1 (pid: 253464)
  error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam46-ib0
  rank      : 366 (local_rank: 6)
  exitcode  : 1 (pid: 247807)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam17-ib0
  rank      : 134 (local_rank: 6)
  exitcode  : 1 (pid: 89540)
  error_file: /tmp/torchelastic_9wudqz0i/none_6bors0rf/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
  error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam42-ib0
  rank      : 331 (local_rank: 3)
  exitcode  : 1 (pid: 254452)
  error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam15-ib0
  rank      : 115 (local_rank: 3)
  exitcode  : 1 (pid: 221877)
  error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[1]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam08-ib0
  rank      : 58 (local_rank: 2)
  exitcode  : 1 (pid: 253465)
  error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam46-ib0
  rank      : 367 (local_rank: 7)
  exitcode  : 1 (pid: 247808)
    raise ChildFailedError(
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    return launch_agent(self._config, self._entrypoint, list(args))
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam17-ib0
  rank      : 135 (local_rank: 7)
  exitcode  : 1 (pid: 89541)
  error_file: /tmp/torchelastic_9wudqz0i/none_6bors0rf/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam15-ib0
  rank      : 116 (local_rank: 4)
  exitcode  : 1 (pid: 221878)
  error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    raise ChildFailedError(
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam42-ib0
  rank      : 332 (local_rank: 4)
  exitcode  : 1 (pid: 254453)
  error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam08-ib0
  rank      : 59 (local_rank: 3)
  exitcode  : 1 (pid: 253466)
  error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam46-ib0
  rank      : 360 (local_rank: 0)
  exitcode  : 1 (pid: 247801)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    exec(code, run_globals)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam17-ib0
  rank      : 130 (local_rank: 2)
  exitcode  : 1 (pid: 89536)
  error_file: /tmp/torchelastic_9wudqz0i/none_6bors0rf/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam15-ib0
  rank      : 118 (local_rank: 6)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam14-ib0
  rank      : 105 (local_rank: 1)
  exitcode  : 1 (pid: 270065)
  error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam42-ib0
  rank      : 334 (local_rank: 6)
  exitcode  : 1 (pid: 254455)
  error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
  exitcode  : 1 (pid: 221880)
  error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam08-ib0
  rank      : 61 (local_rank: 5)
  exitcode  : 1 (pid: 253468)
  error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam15-ib0
  rank      : 119 (local_rank: 7)
  exitcode  : 1 (pid: 221881)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam07-ib0
  rank      : 50 (local_rank: 2)
  exitcode  : 1 (pid: 291332)
  error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam42-ib0
  rank      : 335 (local_rank: 7)
============================================================
  error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam08-ib0
  rank      : 62 (local_rank: 6)
    raise ChildFailedError(
  exitcode  : 1 (pid: 254456)
  error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[4]:
  time      : 2022-03-04_04:03:19
  exitcode  : 1 (pid: 253469)
  error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam14-ib0
  rank      : 106 (local_rank: 2)
  exitcode  : 1 (pid: 270066)
  error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:15
  host      : jean-zay-iam15-ib0
  rank      : 117 (local_rank: 5)
  exitcode  : 1 (pid: 221879)
  error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/5/error.json
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam08-ib0
  rank      : 63 (local_rank: 7)
  exitcode  : 1 (pid: 253470)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam07-ib0
  rank      : 51 (local_rank: 3)
  exitcode  : 1 (pid: 291333)
  error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
  host      : jean-zay-iam42-ib0
  rank      : 328 (local_rank: 0)
  exitcode  : 1 (pid: 254449)
  error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
  error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[3]:
  time      : 2022-03-04_04:03:20
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam14-ib0
  rank      : 107 (local_rank: 3)
  exitcode  : 1 (pid: 270067)
  error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  host      : jean-zay-iam08-ib0
  rank      : 60 (local_rank: 4)
  exitcode  : 1 (pid: 253467)
  error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/4/error.json
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam38-ib0
  rank      : 297 (local_rank: 1)
  exitcode  : 1 (pid: 77366)
  error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam07-ib0
  rank      : 52 (local_rank: 4)
  exitcode  : 1 (pid: 291334)
  error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam14-ib0
  rank      : 108 (local_rank: 4)
  exitcode  : 1 (pid: 270068)
  error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam07-ib0
  rank      : 53 (local_rank: 5)
  exitcode  : 1 (pid: 291335)
  error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam14-ib0
  rank      : 109 (local_rank: 5)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  exitcode  : 1 (pid: 270069)
  error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam07-ib0
  rank      : 54 (local_rank: 6)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam38-ib0
  rank      : 298 (local_rank: 2)
  exitcode  : 1 (pid: 77367)
  error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam14-ib0
  rank      : 110 (local_rank: 6)
  exitcode  : 1 (pid: 270070)
  exitcode  : 1 (pid: 291336)
  error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam07-ib0
  rank      : 55 (local_rank: 7)
  exitcode  : 1 (pid: 291337)
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam14-ib0
  rank      : 111 (local_rank: 7)
  exitcode  : 1 (pid: 270071)
  error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:19
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam38-ib0
  rank      : 299 (local_rank: 3)
  exitcode  : 1 (pid: 77368)
  error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
  host      : jean-zay-iam07-ib0
  rank      : 49 (local_rank: 1)
  exitcode  : 1 (pid: 291331)
  error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/1/error.json
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam14-ib0
  rank      : 104 (local_rank: 0)
  exitcode  : 1 (pid: 270064)
    main()
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam38-ib0
  rank      : 300 (local_rank: 4)
  exitcode  : 1 (pid: 77369)
  error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam38-ib0
  rank      : 301 (local_rank: 5)
  exitcode  : 1 (pid: 77370)
  error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    return launch_agent(self._config, self._entrypoint, list(args))
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    raise ChildFailedError(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam48-ib0
  rank      : 377 (local_rank: 1)
  exitcode  : 1 (pid: 242926)
  error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam48-ib0
  rank      : 378 (local_rank: 2)
  exitcode  : 1 (pid: 242927)
  error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
    exec(code, run_globals)
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam48-ib0
  rank      : 379 (local_rank: 3)
  exitcode  : 1 (pid: 242928)
  error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam48-ib0
  rank      : 380 (local_rank: 4)
  exitcode  : 1 (pid: 242929)
  error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    main()
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    return f(*args, **kwargs)
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    raise ChildFailedError(
    raise ChildFailedError(
    raise ChildFailedError(
    return launch_agent(self._config, self._entrypoint, list(args))
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam16-ib0
  rank      : 121 (local_rank: 1)
  exitcode  : 1 (pid: 257947)
  error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam31-ib0
  rank      : 244 (local_rank: 4)
  exitcode  : 1 (pid: 249073)
  error_file: /tmp/torchelastic_akv0smqd/none_n2qqul8m/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam20-ib0
  rank      : 156 (local_rank: 4)
  exitcode  : 1 (pid: 229016)
  error_file: /tmp/torchelastic_ynh8uw7t/none_o9onshx8/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam16-ib0
  rank      : 122 (local_rank: 2)
  exitcode  : 1 (pid: 257948)
  error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam20-ib0
  rank      : 152 (local_rank: 0)
  exitcode  : 1 (pid: 229012)
  error_file: /tmp/torchelastic_ynh8uw7t/none_o9onshx8/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
    raise ChildFailedError(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam16-ib0
  rank      : 124 (local_rank: 4)
  exitcode  : 1 (pid: 257950)
  error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:14
  host      : jean-zay-iam45-ib0
  rank      : 356 (local_rank: 4)
  exitcode  : 1 (pid: 247069)
  error_file: /tmp/torchelastic_u3pn7nlm/none_93danifp/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam16-ib0
  rank      : 125 (local_rank: 5)
  exitcode  : 1 (pid: 257951)
  error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam16-ib0
  rank      : 127 (local_rank: 7)
  exitcode  : 1 (pid: 257953)
  error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:16
  host      : jean-zay-iam16-ib0
  rank      : 120 (local_rank: 0)
  exitcode  : 1 (pid: 257946)
  error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam25-ib0
  rank      : 194 (local_rank: 2)
  exitcode  : 1 (pid: 245947)
  error_file: /tmp/torchelastic_zgly6wyk/none_ur58uh_8/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[1]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam25-ib0
  rank      : 196 (local_rank: 4)
  exitcode  : 1 (pid: 245949)
  error_file: /tmp/torchelastic_zgly6wyk/none_ur58uh_8/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam36-ib0
  rank      : 285 (local_rank: 5)
  exitcode  : 1 (pid: 248201)
  error_file: /tmp/torchelastic_e_0ppd2k/none_85qwmbg_/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam21-ib0
  rank      : 164 (local_rank: 4)
  exitcode  : 1 (pid: 231488)
  error_file: /tmp/torchelastic_o3ff8z3d/none_yg9aink9/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam06-ib0
  rank      : 47 (local_rank: 7)
  exitcode  : 1 (pid: 287085)
  error_file: /tmp/torchelastic_mui43ycr/none_de5dz9l5/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam22-ib0
  rank      : 171 (local_rank: 3)
  exitcode  : 1 (pid: 210936)
  error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam22-ib0
  rank      : 172 (local_rank: 4)
  exitcode  : 1 (pid: 210937)
  error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam22-ib0
  rank      : 173 (local_rank: 5)
  exitcode  : 1 (pid: 210938)
  error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam22-ib0
  rank      : 174 (local_rank: 6)
  exitcode  : 1 (pid: 210939)
  error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam22-ib0
  rank      : 175 (local_rank: 7)
  exitcode  : 1 (pid: 210940)
  error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam22-ib0
  rank      : 170 (local_rank: 2)
  exitcode  : 1 (pid: 210935)
  error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam32-ib0
  rank      : 248 (local_rank: 0)
  exitcode  : 1 (pid: 250776)
  error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[1]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam32-ib0
  rank      : 249 (local_rank: 1)
  exitcode  : 1 (pid: 250777)
  error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam32-ib0
  rank      : 250 (local_rank: 2)
  exitcode  : 1 (pid: 250778)
  error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam32-ib0
  rank      : 251 (local_rank: 3)
  exitcode  : 1 (pid: 250779)
  error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam32-ib0
  rank      : 253 (local_rank: 5)
  exitcode  : 1 (pid: 250781)
  error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam39-ib0
  rank      : 305 (local_rank: 1)
  exitcode  : 1 (pid: 228357)
  error_file: /tmp/torchelastic_vcfl8_ed/none_1ofckoam/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[1]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam39-ib0
  rank      : 306 (local_rank: 2)
  exitcode  : 1 (pid: 228358)
  error_file: /tmp/torchelastic_vcfl8_ed/none_1ofckoam/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam39-ib0
  rank      : 310 (local_rank: 6)
  exitcode  : 1 (pid: 228362)
  error_file: /tmp/torchelastic_vcfl8_ed/none_1ofckoam/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[2]:
  time      : 2022-03-04_04:03:16
  host      : jean-zay-iam39-ib0
  rank      : 308 (local_rank: 4)
  exitcode  : 1 (pid: 228360)
  error_file: /tmp/torchelastic_vcfl8_ed/none_1ofckoam/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam01-ib0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 297991)
  error_file: /tmp/torchelastic_u0xq61is/none_jbeh2bpz/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[1]:
  time      : 2022-03-04_04:03:16
  host      : jean-zay-iam01-ib0
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 297995)
  error_file: /tmp/torchelastic_u0xq61is/none_jbeh2bpz/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[0]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam44-ib0
  rank      : 344 (local_rank: 0)
  exitcode  : 1 (pid: 248341)
  error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[1]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam44-ib0
  rank      : 345 (local_rank: 1)
  exitcode  : 1 (pid: 248342)
  error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam44-ib0
  rank      : 347 (local_rank: 3)
  exitcode  : 1 (pid: 248344)
  error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam44-ib0
  rank      : 349 (local_rank: 5)
  exitcode  : 1 (pid: 248346)
  error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam44-ib0
  rank      : 351 (local_rank: 7)
  exitcode  : 1 (pid: 248348)
  error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[3]:
  time      : 2022-03-04_04:03:16
  host      : jean-zay-iam44-ib0
  rank      : 348 (local_rank: 4)
  exitcode  : 1 (pid: 248345)
  error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam43-ib0
  rank      : 338 (local_rank: 2)
  exitcode  : 1 (pid: 248428)
  error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam43-ib0
  rank      : 339 (local_rank: 3)
  exitcode  : 1 (pid: 248429)
  error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:15
  host      : jean-zay-iam43-ib0
  rank      : 340 (local_rank: 4)
  exitcode  : 1 (pid: 248430)
  error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam43-ib0
  rank      : 342 (local_rank: 6)
  exitcode  : 1 (pid: 248433)
  error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam43-ib0
  rank      : 343 (local_rank: 7)
  exitcode  : 1 (pid: 248434)
  error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:15
  host      : jean-zay-iam43-ib0
  rank      : 336 (local_rank: 0)
  exitcode  : 1 (pid: 248426)
  error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam40-ib0
  rank      : 313 (local_rank: 1)
  exitcode  : 1 (pid: 108416)
  error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/1/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[2]:
  time      : 2022-03-04_04:03:16
  host      : jean-zay-iam40-ib0
  rank      : 314 (local_rank: 2)
  exitcode  : 1 (pid: 108417)
  error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[3]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam40-ib0
  rank      : 315 (local_rank: 3)
  exitcode  : 1 (pid: 108418)
  error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/3/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code
  error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam29-ib0
  rank      : 231 (local_rank: 7)
  exitcode  : 1 (pid: 251314)
  error_file: /tmp/torchelastic_pbe9bxkf/none_i9gj4lo6/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:21
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam27-ib0
  rank      : 208 (local_rank: 0)
  exitcode  : 1 (pid: 233472)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
  host      : jean-zay-iam12-ib0
  rank      : 89 (local_rank: 1)
  exitcode  : 1 (pid: 254538)
  error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/1/error.json
  error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    exec(code, run_globals)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
  time      : 2022-03-04_04:03:22
  host      : jean-zay-iam29-ib0
  rank      : 226 (local_rank: 2)
  exitcode  : 1 (pid: 251309)
  error_file: /tmp/torchelastic_pbe9bxkf/none_i9gj4lo6/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    main()
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
  exitcode  : 1 (pid: 264231)
  error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam30-ib0
  rank      : 238 (local_rank: 6)
  exitcode  : 1 (pid: 250177)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam03-ib0
  rank      : 22 (local_rank: 6)
  exitcode  : 1 (pid: 264232)
  error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam30-ib0
  rank      : 239 (local_rank: 7)
  exitcode  : 1 (pid: 250178)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam03-ib0
  rank      : 23 (local_rank: 7)
  exitcode  : 1 (pid: 264233)
  error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
  error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[4]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam30-ib0
  rank      : 236 (local_rank: 4)
  exitcode  : 1 (pid: 250175)
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam03-ib0
  rank      : 16 (local_rank: 0)
  exitcode  : 1 (pid: 264226)
  error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:17
  host      : jean-zay-iam47-ib0
  rank      : 372 (local_rank: 4)
  exitcode  : 1 (pid: 242645)
  error_file: /tmp/torchelastic_z75m30zs/none_ouaynzs8/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam38-ib0
  rank      : 302 (local_rank: 6)
  exitcode  : 1 (pid: 77371)
  error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam38-ib0
  rank      : 303 (local_rank: 7)
  exitcode  : 1 (pid: 77372)
  error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam38-ib0
  rank      : 296 (local_rank: 0)
  exitcode  : 1 (pid: 77365)
  error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam48-ib0
  rank      : 381 (local_rank: 5)
  exitcode  : 1 (pid: 242930)
  error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:20
  host      : jean-zay-iam48-ib0
  rank      : 382 (local_rank: 6)
  exitcode  : 1 (pid: 242931)
  error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:21
  host      : jean-zay-iam48-ib0
  rank      : 383 (local_rank: 7)
  exitcode  : 1 (pid: 242932)
  error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam48-ib0
  rank      : 376 (local_rank: 0)
  exitcode  : 1 (pid: 242925)
  error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam32-ib0
  rank      : 254 (local_rank: 6)
  exitcode  : 1 (pid: 250782)
  error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[7]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam32-ib0
  rank      : 255 (local_rank: 7)
  exitcode  : 1 (pid: 250783)
  error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[4]:
  time      : 2022-03-04_04:03:16
  host      : jean-zay-iam32-ib0
  rank      : 252 (local_rank: 4)
  exitcode  : 1 (pid: 250780)
  error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[4]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam40-ib0
  rank      : 317 (local_rank: 5)
  exitcode  : 1 (pid: 108420)
  error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/5/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[5]:
  time      : 2022-03-04_04:03:19
  host      : jean-zay-iam40-ib0
  rank      : 318 (local_rank: 6)
  exitcode  : 1 (pid: 108421)
  error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/6/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
[6]:
  time      : 2022-03-04_04:03:18
  host      : jean-zay-iam40-ib0
  rank      : 319 (local_rank: 7)
  exitcode  : 1 (pid: 108422)
  error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-04_04:03:16
  host      : jean-zay-iam40-ib0
  rank      : 312 (local_rank: 0)
  exitcode  : 1 (pid: 108415)
  error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main
      pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain
      model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer
      args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint
      loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint
      success = self._load_zero_checkpoint(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint
      self.optimizer.load_state_dict(
    File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict
      self.clip_grad = current_rank_sd[CLIP_GRAD]
  KeyError: 'clip_grad'
  
============================================================
srun: error: jean-zay-iam17: task 16: Exited with exit code 1
srun: Terminating job step 202322.0
srun: error: jean-zay-iam29: task 28: Exited with exit code 1
srun: error: jean-zay-iam46: task 45: Exited with exit code 1
srun: error: jean-zay-iam25: task 24: Exited with exit code 1
slurmstepd: error: *** STEP 202322.0 ON jean-zay-iam01 CANCELLED AT 2022-03-04T04:03:27 ***
srun: error: jean-zay-iam09: task 8: Exited with exit code 1
srun: error: jean-zay-iam42: task 41: Exited with exit code 1
srun: error: jean-zay-iam37: task 36: Exited with exit code 1
srun: error: jean-zay-iam21: task 20: Exited with exit code 1
srun: error: jean-zay-iam31: task 30: Exited with exit code 1
srun: error: jean-zay-iam22: task 21: Exited with exit code 1
srun: error: jean-zay-iam19: task 18: Exited with exit code 1
srun: error: jean-zay-iam15: task 14: Exited with exit code 1
srun: error: jean-zay-iam35: task 34: Exited with exit code 1
srun: error: jean-zay-iam03: task 2: Exited with exit code 1
srun: error: jean-zay-iam12: task 11: Exited with exit code 1
srun: error: jean-zay-iam45: task 44: Exited with exit code 1
srun: error: jean-zay-iam10: task 9: Exited with exit code 1
srun: error: jean-zay-iam20: task 19: Exited with exit code 1
srun: error: jean-zay-iam33: task 32: Exited with exit code 1
srun: error: jean-zay-iam27: task 26: Exited with exit code 1
srun: error: jean-zay-iam02: task 1: Exited with exit code 1
srun: error: jean-zay-iam14: task 13: Exited with exit code 1
srun: error: jean-zay-iam28: task 27: Exited with exit code 1
srun: error: jean-zay-iam11: task 10: Exited with exit code 1
srun: error: jean-zay-iam34: task 33: Exited with exit code 1
srun: error: jean-zay-iam38: task 37: Exited with exit code 1
srun: error: jean-zay-iam24: task 23: Exited with exit code 1
srun: error: jean-zay-iam48: task 47: Exited with exit code 1
srun: error: jean-zay-iam36: task 35: Exited with exit code 1
srun: error: jean-zay-iam32: task 31: Exited with exit code 1
srun: error: jean-zay-iam04: task 3: Exited with exit code 1
srun: error: jean-zay-iam30: task 29: Exited with exit code 1
srun: error: jean-zay-iam39: task 38: Exited with exit code 1
srun: error: jean-zay-iam08: task 7: Exited with exit code 1
srun: error: jean-zay-iam23: task 22: Exited with exit code 1
srun: error: jean-zay-iam16: task 15: Exited with exit code 1
srun: error: jean-zay-iam44: task 43: Exited with exit code 1
srun: error: jean-zay-iam26: task 25: Exited with exit code 1
srun: error: jean-zay-iam13: task 12: Exited with exit code 1
srun: error: jean-zay-iam06: task 5: Exited with exit code 1
srun: error: jean-zay-iam40: task 39: Exited with exit code 1
srun: error: jean-zay-iam07: task 6: Exited with exit code 1
srun: error: jean-zay-iam43: task 42: Exited with exit code 1
srun: error: jean-zay-iam01: task 0: Exited with exit code 1
srun: error: jean-zay-iam47: task 46: Exited with exit code 1
srun: error: jean-zay-iam18: task 17: Segmentation fault (core dumped)
srun: error: jean-zay-iam05: task 4: Segmentation fault (core dumped)
srun: error: jean-zay-iam41: task 40: Segmentation fault (core dumped)
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[default0]:using world size: 384, data-parallel-size: 8, tensor-model-parallel size: 4, pipeline-model-parallel size: 12 
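The layout reported on the line above can be cross-checked by hand: in Megatron-style 3D parallelism the world size is the product of the tensor-, pipeline-, and data-parallel degrees. A minimal sketch using only the values from that log line (the variable names are illustrative, not taken from the training code):

```python
# Values copied from the "using world size" log line.
world_size = 384
tensor_parallel = 4
pipeline_parallel = 12

# One model replica spans tensor_parallel * pipeline_parallel GPUs.
gpus_per_replica = tensor_parallel * pipeline_parallel

# The data-parallel degree is the number of such replicas that fit.
data_parallel, remainder = divmod(world_size, gpus_per_replica)
assert remainder == 0, "world size must be divisible by TP * PP"

print(gpus_per_replica, data_parallel)  # prints: 48 8
```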
[default0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:PretrainedFromHF
[default0]:accumulate and all-reduce gradients in fp32 for bfloat16 data type.
[default0]:using torch.bfloat16 for parameters ...
[default0]:------------------------ arguments ------------------------
[default0]:  abort_on_unmet_fused_kernel_constraints ......... True
[default0]:  accumulate_allreduce_grads_in_fp32 .............. True
[default0]:  adam_beta1 ...................................... 0.9
[default0]:  adam_beta2 ...................................... 0.95
[default0]:  adam_eps ........................................ 1e-08
[default0]:  adlr_autoresume ................................. False
[default0]:  adlr_autoresume_interval ........................ 1000
[default0]:  apply_query_key_layer_scaling ................... True
[default0]:  apply_residual_connection_post_layernorm ........ False
[default0]:  attention_dropout ............................... 0.1
[default0]:  attention_softmax_in_fp32 ....................... False
[default0]:  bert_binary_head ................................ True
[default0]:  bert_load ....................................... None
[default0]:  bf16 ............................................ True
[default0]:  bias_dropout_fusion ............................. True
[default0]:  bias_gelu_fusion ................................ True
[default0]:  biencoder_projection_dim ........................ 0
[default0]:  biencoder_shared_query_context_model ............ False
[default0]:  block_data_path ................................. None
[default0]:  checkpoint_activations .......................... True
[default0]:  checkpoint_in_cpu ............................... False
[default0]:  checkpoint_num_layers ........................... 1
[default0]:  clip_grad ....................................... 1.0
[default0]:  codecarbon_dir .................................. None
[default0]:  consumed_train_samples .......................... 0
[default0]:  consumed_train_tokens ........................... 0
[default0]:  consumed_valid_samples .......................... 0
[default0]:  contigious_checkpointing ........................ False
[default0]:  cpu_optimizer ................................... False
[default0]:  cpu_torch_adam .................................. False
[default0]:  curriculum_learning ............................. False
[default0]:  data_impl ....................................... mmap
[default0]:  data_parallel_size .............................. 8
[default0]:  data_path ....................................... None
[default0]:  dataloader_type ................................. single
[default0]:  DDP_impl ........................................ local
[default0]:  decoder_seq_length .............................. None
[default0]:  deepscale ....................................... False
[default0]:  deepscale_config ................................ None
[default0]:  deepspeed ....................................... True
[default0]:  deepspeed_activation_checkpointing .............. True
[default0]:  deepspeed_config ................................ ./ds_config.202330.json
[default0]:  deepspeed_mpi ................................... False
[default0]:  distribute_checkpointed_activations ............. False
[default0]:  distributed_backend ............................. nccl
[default0]:  embed_layernorm ................................. True
[default0]:  embedding_path .................................. None
[default0]:  encoder_seq_length .............................. 2048
[default0]:  eod_mask_loss ................................... False
[default0]:  eval_interval ................................... 1000
[default0]:  eval_iters ...................................... 10
[default0]:  eval_only ....................................... None
[default0]:  evidence_data_path .............................. None
[default0]:  exit_duration_in_mins ........................... 5990
[default0]:  exit_interval ................................... None
[default0]:  ffn_hidden_size ................................. 57344
[default0]:  finetune ........................................ False
[default0]:  fp16 ............................................ False
[default0]:  fp16_lm_cross_entropy ........................... False
[default0]:  fp32_residual_connection ........................ False
[default0]:  gigaflos_no_embeds .............................. 0
[default0]:  global_batch_size ............................... 2048
[default0]:  glu_activation .................................. None
[default0]:  hidden_dropout .................................. 0.1
[default0]:  hidden_size ..................................... 14336
[default0]:  hysteresis ...................................... 2
[default0]:  ict_head_size ................................... None
[default0]:  ict_load ........................................ None
[default0]:  img_dim ......................................... 224
[default0]:  indexer_batch_size .............................. 128
[default0]:  indexer_log_interval ............................ 1000
[default0]:  init_method_std ................................. 0.0048
[default0]:  init_method_xavier_uniform ...................... False
[default0]:  initial_loss_scale .............................. 4294967296
[default0]:  kill_switch_path ................................ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/kill-switch-tr11-176B-exp1
[default0]:  kv_channels ..................................... 128
[default0]:  layernorm_epsilon ............................... 1e-05
[default0]:  lazy_mpu_init ................................... None
[default0]:  load ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:  local_rank ...................................... None
[default0]:  log_batch_size_to_tensorboard ................... True
[default0]:  log_interval .................................... 1
[default0]:  log_learning_rate_to_tensorboard ................ True
[default0]:  log_level ....................................... None
[default0]:  log_level_replica ............................... None
[default0]:  log_loss_scale_to_tensorboard ................... True
[default0]:  log_num_zeros_in_grad ........................... False
[default0]:  log_params_norm ................................. False
[default0]:  log_path ........................................ None
[default0]:  log_timers_to_tensorboard ....................... True
[default0]:  log_validation_ppl_to_tensorboard ............... True
[default0]:  loss_on_targets_only ............................ False
[default0]:  loss_scale ...................................... None
[default0]:  loss_scale_window ............................... 1000
[default0]:  lr .............................................. 6e-05
[default0]:  lr_decay_iters .................................. None
[default0]:  lr_decay_samples ................................ 200000000
[default0]:  lr_decay_style .................................. cosine
[default0]:  lr_decay_tokens ................................. None
[default0]:  lr_warmup_fraction .............................. None
[default0]:  lr_warmup_iters ................................. 0
[default0]:  lr_warmup_samples ............................... 183105
[default0]:  make_vocab_size_divisible_by .................... 128
[default0]:  mask_prob ....................................... 0.15
[default0]:  masked_softmax_fusion ........................... True
[default0]:  max_position_embeddings ......................... 2048
[default0]:  memory_centric_tiled_linear ..................... False
[default0]:  merge_file ...................................... None
[default0]:  micro_batch_size ................................ 2
[default0]:  min_loss_scale .................................. 1.0
[default0]:  min_lr .......................................... 6e-06
[default0]:  mmap_warmup ..................................... False
[default0]:  no_load_optim ................................... None
[default0]:  no_load_rng ..................................... None
[default0]:  no_save_optim ................................... None
[default0]:  no_save_rng ..................................... None
[default0]:  num_attention_heads ............................. 112
[default0]:  num_channels .................................... 3
[default0]:  num_classes ..................................... 1000
[default0]:  num_layers ...................................... 70
[default0]:  num_layers_per_virtual_pipeline_stage ........... None
[default0]:  num_workers ..................................... 2
[default0]:  onnx_safe ....................................... None
[default0]:  openai_gelu ..................................... False
[default0]:  optimizer ....................................... adam
[default0]:  override_lr_scheduler ........................... False
[default0]:  pad_vocab_size_to ............................... 250880
[default0]:  params_dtype .................................... torch.bfloat16
[default0]:  partition_activations ........................... False
[default0]:  patch_dim ....................................... 16
[default0]:  pipeline_model_parallel_size .................... 12
[default0]:  position_embedding_type ......................... PositionEmbeddingType.alibi
[default0]:  pp_partition_method ............................. type:transformer|embedding
[default0]:  profile_backward ................................ False
[default0]:  query_in_block_prob ............................. 0.1
[default0]:  rampup_batch_size ............................... ['16', '16', '9_765_625']
[default0]:  rank ............................................ 0
[default0]:  remote_device ................................... none
[default0]:  reset_attention_mask ............................ False
[default0]:  reset_position_ids .............................. False
[default0]:  retriever_report_topk_accuracies ................ []
[default0]:  retriever_score_scaling ......................... False
[default0]:  retriever_seq_length ............................ 256
[default0]:  reweight_loss_based_on_position_frequency ....... False
[default0]:  sample_rate ..................................... 1.0
[default0]:  save ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]:  save_interval ................................... 500
[default0]:  scatter_gather_tensors_in_pipeline .............. True
[default0]:  scattered_embeddings ............................ False
[default0]:  seed ............................................ 42
[default0]:  seq_length ...................................... 2048
[default0]:  sgd_momentum .................................... 0.9
[default0]:  short_seq_prob .................................. 0.1
[default0]:  skip_train_iteration_range ...................... None
[default0]:  split ........................................... None
[default0]:  split_transformers .............................. False
[default0]:  synchronize_each_layer .......................... False
[default0]:  tensor_model_parallel_size ...................... 4
[default0]:  tensorboard_dir ................................. /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/tr11-176B-ml-logs/tensorboard
[default0]:  tensorboard_log_interval ........................ 1
[default0]:  tensorboard_queue_size .......................... 5
[default0]:  test_weighted_split_names ....................... ['test']
[default0]:  test_weighted_split_paths ....................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  test_weighted_split_paths_path .................. None
[default0]:  test_weighted_split_splits ...................... [['0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0']]
[default0]:  test_weighted_split_weights ..................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  tile_factor ..................................... 1
[default0]:  titles_data_path ................................ None
[default0]:  tokenizer_name_or_path .......................... bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k
[default0]:  tokenizer_type .................................. PretrainedFromHF
[default0]:  train_iters ..................................... None
[default0]:  train_samples ................................... 220000000
[default0]:  train_tokens .................................... None
[default0]:  train_weighted_split_names ...................... ['train']
[default0]:  train_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  train_weighted_split_paths_path ................. None
[default0]:  train_weighted_split_splits ..................... [['0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949']]
[default0]:  train_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  use_bnb_optimizer ............................... False
[default0]:  use_checkpoint_lr_scheduler ..................... False
[default0]:  use_contiguous_buffers_in_ddp ................... True
[default0]:  use_cpu_initialization .......................... None
[default0]:  use_one_sent_docs ............................... False
[default0]:  use_pin_memory .................................. False
[default0]:  valid_weighted_split_names ...................... ['valid']
[default0]:  valid_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]:  valid_weighted_split_paths_path ................. None
[default0]:  valid_weighted_split_splits ..................... [['0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999']]
[default0]:  valid_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]:  virtual_pipeline_model_parallel_size ............ None
[default0]:  vocab_extra_ids ................................. 0
[default0]:  vocab_file ...................................... None
[default0]:  weight_decay .................................... 0.1
[default0]:  world_size ...................................... 384
[default0]:  zero_allgather_bucket_size ...................... 0.0
[default0]:  zero_contigious_gradients ....................... False
[default0]:  zero_reduce_bucket_size ......................... 0.0
[default0]:  zero_reduce_scatter ............................. False
[default0]:  zero_stage ...................................... 0
[default0]:-------------------- end of arguments ---------------------
[default0]:will use batch size rampup starting from global batch size 16 to global batch size 2048 with batch size increments 16 over 9765625 samples.
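Note on the rampup line above: the logged values (start 16, increment 16, final 2048, 9765625 samples) are consistent with a linear step schedule with 127 equal steps. The sketch below is one plausible reading of that schedule, not the scheduler's actual code; the function name and the integer rounding rule are assumptions made for illustration.

```python
def rampup_global_batch_size(consumed_samples,
                             start=16, increment=16,
                             final=2048, rampup_samples=9_765_625):
    """Hypothetical linear rampup: grow the global batch size by `increment`
    each time another 1/num_increments of `rampup_samples` is consumed."""
    num_increments = (final - start) // increment  # (2048 - 16) / 16 = 127
    if consumed_samples >= rampup_samples:
        return final
    # Integer arithmetic avoids floating-point edge cases at step boundaries.
    steps = consumed_samples * num_increments // rampup_samples
    return min(final, start + steps * increment)


if __name__ == "__main__":
    for consumed in (0, 76_895, 1_000_000, 9_765_625):
        print(consumed, rampup_global_batch_size(consumed))
```

Under this reading, each step lasts roughly 76,895 samples (9765625 / 127), so the early part of training runs at small global batch sizes before reaching 2048.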
[default0]:> building PretrainedFromHF tokenizer ...
[default0]: vocab file is un-used. loading tokenizer from pre-trained model
[default0]:Offline mode: forcing local_files_only=True
[default0]:Offline mode: forcing local_files_only=True
[default0]:Can't load following files from cache: ['added_tokens_file'] and cannot check if these files are necessary for the tokenizer to operate.
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/special_tokens_map.json from cache at /gpfswork/rech/six/commun/models/b0b3428eb9bea3ef62a6e9983742117e4860f4ec1af66eebce1702b8ec7cb364.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer_config.json from cache at /gpfswork/rech/six/commun/models/31fb66a88196017b3a12c4798e55bcf8a11b312b42dd9429c83f7237c0a8a807.e683c1a11fe6388761e34fd7cddbcd77f3552cefb70e9aca4a4cc72c027c8f40
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer.json from cache at /gpfswork/rech/six/commun/models/b28b4c1d8aed4c72b765cce6a9a7ce8c5460d05a5b4ea6fa5855dff6a721d171.397b0d7316cb89fa15f0bebce2bd6c5e71e92a14e95de167940173a60253b03e
[default0]: > padded vocab (size: 250680) with 200 dummy tokens (new size: 250880)
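Note on the padding line above: the jump from 250680 to 250880 matches rounding the vocabulary up to the next multiple of 512. A minimal sketch, assuming the multiple is 128 times the tensor model parallel size of 4 reported elsewhere in this log; the factor 128 does not appear in this excerpt and is an assumption, as is the function name.

```python
def pad_vocab_size(orig_vocab_size, divisible_by=128, tensor_parallel_size=4):
    """Round the vocabulary size up to a multiple of
    divisible_by * tensor_parallel_size (both values assumed here)."""
    multiple = divisible_by * tensor_parallel_size  # 512 under these assumptions
    # Ceiling division without floats.
    return -(-orig_vocab_size // multiple) * multiple


if __name__ == "__main__":
    original = 250_680
    padded = pad_vocab_size(original)
    print(f"padded vocab (size: {original}) with {padded - original} "
          f"dummy tokens (new size: {padded})")
```

Padding this way lets the embedding matrix split evenly across the tensor-parallel ranks.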
[default0]:DeepSpeed general environment info:
[default0]:torch install path ............... ['/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch']
[default0]:torch version .................... 1.11.0+cu115
[default0]:torch cuda version ............... 11.5
[default0]:nvcc version ..................... 11.4
[default0]:deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed']
[default0]:deepspeed info ................... 0.6.0+ed26ef4, ed26ef4, olruwase/bf16-updates
[default0]:deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5
[default0]:**** Git info for Megatron: git_hash=0415583 git_branch=sync-meg-lm ****
[default0]:> initializing torch distributed ...
[default7]:> setting tensorboard ...
[default0]:> initializing tensor model parallel with size 4
[default0]:> initializing pipeline model parallel with size 12
[default0]:> setting random seeds to 42 ...
[default0]:[2022-03-04 04:08:55,637] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2760 and data parallel seed: 42
[default0]:> compiling dataset index builder ...
[default0]:make: Entering directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data'
[default0]:make: Nothing to be done for 'default'.
[default0]:make: Leaving directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data'
[default0]:>>> done with dataset index builder. Compilation time: 0.108 seconds
[default0]:> compiling and loading fused kernels ...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module scaled_upper_triang_masked_softmax_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module scaled_upper_triang_masked_softmax_cuda...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module scaled_masked_softmax_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module scaled_masked_softmax_cuda...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module fused_mix_prec_layer_norm_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module fused_mix_prec_layer_norm_cuda...
[default0]:>>> done with compiling and loading fused kernels. Compilation time: 10.004 seconds
[default0]:time to initialize megatron (seconds): 81.752
[default0]:[after megatron is initialized] datetime: 2022-03-04 04:09:05 
[default0]:building GPT model ...
[default0]:[2022-03-04 04:09:05,789] [INFO] [utils.py:828:see_memory_usage] Before Building Model
[default0]:[2022-03-04 04:09:05,789] [INFO] [utils.py:829:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[default0]:[2022-03-04 04:09:05,790] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.22 GB, percent = 8.6%
[default0]:SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
[default0]:Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=0, model=1): 1, ProcessCoord(pipe=0, data=0, model=2): 2, ProcessCoord(pipe=0, data=0, model=3): 3, ProcessCoord(pipe=0, data=1, model=0): 4, ProcessCoord(pipe=0, data=1, model=1): 5, ProcessCoord(pipe=0, data=1, model=2): 6, ProcessCoord(pipe=0, data=1, model=3): 7, ProcessCoord(pipe=0, data=2, model=0): 8, ProcessCoord(pipe=0, data=2, model=1): 9, ProcessCoord(pipe=0, data=2, model=2): 10, ProcessCoord(pipe=0, data=2, model=3): 11, ProcessCoord(pipe=0, data=3, model=0): 12, ProcessCoord(pipe=0, data=3, model=1): 13, ProcessCoord(pipe=0, data=3, model=2): 14, ProcessCoord(pipe=0, data=3, model=3): 15, ProcessCoord(pipe=0, data=4, model=0): 16, ProcessCoord(pipe=0, data=4, model=1): 17, ProcessCoord(pipe=0, data=4, model=2): 18, ProcessCoord(pipe=0, data=4, model=3): 19, ProcessCoord(pipe=0, data=5, model=0): 20, ProcessCoord(pipe=0, data=5, model=1): 21, ProcessCoord(pipe=0, data=5, model=2): 22, ProcessCoord(pipe=0, data=5, model=3): 23, ProcessCoord(pipe=0, data=6, model=0): 24, ProcessCoord(pipe=0, data=6, model=1): 25, ProcessCoord(pipe=0, data=6, model=2): 26, ProcessCoord(pipe=0, data=6, model=3): 27, ProcessCoord(pipe=0, data=7, model=0): 28, ProcessCoord(pipe=0, data=7, model=1): 29, ProcessCoord(pipe=0, data=7, model=2): 30, ProcessCoord(pipe=0, data=7, model=3): 31, ProcessCoord(pipe=1, data=0, model=0): 32, ProcessCoord(pipe=1, data=0, model=1): 33, ProcessCoord(pipe=1, data=0, model=2): 34, ProcessCoord(pipe=1, data=0, model=3): 35, ProcessCoord(pipe=1, data=1, model=0): 36, ProcessCoord(pipe=1, data=1, model=1): 37, ProcessCoord(pipe=1, data=1, model=2): 38, ProcessCoord(pipe=1, data=1, model=3): 39, ProcessCoord(pipe=1, data=2, model=0): 40, ProcessCoord(pipe=1, data=2, model=1): 41, ProcessCoord(pipe=1, data=2, model=2): 42, ProcessCoord(pipe=1, data=2, model=3): 43, ProcessCoord(pipe=1, data=3, model=0): 44, ProcessCoord(pipe=1, data=3, model=1): 45, 
ProcessCoord(pipe=1, data=3, model=2): 46, ProcessCoord(pipe=1, data=3, model=3): 47, ProcessCoord(pipe=1, data=4, model=0): 48, ProcessCoord(pipe=1, data=4, model=1): 49, ProcessCoord(pipe=1, data=4, model=2): 50, ProcessCoord(pipe=1, data=4, model=3): 51, ProcessCoord(pipe=1, data=5, model=0): 52, ProcessCoord(pipe=1, data=5, model=1): 53, ProcessCoord(pipe=1, data=5, model=2): 54, ProcessCoord(pipe=1, data=5, model=3): 55, ProcessCoord(pipe=1, data=6, model=0): 56, ProcessCoord(pipe=1, data=6, model=1): 57, ProcessCoord(pipe=1, data=6, model=2): 58, ProcessCoord(pipe=1, data=6, model=3): 59, ProcessCoord(pipe=1, data=7, model=0): 60, ProcessCoord(pipe=1, data=7, model=1): 61, ProcessCoord(pipe=1, data=7, model=2): 62, ProcessCoord(pipe=1, data=7, model=3): 63, ProcessCoord(pipe=2, data=0, model=0): 64, ProcessCoord(pipe=2, data=0, model=1): 65, ProcessCoord(pipe=2, data=0, model=2): 66, ProcessCoord(pipe=2, data=0, model=3): 67, ProcessCoord(pipe=2, data=1, model=0): 68, ProcessCoord(pipe=2, data=1, model=1): 69, ProcessCoord(pipe=2, data=1, model=2): 70, ProcessCoord(pipe=2, data=1, model=3): 71, ProcessCoord(pipe=2, data=2, model=0): 72, ProcessCoord(pipe=2, data=2, model=1): 73, ProcessCoord(pipe=2, data=2, model=2): 74, ProcessCoord(pipe=2, data=2, model=3): 75, ProcessCoord(pipe=2, data=3, model=0): 76, ProcessCoord(pipe=2, data=3, model=1): 77, ProcessCoord(pipe=2, data=3, model=2): 78, ProcessCoord(pipe=2, data=3, model=3): 79, ProcessCoord(pipe=2, data=4, model=0): 80, ProcessCoord(pipe=2, data=4, model=1): 81, ProcessCoord(pipe=2, data=4, model=2): 82, ProcessCoord(pipe=2, data=4, model=3): 83, ProcessCoord(pipe=2, data=5, model=0): 84, ProcessCoord(pipe=2, data=5, model=1): 85, ProcessCoord(pipe=2, data=5, model=2): 86, ProcessCoord(pipe=2, data=5, model=3): 87, ProcessCoord(pipe=2, data=6, model=0): 88, ProcessCoord(pipe=2, data=6, model=1): 89, ProcessCoord(pipe=2, data=6, model=2): 90, ProcessCoord(pipe=2, data=6, model=3): 91, ProcessCoord(pipe=2, 
data=7, model=0): 92, ProcessCoord(pipe=2, data=7, model=1): 93, ProcessCoord(pipe=2, data=7, model=2): 94, ProcessCoord(pipe=2, data=7, model=3): 95, ProcessCoord(pipe=3, data=0, model=0): 96, ProcessCoord(pipe=3, data=0, model=1): 97, ProcessCoord(pipe=3, data=0, model=2): 98, ProcessCoord(pipe=3, data=0, model=3): 99, ProcessCoord(pipe=3, data=1, model=0): 100, ProcessCoord(pipe=3, data=1, model=1): 101, ProcessCoord(pipe=3, data=1, model=2): 102, ProcessCoord(pipe=3, data=1, model=3): 103, ProcessCoord(pipe=3, data=2, model=0): 104, ProcessCoord(pipe=3, data=2, model=1): 105, ProcessCoord(pipe=3, data=2, model=2): 106, ProcessCoord(pipe=3, data=2, model=3): 107, ProcessCoord(pipe=3, data=3, model=0): 108, ProcessCoord(pipe=3, data=3, model=1): 109, ProcessCoord(pipe=3, data=3, model=2): 110, ProcessCoord(pipe=3, data=3, model=3): 111, ProcessCoord(pipe=3, data=4, model=0): 112, ProcessCoord(pipe=3, data=4, model=1): 113, ProcessCoord(pipe=3, data=4, model=2): 114, ProcessCoord(pipe=3, data=4, model=3): 115, ProcessCoord(pipe=3, data=5, model=0): 116, ProcessCoord(pipe=3, data=5, model=1): 117, ProcessCoord(pipe=3, data=5, model=2): 118, ProcessCoord(pipe=3, data=5, model=3): 119, ProcessCoord(pipe=3, data=6, model=0): 120, ProcessCoord(pipe=3, data=6, model=1): 121, ProcessCoord(pipe=3, data=6, model=2): 122, ProcessCoord(pipe=3, data=6, model=3): 123, ProcessCoord(pipe=3, data=7, model=0): 124, ProcessCoord(pipe=3, data=7, model=1): 125, ProcessCoord(pipe=3, data=7, model=2): 126, ProcessCoord(pipe=3, data=7, model=3): 127, ProcessCoord(pipe=4, data=0, model=0): 128, ProcessCoord(pipe=4, data=0, model=1): 129, ProcessCoord(pipe=4, data=0, model=2): 130, ProcessCoord(pipe=4, data=0, model=3): 131, ProcessCoord(pipe=4, data=1, model=0): 132, ProcessCoord(pipe=4, data=1, model=1): 133, ProcessCoord(pipe=4, data=1, model=2): 134, ProcessCoord(pipe=4, data=1, model=3): 135, ProcessCoord(pipe=4, data=2, model=0): 136, ProcessCoord(pipe=4, data=2, model=1): 137, 
ProcessCoord(pipe=4, data=2, model=2): 138, ProcessCoord(pipe=4, data=2, model=3): 139, ProcessCoord(pipe=4, data=3, model=0): 140, ProcessCoord(pipe=4, data=3, model=1): 141, ProcessCoord(pipe=4, data=3, model=2): 142, ProcessCoord(pipe=4, data=3, model=3): 143, ProcessCoord(pipe=4, data=4, model=0): 144, ProcessCoord(pipe=4, data=4, model=1): 145, ProcessCoord(pipe=4, data=4, model=2): 146, ProcessCoord(pipe=4, data=4, model=3): 147, ProcessCoord(pipe=4, data=5, model=0): 148, ProcessCoord(pipe=4, data=5, model=1): 149, ProcessCoord(pipe=4, data=5, model=2): 150, ProcessCoord(pipe=4, data=5, model=3): 151, ProcessCoord(pipe=4, data=6, model=0): 152, ProcessCoord(pipe=4, data=6, model=1): 153, ProcessCoord(pipe=4, data=6, model=2): 154, ProcessCoord(pipe=4, data=6, model=3): 155, ProcessCoord(pipe=4, data=7, model=0): 156, ProcessCoord(pipe=4, data=7, model=1): 157, ProcessCoord(pipe=4, data=7, model=2): 158, ProcessCoord(pipe=4, data=7, model=3): 159, ProcessCoord(pipe=5, data=0, model=0): 160, ProcessCoord(pipe=5, data=0, model=1): 161, ProcessCoord(pipe=5, data=0, model=2): 162, ProcessCoord(pipe=5, data=0, model=3): 163, ProcessCoord(pipe=5, data=1, model=0): 164, ProcessCoord(pipe=5, data=1, model=1): 165, ProcessCoord(pipe=5, data=1, model=2): 166, ProcessCoord(pipe=5, data=1, model=3): 167, ProcessCoord(pipe=5, data=2, model=0): 168, ProcessCoord(pipe=5, data=2, model=1): 169, ProcessCoord(pipe=5, data=2, model=2): 170, ProcessCoord(pipe=5, data=2, model=3): 171, ProcessCoord(pipe=5, data=3, model=0): 172, ProcessCoord(pipe=5, data=3, model=1): 173, ProcessCoord(pipe=5, data=3, model=2): 174, ProcessCoord(pipe=5, data=3, model=3): 175, ProcessCoord(pipe=5, data=4, model=0): 176, ProcessCoord(pipe=5, data=4, model=1): 177, ProcessCoord(pipe=5, data=4, model=2): 178, ProcessCoord(pipe=5, data=4, model=3): 179, ProcessCoord(pipe=5, data=5, model=0): 180, ProcessCoord(pipe=5, data=5, model=1): 181, ProcessCoord(pipe=5, data=5, model=2): 182, 
ProcessCoord(pipe=5, data=5, model=3): 183, ProcessCoord(pipe=5, data=6, model=0): 184, ProcessCoord(pipe=5, data=6, model=1): 185, ProcessCoord(pipe=5, data=6, model=2): 186, ProcessCoord(pipe=5, data=6, model=3): 187, ProcessCoord(pipe=5, data=7, model=0): 188, ProcessCoord(pipe=5, data=7, model=1): 189, ProcessCoord(pipe=5, data=7, model=2): 190, ProcessCoord(pipe=5, data=7, model=3): 191, ProcessCoord(pipe=6, data=0, model=0): 192, ProcessCoord(pipe=6, data=0, model=1): 193, ProcessCoord(pipe=6, data=0, model=2): 194, ProcessCoord(pipe=6, data=0, model=3): 195, ProcessCoord(pipe=6, data=1, model=0): 196, ProcessCoord(pipe=6, data=1, model=1): 197, ProcessCoord(pipe=6, data=1, model=2): 198, ProcessCoord(pipe=6, data=1, model=3): 199, ProcessCoord(pipe=6, data=2, model=0): 200, ProcessCoord(pipe=6, data=2, model=1): 201, ProcessCoord(pipe=6, data=2, model=2): 202, ProcessCoord(pipe=6, data=2, model=3): 203, ProcessCoord(pipe=6, data=3, model=0): 204, ProcessCoord(pipe=6, data=3, model=1): 205, ProcessCoord(pipe=6, data=3, model=2): 206, ProcessCoord(pipe=6, data=3, model=3): 207, ProcessCoord(pipe=6, data=4, model=0): 208, ProcessCoord(pipe=6, data=4, model=1): 209, ProcessCoord(pipe=6, data=4, model=2): 210, ProcessCoord(pipe=6, data=4, model=3): 211, ProcessCoord(pipe=6, data=5, model=0): 212, ProcessCoord(pipe=6, data=5, model=1): 213, ProcessCoord(pipe=6, data=5, model=2): 214, ProcessCoord(pipe=6, data=5, model=3): 215, ProcessCoord(pipe=6, data=6, model=0): 216, ProcessCoord(pipe=6, data=6, model=1): 217, ProcessCoord(pipe=6, data=6, model=2): 218, ProcessCoord(pipe=6, data=6, model=3): 219, ProcessCoord(pipe=6, data=7, model=0): 220, ProcessCoord(pipe=6, data=7, model=1): 221, ProcessCoord(pipe=6, data=7, model=2): 222, ProcessCoord(pipe=6, data=7, model=3): 223, ProcessCoord(pipe=7, data=0, model=0): 224, ProcessCoord(pipe=7, data=0, model=1): 225, ProcessCoord(pipe=7, data=0, model=2): 226, ProcessCoord(pipe=7, data=0, model=3): 227, 
ProcessCoord(pipe=7, data=1, model=0): 228, ProcessCoord(pipe=7, data=1, model=1): 229, ProcessCoord(pipe=7, data=1, model=2): 230, ProcessCoord(pipe=7, data=1, model=3): 231, ProcessCoord(pipe=7, data=2, model=0): 232, ProcessCoord(pipe=7, data=2, model=1): 233, ProcessCoord(pipe=7, data=2, model=2): 234, ProcessCoord(pipe=7, data=2, model=3): 235, ProcessCoord(pipe=7, data=3, model=0): 236, ProcessCoord(pipe=7, data=3, model=1): 237, ProcessCoord(pipe=7, data=3, model=2): 238, ProcessCoord(pipe=7, data=3, model=3): 239, ProcessCoord(pipe=7, data=4, model=0): 240, ProcessCoord(pipe=7, data=4, model=1): 241, ProcessCoord(pipe=7, data=4, model=2): 242, ProcessCoord(pipe=7, data=4, model=3): 243, ProcessCoord(pipe=7, data=5, model=0): 244, ProcessCoord(pipe=7, data=5, model=1): 245, ProcessCoord(pipe=7, data=5, model=2): 246, ProcessCoord(pipe=7, data=5, model=3): 247, ProcessCoord(pipe=7, data=6, model=0): 248, ProcessCoord(pipe=7, data=6, model=1): 249, ProcessCoord(pipe=7, data=6, model=2): 250, ProcessCoord(pipe=7, data=6, model=3): 251, ProcessCoord(pipe=7, data=7, model=0): 252, ProcessCoord(pipe=7, data=7, model=1): 253, ProcessCoord(pipe=7, data=7, model=2): 254, ProcessCoord(pipe=7, data=7, model=3): 255, ProcessCoord(pipe=8, data=0, model=0): 256, ProcessCoord(pipe=8, data=0, model=1): 257, ProcessCoord(pipe=8, data=0, model=2): 258, ProcessCoord(pipe=8, data=0, model=3): 259, ProcessCoord(pipe=8, data=1, model=0): 260, ProcessCoord(pipe=8, data=1, model=1): 261, ProcessCoord(pipe=8, data=1, model=2): 262, ProcessCoord(pipe=8, data=1, model=3): 263, ProcessCoord(pipe=8, data=2, model=0): 264, ProcessCoord(pipe=8, data=2, model=1): 265, ProcessCoord(pipe=8, data=2, model=2): 266, ProcessCoord(pipe=8, data=2, model=3): 267, ProcessCoord(pipe=8, data=3, model=0): 268, ProcessCoord(pipe=8, data=3, model=1): 269, ProcessCoord(pipe=8, data=3, model=2): 270, ProcessCoord(pipe=8, data=3, model=3): 271, ProcessCoord(pipe=8, data=4, model=0): 272, 
ProcessCoord(pipe=8, data=4, model=1): 273, ProcessCoord(pipe=8, data=4, model=2): 274, ProcessCoord(pipe=8, data=4, model=3): 275, ProcessCoord(pipe=8, data=5, model=0): 276, ProcessCoord(pipe=8, data=5, model=1): 277, ProcessCoord(pipe=8, data=5, model=2): 278, ProcessCoord(pipe=8, data=5, model=3): 279, ProcessCoord(pipe=8, data=6, model=0): 280, ProcessCoord(pipe=8, data=6, model=1): 281, ProcessCoord(pipe=8, data=6, model=2): 282, ProcessCoord(pipe=8, data=6, model=3): 283, ProcessCoord(pipe=8, data=7, model=0): 284, ProcessCoord(pipe=8, data=7, model=1): 285, ProcessCoord(pipe=8, data=7, model=2): 286, ProcessCoord(pipe=8, data=7, model=3): 287, ProcessCoord(pipe=9, data=0, model=0): 288, ProcessCoord(pipe=9, data=0, model=1): 289, ProcessCoord(pipe=9, data=0, model=2): 290, ProcessCoord(pipe=9, data=0, model=3): 291, ProcessCoord(pipe=9, data=1, model=0): 292, ProcessCoord(pipe=9, data=1, model=1): 293, ProcessCoord(pipe=9, data=1, model=2): 294, ProcessCoord(pipe=9, data=1, model=3): 295, ProcessCoord(pipe=9, data=2, model=0): 296, ProcessCoord(pipe=9, data=2, model=1): 297, ProcessCoord(pipe=9, data=2, model=2): 298, ProcessCoord(pipe=9, data=2, model=3): 299, ProcessCoord(pipe=9, data=3, model=0): 300, ProcessCoord(pipe=9, data=3, model=1): 301, ProcessCoord(pipe=9, data=3, model=2): 302, ProcessCoord(pipe=9, data=3, model=3): 303, ProcessCoord(pipe=9, data=4, model=0): 304, ProcessCoord(pipe=9, data=4, model=1): 305, ProcessCoord(pipe=9, data=4, model=2): 306, ProcessCoord(pipe=9, data=4, model=3): 307, ProcessCoord(pipe=9, data=5, model=0): 308, ProcessCoord(pipe=9, data=5, model=1): 309, ProcessCoord(pipe=9, data=5, model=2): 310, ProcessCoord(pipe=9, data=5, model=3): 311, ProcessCoord(pipe=9, data=6, model=0): 312, ProcessCoord(pipe=9, data=6, model=1): 313, ProcessCoord(pipe=9, data=6, model=2): 314, ProcessCoord(pipe=9, data=6, model=3): 315, ProcessCoord(pipe=9, data=7, model=0): 316, ProcessCoord(pipe=9, data=7, model=1): 317, 
ProcessCoord(pipe=9, data=7, model=2): 318, ProcessCoord(pipe=9, data=7, model=3): 319, ProcessCoord(pipe=10, data=0, model=0): 320, ProcessCoord(pipe=10, data=0, model=1): 321, ProcessCoord(pipe=10, data=0, model=2): 322, ProcessCoord(pipe=10, data=0, model=3): 323, ProcessCoord(pipe=10, data=1, model=0): 324, ProcessCoord(pipe=10, data=1, model=1): 325, ProcessCoord(pipe=10, data=1, model=2): 326, ProcessCoord(pipe=10, data=1, model=3): 327, ProcessCoord(pipe=10, data=2, model=0): 328, ProcessCoord(pipe=10, data=2, model=1): 329, ProcessCoord(pipe=10, data=2, model=2): 330, ProcessCoord(pipe=10, data=2, model=3): 331, ProcessCoord(pipe=10, data=3, model=0): 332, ProcessCoord(pipe=10, data=3, model=1): 333, ProcessCoord(pipe=10, data=3, model=2): 334, ProcessCoord(pipe=10, data=3, model=3): 335, ProcessCoord(pipe=10, data=4, model=0): 336, ProcessCoord(pipe=10, data=4, model=1): 337, ProcessCoord(pipe=10, data=4, model=2): 338, ProcessCoord(pipe=10, data=4, model=3): 339, ProcessCoord(pipe=10, data=5, model=0): 340, ProcessCoord(pipe=10, data=5, model=1): 341, ProcessCoord(pipe=10, data=5, model=2): 342, ProcessCoord(pipe=10, data=5, model=3): 343, ProcessCoord(pipe=10, data=6, model=0): 344, ProcessCoord(pipe=10, data=6, model=1): 345, ProcessCoord(pipe=10, data=6, model=2): 346, ProcessCoord(pipe=10, data=6, model=3): 347, ProcessCoord(pipe=10, data=7, model=0): 348, ProcessCoord(pipe=10, data=7, model=1): 349, ProcessCoord(pipe=10, data=7, model=2): 350, ProcessCoord(pipe=10, data=7, model=3): 351, ProcessCoord(pipe=11, data=0, model=0): 352, ProcessCoord(pipe=11, data=0, model=1): 353, ProcessCoord(pipe=11, data=0, model=2): 354, ProcessCoord(pipe=11, data=0, model=3): 355, ProcessCoord(pipe=11, data=1, model=0): 356, ProcessCoord(pipe=11, data=1, model=1): 357, ProcessCoord(pipe=11, data=1, model=2): 358, ProcessCoord(pipe=11, data=1, model=3): 359, ProcessCoord(pipe=11, data=2, model=0): 360, ProcessCoord(pipe=11, data=2, model=1): 361, ProcessCoord(pipe=11, 
data=2, model=2): 362, ProcessCoord(pipe=11, data=2, model=3): 363, ProcessCoord(pipe=11, data=3, model=0): 364, ProcessCoord(pipe=11, data=3, model=1): 365, ProcessCoord(pipe=11, data=3, model=2): 366, ProcessCoord(pipe=11, data=3, model=3): 367, ProcessCoord(pipe=11, data=4, model=0): 368, ProcessCoord(pipe=11, data=4, model=1): 369, ProcessCoord(pipe=11, data=4, model=2): 370, ProcessCoord(pipe=11, data=4, model=3): 371, ProcessCoord(pipe=11, data=5, model=0): 372, ProcessCoord(pipe=11, data=5, model=1): 373, ProcessCoord(pipe=11, data=5, model=2): 374, ProcessCoord(pipe=11, data=5, model=3): 375, ProcessCoord(pipe=11, data=6, model=0): 376, ProcessCoord(pipe=11, data=6, model=1): 377, ProcessCoord(pipe=11, data=6, model=2): 378, ProcessCoord(pipe=11, data=6, model=3): 379, ProcessCoord(pipe=11, data=7, model=0): 380, ProcessCoord(pipe=11, data=7, model=1): 381, ProcessCoord(pipe=11, data=7, model=2): 382, ProcessCoord(pipe=11, data=7, model=3): 383}
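Note on the topology dump above: the 384 entries follow a regular row-major layout over (pipe=12, data=8, model=4), with the model (tensor-parallel) index varying fastest. The sketch below reproduces that mapping as read from the logged values; the helper names are hypothetical and this is not the library's own topology class.

```python
from collections import namedtuple

ProcessCoord = namedtuple("ProcessCoord", ["pipe", "data", "model"])

# Axis sizes read off the log: 12 pipeline stages, 8 data-parallel
# replicas, 4 tensor-parallel ranks (12 * 8 * 4 = 384 = world_size).
PIPE, DATA, MODEL = 12, 8, 4


def coord_to_rank(coord):
    """Row-major rank: pipe is the slowest axis, model the fastest."""
    return (coord.pipe * DATA + coord.data) * MODEL + coord.model


def rank_to_coord(rank):
    """Inverse of coord_to_rank."""
    pipe, rest = divmod(rank, DATA * MODEL)
    data, model = divmod(rest, MODEL)
    return ProcessCoord(pipe, data, model)


if __name__ == "__main__":
    print(coord_to_rank(ProcessCoord(pipe=1, data=0, model=0)))
    print(rank_to_coord(383))
```

With this layout the four tensor-parallel ranks of a replica are adjacent, and each pipeline stage occupies a contiguous block of 32 ranks.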
[default0]:[2022-03-04 04:09:07,781] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method type:transformer|embedding
[default0]:stage=0 layers=8
[default0]:     0: _to_float16
[default0]:     1: EmbeddingPipe
[default0]:     2: <lambda>
[default0]:     3: ParallelTransformerLayerPipe
[default0]:     4: ParallelTransformerLayerPipe
[default0]:     5: ParallelTransformerLayerPipe
[default0]:     6: ParallelTransformerLayerPipe
[default0]:     7: ParallelTransformerLayerPipe
[default0]:stage=1 layers=6
[default0]:     8: ParallelTransformerLayerPipe
[default0]:     9: ParallelTransformerLayerPipe
[default0]:    10: ParallelTransformerLayerPipe
[default0]:    11: ParallelTransformerLayerPipe
[default0]:    12: ParallelTransformerLayerPipe
[default0]:    13: ParallelTransformerLayerPipe
[default0]:stage=2 layers=6
[default0]:    14: ParallelTransformerLayerPipe
[default0]:    15: ParallelTransformerLayerPipe
[default0]:    16: ParallelTransformerLayerPipe
[default0]:    17: ParallelTransformerLayerPipe
[default0]:    18: ParallelTransformerLayerPipe
[default0]:    19: ParallelTransformerLayerPipe
[default0]:stage=3 layers=6
[default0]:    20: ParallelTransformerLayerPipe
[default0]:    21: ParallelTransformerLayerPipe
[default0]:    22: ParallelTransformerLayerPipe
[default0]:    23: ParallelTransformerLayerPipe
[default0]:    24: ParallelTransformerLayerPipe
[default0]:    25: ParallelTransformerLayerPipe
[default0]:stage=4 layers=6
[default0]:    26: ParallelTransformerLayerPipe
[default0]:    27: ParallelTransformerLayerPipe
[default0]:    28: ParallelTransformerLayerPipe
[default0]:    29: ParallelTransformerLayerPipe
[default0]:    30: ParallelTransformerLayerPipe
[default0]:    31: ParallelTransformerLayerPipe
[default0]:stage=5 layers=6
[default0]:    32: ParallelTransformerLayerPipe
[default0]:    33: ParallelTransformerLayerPipe
[default0]:    34: ParallelTransformerLayerPipe
[default0]:    35: ParallelTransformerLayerPipe
[default0]:    36: ParallelTransformerLayerPipe
[default0]:    37: ParallelTransformerLayerPipe
[default0]:stage=6 layers=6
[default0]:    38: ParallelTransformerLayerPipe
[default0]:    39: ParallelTransformerLayerPipe
[default0]:    40: ParallelTransformerLayerPipe
[default0]:    41: ParallelTransformerLayerPipe
[default0]:    42: ParallelTransformerLayerPipe
[default0]:    43: ParallelTransformerLayerPipe
[default0]:stage=7 layers=6
[default0]:    44: ParallelTransformerLayerPipe
[default0]:    45: ParallelTransformerLayerPipe
[default0]:    46: ParallelTransformerLayerPipe
[default0]:    47: ParallelTransformerLayerPipe
[default0]:    48: ParallelTransformerLayerPipe
[default0]:    49: ParallelTransformerLayerPipe
[default0]:stage=8 layers=6
[default0]:    50: ParallelTransformerLayerPipe
[default0]:    51: ParallelTransformerLayerPipe
[default0]:    52: ParallelTransformerLayerPipe
[default0]:    53: ParallelTransformerLayerPipe
[default0]:    54: ParallelTransformerLayerPipe
[default0]:    55: ParallelTransformerLayerPipe
[default0]:stage=9 layers=6
[default0]:    56: ParallelTransformerLayerPipe
[default0]:    57: ParallelTransformerLayerPipe
[default0]:    58: ParallelTransformerLayerPipe
[default0]:    59: ParallelTransformerLayerPipe
[default0]:    60: ParallelTransformerLayerPipe
[default0]:    61: ParallelTransformerLayerPipe
[default0]:stage=10 layers=6
[default0]:    62: ParallelTransformerLayerPipe
[default0]:    63: ParallelTransformerLayerPipe
[default0]:    64: ParallelTransformerLayerPipe
[default0]:    65: ParallelTransformerLayerPipe
[default0]:    66: ParallelTransformerLayerPipe
[default0]:    67: ParallelTransformerLayerPipe
[default0]:stage=11 layers=9
[default0]:    68: ParallelTransformerLayerPipe
[default0]:    69: ParallelTransformerLayerPipe
[default0]:    70: ParallelTransformerLayerPipe
[default0]:    71: ParallelTransformerLayerPipe
[default0]:    72: ParallelTransformerLayerPipe
[default0]:    73: <lambda>
[default0]:    74: MixedFusedLayerNorm
[default0]:    75: EmbeddingPipe
[default0]:    76: float16_to_fp32
[default0]:  loss: CrossEntropy
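Note on the partition listing above: stages 0 and 11 hold more entries (8 and 9) than the middle stages (6), yet every stage holds exactly 6 layers whose names match the logged method "type:transformer|embedding". The sketch below checks this from the logged stage boundaries. It assumes the method string is applied as a case-insensitive regex on layer class names; it only verifies the balance and does not implement the partitioner itself.

```python
import re

# Layer names in pipeline order, transcribed from the listing (indices 0..76).
layers = (
    ["_to_float16", "EmbeddingPipe", "<lambda>"]
    + ["ParallelTransformerLayerPipe"] * 70
    + ["<lambda>", "MixedFusedLayerNorm", "EmbeddingPipe", "float16_to_fp32"]
)

# First layer index of each stage as logged, plus the end sentinel.
bounds = [0, 8, 14, 20, 26, 32, 38, 44, 50, 56, 62, 68, 77]

# Assumption: layers matching this pattern count as weight 1, others as 0.
pattern = re.compile("transformer|embedding", re.IGNORECASE)
weights = [1 if pattern.search(name) else 0 for name in layers]

# Number of counted layers in each of the 12 stages.
per_stage = [sum(weights[a:b]) for a, b in zip(bounds, bounds[1:])]

if __name__ == "__main__":
    print(per_stage)
```

Under this assumption the 70 transformer layers plus the 2 embedding layers give 72 counted layers, or 6 per stage across 12 stages, which is why the first and last stages carry one fewer transformer layer than the others.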
[default0]:[2022-03-04 04:09:08,974] [INFO] [utils.py:828:see_memory_usage] After Building Model
[default0]:[2022-03-04 04:09:08,974] [INFO] [utils.py:829:see_memory_usage] MA 7.43 GB         Max_MA 7.43 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-04 04:09:08,975] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.64 GB, percent = 8.7%
[default0]:setting training iterations to 128728
[default0]:> learning rate decay style: cosine
[default0]:DeepSpeed is enabled.
[default0]:[2022-03-04 04:09:08,996] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.0+ed26ef4, git-hash=ed26ef4, git-branch=olruwase/bf16-updates
[default0]:[2022-03-04 04:09:10,849] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[default0]:[2022-03-04 04:09:10,850] [INFO] [engine.py:1092:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[default0]:[2022-03-04 04:09:10,850] [INFO] [engine.py:1098:_configure_optimizer] Using client Optimizer as basic optimizer
[default0]:[2022-03-04 04:09:10,850] [INFO] [engine.py:1114:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
[default0]:[2022-03-04 04:09:10,850] [INFO] [engine.py:1328:_configure_bf16_optimizer] Creating unfused BF16 optimizer
[default0]:[2022-03-04 04:09:10,885] [INFO] [utils.py:828:see_memory_usage] begin bf16_optimizer
[default0]:[2022-03-04 04:09:10,886] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB         Max_MA 7.43 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-04 04:09:10,886] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.99 GB, percent = 8.7%
[default0]:[2022-03-04 04:09:10,908] [INFO] [utils.py:828:see_memory_usage] before initializing group 0
[default0]:[2022-03-04 04:09:10,909] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB         Max_MA 7.42 GB         CA 7.45 GB         Max_CA 7 GB 
[default0]:[2022-03-04 04:09:10,909] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.99 GB, percent = 8.7%
[default0]:[2022-03-04 04:09:10,971] [INFO] [utils.py:828:see_memory_usage] after initializing group 0
[default0]:[2022-03-04 04:09:10,971] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB         Max_MA 17.01 GB         CA 20.23 GB         Max_CA 20 GB 
[default0]:[2022-03-04 04:09:10,971] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.99 GB, percent = 8.7%
[default0]:[2022-03-04 04:09:10,992] [INFO] [utils.py:828:see_memory_usage] before initializing group 1
[default0]:[2022-03-04 04:09:10,993] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB         Max_MA 17.01 GB         CA 20.23 GB         Max_CA 20 GB 
[default0]:[2022-03-04 04:09:10,993] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.99 GB, percent = 8.7%
[default0]:[2022-03-04 04:09:11,036] [INFO] [utils.py:828:see_memory_usage] after initializing group 1
[default0]:[2022-03-04 04:09:11,036] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB         Max_MA 24.11 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-04 04:09:11,036] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.99 GB, percent = 8.7%
[default0]:[2022-03-04 04:09:11,057] [INFO] [utils.py:828:see_memory_usage] before initializing group 2
[default0]:[2022-03-04 04:09:11,058] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB         Max_MA 24.11 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-04 04:09:11,058] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.99 GB, percent = 8.7%
[default0]:[2022-03-04 04:09:11,080] [INFO] [utils.py:828:see_memory_usage] after initializing group 2
[default0]:[2022-03-04 04:09:11,080] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB         Max_MA 24.12 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-04 04:09:11,080] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.99 GB, percent = 8.7%
[default0]:[2022-03-04 04:09:11,102] [INFO] [utils.py:828:see_memory_usage] before initialize_optimizer
[default0]:[2022-03-04 04:09:11,102] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB         Max_MA 24.12 GB         CA 30.5 GB         Max_CA 30 GB 
[default0]:[2022-03-04 04:09:11,102] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.99 GB, percent = 8.7%
[default0]:[2022-03-04 04:09:11,152] [INFO] [utils.py:828:see_memory_usage] end initialize_optimizer
[default0]:[2022-03-04 04:09:11,153] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB         Max_MA 27.82 GB         CA 34.21 GB         Max_CA 34 GB 
[default0]:[2022-03-04 04:09:11,153] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.99 GB, percent = 8.7%
[default0]:[2022-03-04 04:09:11,175] [INFO] [utils.py:828:see_memory_usage] end bf16_optimizer
[default0]:[2022-03-04 04:09:11,175] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB         Max_MA 27.82 GB         CA 34.21 GB         Max_CA 34 GB 
[default0]:[2022-03-04 04:09:11,175] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 43.99 GB, percent = 8.7%
[default0]:[2022-03-04 04:09:11,175] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[default0]:[2022-03-04 04:09:11,175] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[default0]:[2022-03-04 04:09:11,175] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x14850b086250>
[default0]:[2022-03-04 04:09:11,176] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1057:print] DeepSpeedEngine configuration:
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   activation_checkpointing_config  {
[default0]:    "partition_activations": false, 
[default0]:    "contiguous_memory_optimization": false, 
[default0]:    "cpu_checkpointing": false, 
[default0]:    "number_checkpoints": null, 
[default0]:    "synchronize_checkpoint_boundary": false, 
[default0]:    "profile": false
[default0]:}
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   amp_enabled .................. False
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   amp_params ................... False
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   autotuning_config ............ {
[default0]:    "enabled": false, 
[default0]:    "start_step": null, 
[default0]:    "end_step": null, 
[default0]:    "metric_path": null, 
[default0]:    "arg_mappings": null, 
[default0]:    "metric": "throughput", 
[default0]:    "model_info": null, 
[default0]:    "results_dir": null, 
[default0]:    "exps_dir": null, 
[default0]:    "overwrite": true, 
[default0]:    "fast": true, 
[default0]:    "start_profile_step": 3, 
[default0]:    "end_profile_step": 5, 
[default0]:    "tuner_type": "gridsearch", 
[default0]:    "tuner_early_stopping": 5, 
[default0]:    "tuner_num_trials": 50, 
[default0]:    "model_info_path": null, 
[default0]:    "mp_size": 1, 
[default0]:    "max_train_batch_size": null, 
[default0]:    "min_train_batch_size": 1, 
[default0]:    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
[default0]:    "min_train_micro_batch_size_per_gpu": 1, 
[default0]:    "num_tuning_micro_batch_sizes": 3
[default0]:}
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   bfloat16_enabled ............. True
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   checkpoint_tag_validation_enabled  True
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   checkpoint_tag_validation_fail  False
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   communication_data_type ...... None
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   curriculum_enabled ........... False
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   curriculum_params ............ False
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   dataloader_drop_last ......... False
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   disable_allgather ............ False
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   dump_state ................... False
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   dynamic_loss_scale_args ...... None
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   eigenvalue_enabled ........... False
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   eigenvalue_gas_boundary_resolution  1
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   eigenvalue_layer_name ........ bert.encoder.layer
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   eigenvalue_layer_num ......... 0
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   eigenvalue_max_iter .......... 100
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   eigenvalue_stability ......... 1e-06
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   eigenvalue_tol ............... 0.01
[default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print]   eigenvalue_verbose ........... False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   elasticity_enabled ........... False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   flops_profiler_config ........ {
[default0]:    "enabled": false, 
[default0]:    "profile_step": 1, 
[default0]:    "module_depth": -1, 
[default0]:    "top_modules": 1, 
[default0]:    "detailed": true, 
[default0]:    "output_file": null
[default0]:}
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   fp16_enabled ................. False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   fp16_master_weights_and_gradients  False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   fp16_mixed_quantize .......... False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   global_rank .................. 0
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   gradient_accumulation_steps .. 128
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   gradient_clipping ............ 1.0
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   gradient_predivide_factor .... 1.0
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   initial_dynamic_scale ........ 1
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   loss_scale ................... 1.0
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   memory_breakdown ............. False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   optimizer_legacy_fusion ...... False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   optimizer_name ............... None
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   optimizer_params ............. None
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   pld_enabled .................. False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   pld_params ................... False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   prescale_gradients ........... False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   quantize_change_rate ......... 0.001
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   quantize_groups .............. 1
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   quantize_offset .............. 1000
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   quantize_period .............. 1000
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   quantize_rounding ............ 0
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   quantize_start_bits .......... 16
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   quantize_target_bits ......... 8
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   quantize_training_enabled .... False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   quantize_type ................ 0
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   quantize_verbose ............. False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   scheduler_name ............... None
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   scheduler_params ............. None
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   sparse_attention ............. None
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   sparse_gradients_enabled ..... False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   steps_per_print .............. 2000
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   tensorboard_enabled .......... False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   tensorboard_job_name ......... DeepSpeedJobName
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   tensorboard_output_path ...... 
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   train_batch_size ............. 2048
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   train_micro_batch_size_per_gpu  2
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   use_quantizer_kernel ......... False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   wall_clock_breakdown ......... False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   world_size ................... 8
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   zero_allow_untested_optimizer  False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   zero_config .................. {
[default0]:    "stage": 0, 
[default0]:    "contiguous_gradients": true, 
[default0]:    "reduce_scatter": true, 
[default0]:    "reduce_bucket_size": 5.000000e+08, 
[default0]:    "allgather_partitions": true, 
[default0]:    "allgather_bucket_size": 5.000000e+08, 
[default0]:    "overlap_comm": false, 
[default0]:    "load_from_fp32_weights": true, 
[default0]:    "elastic_checkpoint": false, 
[default0]:    "offload_param": null, 
[default0]:    "offload_optimizer": null, 
[default0]:    "sub_group_size": 1.000000e+09, 
[default0]:    "prefetch_bucket_size": 5.000000e+07, 
[default0]:    "param_persistence_threshold": 1.000000e+05, 
[default0]:    "max_live_parameters": 1.000000e+09, 
[default0]:    "max_reuse_distance": 1.000000e+09, 
[default0]:    "gather_16bit_weights_on_model_save": false, 
[default0]:    "ignore_unused_parameters": true, 
[default0]:    "round_robin_gradients": false, 
[default0]:    "legacy_stage1": false
[default0]:}
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   zero_enabled ................. False
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print]   zero_optimization_stage ...... 0
[default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1063:print]   json = {
[default0]:    "train_micro_batch_size_per_gpu": 2, 
[default0]:    "train_batch_size": 2.048000e+03, 
[default0]:    "gradient_clipping": 1.0, 
[default0]:    "zero_optimization": {
[default0]:        "stage": 0
[default0]:    }, 
[default0]:    "bf16": {
[default0]:        "enabled": true
[default0]:    }, 
[default0]:    "steps_per_print": 2.000000e+03, 
[default0]:    "wall_clock_breakdown": false
[default0]:}
[default0]:[2022-03-04 04:09:11,178] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=128 micro_batch_size=2
[default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=323 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=321 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=322 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=320 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=128 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=194 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=192 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=193 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=195 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=97 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=99 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=96 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=131 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=129 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=130 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=98 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=224 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=227 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=225 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=353 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=354 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=355 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=352 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=226 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=66 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=64 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=65 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=67 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=256 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=257 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=259 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=33 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=34 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=32 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=35 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=289 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=2 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=288 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=1 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=3 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=291 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=258 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=290 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=160 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=161 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=163 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=162 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]: > using checkpoint value 6e-05 for learning rate
[default0]: > using checkpoint value 6e-06 for minimum learning rate
[default0]: > using checkpoint value 183105 for warmup iterations
[default0]: > using checkpoint value 200000000 for total number of iterations
[default0]: > using checkpoint value cosine for decay style
[default0]:[2022-03-04 04:09:25,632] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 120
[default0]:[2022-03-04 04:09:26,492] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 120
[default4]:[2022-03-04 04:09:27,414] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 124
[default0]:[2022-03-04 04:09:27,431] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 40
[default2]:[2022-03-04 04:09:28,021] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 122
[default4]:[2022-03-04 04:09:28,258] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 44
[default3]:[2022-03-04 04:09:28,390] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 331
[default0]:[2022-03-04 04:09:28,365] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 40
[default4]:[2022-03-04 04:09:28,445] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 124
[default4]:[2022-03-04 04:09:28,601] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 180
[default0]:[2022-03-04 04:09:28,946] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 272
[default2]:[2022-03-04 04:09:29,017] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 122
[default4]:[2022-03-04 04:09:29,143] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 44
[default0]:[2022-03-04 04:09:29,240] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 48
[default3]:[2022-03-04 04:09:29,340] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 331
[default4]:[2022-03-04 04:09:29,257] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 236
[default4]:[2022-03-04 04:09:29,437] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 332
[default4]:[2022-03-04 04:09:29,554] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 372
[default7]:[2022-03-04 04:09:29,472] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 167
[default4]:[2022-03-04 04:09:29,660] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 252
[default4]:[2022-03-04 04:09:29,671] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 180
[default0]:[2022-03-04 04:09:29,679] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 104
[default6]:[2022-03-04 04:09:29,795] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 278
[default4]:[2022-03-04 04:09:29,833] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 28
[default5]:[2022-03-04 04:09:29,893] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 277
[default7]:[2022-03-04 04:09:29,913] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 335
[default0]:[2022-03-04 04:09:29,898] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 24
[default2]:[2022-03-04 04:09:30,018] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 274
[default0]:[2022-03-04 04:09:30,019] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 272
[default6]:[2022-03-04 04:09:29,989] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 238
[default6]:[2022-03-04 04:09:29,987] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 126
[default4]:[2022-03-04 04:09:30,057] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 276
[default1]:[2022-03-04 04:09:30,056] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 121
[default0]:[2022-03-04 04:09:30,065] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 48
[default1]:[2022-03-04 04:09:30,085] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 233
[default5]:[2022-03-04 04:09:30,119] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 125
[default4]:[2022-03-04 04:09:30,145] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 236
[default3]:[2022-03-04 04:09:30,189] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 123
[default4]:[2022-03-04 04:09:30,149] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 316
[default0]:[2022-03-04 04:09:30,269] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 248
[default0]:[2022-03-04 04:09:30,248] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 352
[default4]:[2022-03-04 04:09:30,340] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 308
[default4]:[2022-03-04 04:09:30,323] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 332
[default5]:[2022-03-04 04:09:30,344] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 237
[default7]:[2022-03-04 04:09:30,322] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 167
[default4]:[2022-03-04 04:09:30,419] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 108
[default1]:[2022-03-04 04:09:30,367] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 329
[default4]:[2022-03-04 04:09:30,393] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 20
[default7]:[2022-03-04 04:09:30,472] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 279
[default6]:[2022-03-04 04:09:30,499] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 334
[default4]:[2022-03-04 04:09:30,509] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 252
[default0]:[2022-03-04 04:09:30,448] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 232
[default2]:[2022-03-04 04:09:30,533] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 42
[default4]:[2022-03-04 04:09:30,521] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 372
[default7]:[2022-03-04 04:09:30,632] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 127
[default0]:[2022-03-04 04:09:30,550] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 104
[default2]:[2022-03-04 04:09:30,558] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 314
[default0]:[2022-03-04 04:09:30,675] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 328
[default4]:[2022-03-04 04:09:30,738] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 284
[default0]:[2022-03-04 04:09:30,715] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 280
[default5]:[2022-03-04 04:09:30,695] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 157
[default4]:[2022-03-04 04:09:30,724] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 164
[default0]:[2022-03-04 04:09:30,672] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 160
[default0]:[2022-03-04 04:09:30,810] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 24
[default3]:[2022-03-04 04:09:30,811] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 163
[default6]:[2022-03-04 04:09:30,783] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 166
[default4]:[2022-03-04 04:09:30,909] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 52
[default5]:[2022-03-04 04:09:30,875] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 277
[default7]:[2022-03-04 04:09:30,859] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 335
[default6]:[2022-03-04 04:09:30,939] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 278
[default6]:[2022-03-04 04:09:30,900] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 238
[default0]:[2022-03-04 04:09:30,868] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 344
[default4]:[2022-03-04 04:09:30,951] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 28
[default2]:[2022-03-04 04:09:30,935] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 162
[default1]:[2022-03-04 04:09:31,043] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 177
[default5]:[2022-03-04 04:09:30,986] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 317
[default4]:[2022-03-04 04:09:31,048] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 316
[default0]:[2022-03-04 04:09:31,135] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 304
[default1]:[2022-03-04 04:09:31,103] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 233
[default1]:[2022-03-04 04:09:31,067] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 273
[default4]:[2022-03-04 04:09:31,086] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 348
[default0]:[2022-03-04 04:09:31,094] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 336
[default6]:[2022-03-04 04:09:31,087] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 30
[default2]:[2022-03-04 04:09:31,152] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 274
[default4]:[2022-03-04 04:09:31,225] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 276
[default4]:[2022-03-04 04:09:31,183] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 308
[default2]:[2022-03-04 04:09:31,242] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 154
[default6]:[2022-03-04 04:09:31,198] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 46
[default0]:[2022-03-04 04:09:31,175] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 368
[default0]:[2022-03-04 04:09:31,334] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 248
[default0]:[2022-03-04 04:09:31,249] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 352
[default2]:[2022-03-04 04:09:31,330] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 234
[default0]:[2022-03-04 04:09:31,315] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 56
[default4]:[2022-03-04 04:09:31,341] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 20
[default5]:[2022-03-04 04:09:31,270] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 237
[default3]:[2022-03-04 04:09:31,406] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 275
[default6]:[2022-03-04 04:09:31,362] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 334
[default4]:[2022-03-04 04:09:31,372] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 108
[default0]:[2022-03-04 04:09:31,358] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 232
[default2]:[2022-03-04 04:09:31,414] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 314
[default1]:[2022-03-04 04:09:31,408] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 329
[default2]:[2022-03-04 04:09:31,404] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 42
[default3]:[2022-03-04 04:09:31,486] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 123
[default4]:[2022-03-04 04:09:31,516] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 196
[default0]:[2022-03-04 04:09:31,458] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 192
[default5]:[2022-03-04 04:09:31,535] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 125
[default5]:[2022-03-04 04:09:31,492] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 333
[default4]:[2022-03-04 04:09:31,494] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 116
[default2]:[2022-03-04 04:09:31,534] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 330
[default5]:[2022-03-04 04:09:31,551] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 157
[default0]:[2022-03-04 04:09:31,537] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 152
[default4]:[2022-03-04 04:09:31,627] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 300
[default1]:[2022-03-04 04:09:31,586] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 121
[default0]:[2022-03-04 04:09:31,622] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 328
[default6]:[2022-03-04 04:09:31,574] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 126
[default0]:[2022-03-04 04:09:31,593] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 168
[default4]:[2022-03-04 04:09:31,611] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 156
[default0]:[2022-03-04 04:09:31,573] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 280
[default1]:[2022-03-04 04:09:31,607] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 161
[default7]:[2022-03-04 04:09:31,644] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 279
[default5]:[2022-03-04 04:09:31,671] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 53
[default6]:[2022-03-04 04:09:31,669] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 62
[default7]:[2022-03-04 04:09:31,665] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 63
[default4]:[2022-03-04 04:09:31,663] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 284
[default3]:[2022-03-04 04:09:31,709] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 171
[default4]:[2022-03-04 04:09:31,810] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 52
[default7]:[2022-03-04 04:09:31,748] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 127
[default5]:[2022-03-04 04:09:31,802] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 109
[default1]:[2022-03-04 04:09:31,762] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 193
[default3]:[2022-03-04 04:09:31,836] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 235
[default2]:[2022-03-04 04:09:31,787] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 74
[default3]:[2022-03-04 04:09:31,798] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 155
[default6]:[2022-03-04 04:09:31,784] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 374
[default1]:[2022-03-04 04:09:31,862] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 313
[default6]:[2022-03-04 04:09:31,863] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 182
[default5]:[2022-03-04 04:09:31,906] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 317
[default3]:[2022-03-04 04:09:31,933] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 315
[default5]:[2022-03-04 04:09:31,872] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 189
[default5]:[2022-03-04 04:09:31,935] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 45
[default4]:[2022-03-04 04:09:31,917] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 148
[default0]:[2022-03-04 04:09:31,911] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 336
[default1]:[2022-03-04 04:09:31,973] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 105
[default1]:[2022-03-04 04:09:31,986] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 177
[default0]:[2022-03-04 04:09:31,958] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 176
[default4]:[2022-03-04 04:09:32,037] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 228
[default1]:[2022-03-04 04:09:32,044] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 273
[default0]:[2022-03-04 04:09:31,999] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 344
[default0]:[2022-03-04 04:09:31,986] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 312
[default0]:[2022-03-04 04:09:31,991] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 160
[default2]:[2022-03-04 04:09:31,971] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 162
[default0]:[2022-03-04 04:09:32,081] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 304
[default2]:[2022-03-04 04:09:32,088] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 178
[default7]:[2022-03-04 04:09:32,133] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 183
[default5]:[2022-03-04 04:09:32,128] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 181
[default7]:[2022-03-04 04:09:32,058] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 175
[default3]:[2022-03-04 04:09:32,152] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 115
[default0]:[2022-03-04 04:09:32,088] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 72
[default6]:[2022-03-04 04:09:32,150] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 46
[default1]:[2022-03-04 04:09:32,110] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 153
[default6]:[2022-03-04 04:09:32,127] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 30
[default2]:[2022-03-04 04:09:32,193] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 234
[default0]:[2022-03-04 04:09:32,197] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 184
[default4]:[2022-03-04 04:09:32,246] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 348
[default2]:[2022-03-04 04:09:32,186] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 154
[default7]:[2022-03-04 04:09:32,232] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 159
[default0]:[2022-03-04 04:09:32,204] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 368
[default2]:[2022-03-04 04:09:32,190] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 26
[default4]:[2022-03-04 04:09:32,237] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 164
[default3]:[2022-03-04 04:09:32,257] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 163
[default6]:[2022-03-04 04:09:32,236] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 166
[default3]:[2022-03-04 04:09:32,312] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 275
[default5]:[2022-03-04 04:09:32,269] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 253
[default4]:[2022-03-04 04:09:32,268] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 76
[default7]:[2022-03-04 04:09:32,266] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 191
[default7]:[2022-03-04 04:09:32,321] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 375
[default0]:[2022-03-04 04:09:32,427] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 192
[default4]:[2022-03-04 04:09:32,406] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 68
[default3]:[2022-03-04 04:09:32,373] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 59
[default1]:[2022-03-04 04:09:32,359] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 185
[default5]:[2022-03-04 04:09:32,374] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 333
[default0]:[2022-03-04 04:09:32,435] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 56
[default2]:[2022-03-04 04:09:32,430] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 266
[default0]:[2022-03-04 04:09:32,434] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 168
[default4]:[2022-03-04 04:09:32,360] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 172
[default7]:[2022-03-04 04:09:32,365] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 47
[default3]:[2022-03-04 04:09:32,407] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 147
[default4]:[2022-03-04 04:09:32,384] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 340
[default3]:[2022-03-04 04:09:32,432] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 27
[default1]:[2022-03-04 04:09:32,530] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 305
[default4]:[2022-03-04 04:09:32,517] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 196
[default0]:[2022-03-04 04:09:32,499] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 128
[default2]:[2022-03-04 04:09:32,532] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 114
[default4]:[2022-03-04 04:09:32,490] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 116
[default2]:[2022-03-04 04:09:32,486] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 58
[default2]:[2022-03-04 04:09:32,501] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 330
[default0]:[2022-03-04 04:09:32,472] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 264
[default3]:[2022-03-04 04:09:32,548] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 43
[default6]:[2022-03-04 04:09:32,490] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 262
[default1]:[2022-03-04 04:09:32,552] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 161
[default7]:[2022-03-04 04:09:32,542] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 31
[default5]:[2022-03-04 04:09:32,585] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 53
[default3]:[2022-03-04 04:09:32,637] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 187
[default1]:[2022-03-04 04:09:32,625] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 57
[default4]:[2022-03-04 04:09:32,636] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 268
[default3]:[2022-03-04 04:09:32,626] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 307
[default5]:[2022-03-04 04:09:32,627] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 309
[default6]:[2022-03-04 04:09:32,634] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 310
[default7]:[2022-03-04 04:09:32,553] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 255
[default7]:[2022-03-04 04:09:32,601] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 239
[default6]:[2022-03-04 04:09:32,597] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 174
[default4]:[2022-03-04 04:09:32,625] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 292
[default6]:[2022-03-04 04:09:32,577] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 118
[default5]:[2022-03-04 04:09:32,638] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 77
[default7]:[2022-03-04 04:09:32,652] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 319
[default3]:[2022-03-04 04:09:32,611] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 171
[default6]:[2022-03-04 04:09:32,674] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 54
[default7]:[2022-03-04 04:09:32,654] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 311
[default5]:[2022-03-04 04:09:32,698] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 109
[default5]:[2022-03-04 04:09:32,734] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 197
[default1]:[2022-03-04 04:09:32,735] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 193
[default3]:[2022-03-04 04:09:32,668] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 235
[default3]:[2022-03-04 04:09:32,681] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 251
[default4]:[2022-03-04 04:09:32,658] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 60
[default2]:[2022-03-04 04:09:32,680] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 306
[default6]:[2022-03-04 04:09:32,747] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 318
[default0]:[2022-03-04 04:09:32,686] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 0
[default0]:[2022-03-04 04:09:32,661] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 32
[default4]:[2022-03-04 04:09:32,665] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 156
[default0]:[2022-03-04 04:09:32,660] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 208
[default0]:[2022-03-04 04:09:32,676] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 152
[default5]:[2022-03-04 04:09:32,729] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 165
[default4]:[2022-03-04 04:09:32,787] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 300
[default1]:[2022-03-04 04:09:32,771] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 297
[default4]:[2022-03-04 04:09:32,841] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 140
[default6]:[2022-03-04 04:09:32,781] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 62
[default2]:[2022-03-04 04:09:32,756] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 50
[default1]:[2022-03-04 04:09:32,831] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 49
[default2]:[2022-03-04 04:09:32,830] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 74
[default5]:[2022-03-04 04:09:32,799] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 45
[default4]:[2022-03-04 04:09:32,832] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 148
[default6]:[2022-03-04 04:09:32,834] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 78
[default4]:[2022-03-04 04:09:32,840] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 380
[default1]:[2022-03-04 04:09:32,834] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 337
[default3]:[2022-03-04 04:09:32,854] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 339
[default0]:[2022-03-04 04:09:32,856] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 216
[default0]:[2022-03-04 04:09:32,932] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 296
[default7]:[2022-03-04 04:09:32,867] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 199
[default2]:[2022-03-04 04:09:32,876] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 106
[default5]:[2022-03-04 04:09:32,918] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 61
[default4]:[2022-03-04 04:09:32,917] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 228
[default1]:[2022-03-04 04:09:32,951] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 169
[default5]:[2022-03-04 04:09:32,864] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 189
[default4]:[2022-03-04 04:09:32,876] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 188
[default3]:[2022-03-04 04:09:32,861] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 155
[default1]:[2022-03-04 04:09:32,900] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 73
[default5]:[2022-03-04 04:09:32,986] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 301
[default1]:[2022-03-04 04:09:32,970] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 105
[default3]:[2022-03-04 04:09:32,984] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 107
[default1]:[2022-03-04 04:09:33,003] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 249
[default7]:[2022-03-04 04:09:33,040] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 63
[default1]:[2022-03-04 04:09:32,963] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 313
[default1]:[2022-03-04 04:09:32,978] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 113
[default3]:[2022-03-04 04:09:32,969] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 315
[default4]:[2022-03-04 04:09:33,036] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 36
[default4]:[2022-03-04 04:09:32,999] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 212
[default1]:[2022-03-04 04:09:33,025] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 153
[default0]:[2022-03-04 04:09:32,975] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 376
[default6]:[2022-03-04 04:09:32,974] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 374
[default3]:[2022-03-04 04:09:33,100] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 179
[default0]:[2022-03-04 04:09:33,070] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 176
[default2]:[2022-03-04 04:09:33,117] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 282
[default0]:[2022-03-04 04:09:33,147] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 112
[default7]:[2022-03-04 04:09:33,103] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 111
[default2]:[2022-03-04 04:09:33,146] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 250
[default0]:[2022-03-04 04:09:33,104] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 72
[default0]:[2022-03-04 04:09:33,084] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 312
[default7]:[2022-03-04 04:09:33,096] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 159
[default7]:[2022-03-04 04:09:33,151] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 55
[default6]:[2022-03-04 04:09:33,167] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 110
[default0]:[2022-03-04 04:09:33,165] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 64
[default6]:[2022-03-04 04:09:33,183] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 270
[default2]:[2022-03-04 04:09:33,211] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 170
[default7]:[2022-03-04 04:09:33,173] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 175
[default3]:[2022-03-04 04:09:33,213] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 291
[default6]:[2022-03-04 04:09:33,242] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 158
[default7]:[2022-03-04 04:09:33,187] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 79
[default0]:[2022-03-04 04:09:33,234] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 144
[default1]:[2022-03-04 04:09:33,193] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 265
[default3]:[2022-03-04 04:09:33,199] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 371
[default3]:[2022-03-04 04:09:33,314] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 91
[default3]:[2022-03-04 04:09:33,339] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 195
[default0]:[2022-03-04 04:09:33,324] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 128
[default5]:[2022-03-04 04:09:33,285] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 253
[default3]:[2022-03-04 04:09:33,266] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 115
[default4]:[2022-03-04 04:09:33,283] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 76
[default7]:[2022-03-04 04:09:33,316] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 47
[default5]:[2022-03-04 04:09:33,337] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 293
[default7]:[2022-03-04 04:09:33,331] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 151
[default4]:[2022-03-04 04:09:33,264] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 340
[default2]:[2022-03-04 04:09:33,358] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 338
[default2]:[2022-03-04 04:09:33,342] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 26
[default2]:[2022-03-04 04:09:33,306] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 370
[default4]:[2022-03-04 04:09:33,349] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 324
[default1]:[2022-03-04 04:09:33,413] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 65
[default3]:[2022-03-04 04:09:33,394] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 59
[default6]:[2022-03-04 04:09:33,356] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 182
[default0]:[2022-03-04 04:09:33,373] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 184
[default5]:[2022-03-04 04:09:33,436] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 349
[default7]:[2022-03-04 04:09:33,438] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 351
[default7]:[2022-03-04 04:09:33,442] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 239
[default5]:[2022-03-04 04:09:33,401] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 173
[default1]:[2022-03-04 04:09:33,405] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 345
[default2]:[2022-03-04 04:09:33,434] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 346
[default6]:[2022-03-04 04:09:33,384] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 350
[default3]:[2022-03-04 04:09:33,365] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 75
[default3]:[2022-03-04 04:09:33,410] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 43
[default3]:[2022-03-04 04:09:33,445] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 147
[default5]:[2022-03-04 04:09:33,404] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 341
[default1]:[2022-03-04 04:09:33,385] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 25
[default1]:[2022-03-04 04:09:33,430] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 369
[default7]:[2022-03-04 04:09:33,380] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 375
[default3]:[2022-03-04 04:09:33,463] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 51
[default2]:[2022-03-04 04:09:33,479] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 178
[default4]:[2022-03-04 04:09:33,490] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 68
[default7]:[2022-03-04 04:09:33,481] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 271
[default7]:[2022-03-04 04:09:33,487] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 191
[default2]:[2022-03-04 04:09:33,553] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 226
[default1]:[2022-03-04 04:09:33,537] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 289
[default3]:[2022-03-04 04:09:33,480] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 27
[default5]:[2022-03-04 04:09:33,545] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 165
[default6]:[2022-03-04 04:09:33,481] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 342
[default6]:[2022-03-04 04:09:33,570] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 198
[default7]:[2022-03-04 04:09:33,642] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 183
[default1]:[2022-03-04 04:09:33,625] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 57
[default2]:[2022-03-04 04:09:33,593] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 58
[default5]:[2022-03-04 04:09:33,640] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 117
[default6]:[2022-03-04 04:09:33,605] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 230
[default4]:[2022-03-04 04:09:33,613] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 244
[default5]:[2022-03-04 04:09:33,634] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 229
[default7]:[2022-03-04 04:09:33,591] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 255
[default2]:[2022-03-04 04:09:33,612] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 266
[default0]:[2022-03-04 04:09:33,564] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 0
[default0]: checkpoint version 3.0
[default7]:[2022-03-04 04:09:33,580] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 23
[default6]:[2022-03-04 04:09:33,633] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 254
[default1]:[2022-03-04 04:09:33,583] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 17
[default0]:[2022-03-04 04:09:33,610] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 32
[default4]:[2022-03-04 04:09:33,574] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 292
[default7]:[2022-03-04 04:09:33,594] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 119
[default2]:[2022-03-04 04:09:33,629] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 258
[default1]:[2022-03-04 04:09:33,649] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 41
[default6]:[2022-03-04 04:09:33,610] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 262
[default0]:[2022-03-04 04:09:33,566] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 256
[default5]:[2022-03-04 04:09:33,606] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 29
[default1]:[2022-03-04 04:09:33,669] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 89
[default5]:[2022-03-04 04:09:33,667] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 93
[default0]:[2022-03-04 04:09:33,651] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 320
[default2]:[2022-03-04 04:09:33,721] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 194
[default6]:[2022-03-04 04:09:33,737] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 70
[default4]:[2022-03-04 04:09:33,738] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 60
[default0]:[2022-03-04 04:09:33,699] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 224
[default2]:[2022-03-04 04:09:33,714] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 114
[default5]:[2022-03-04 04:09:33,653] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 181
[default6]:[2022-03-04 04:09:33,729] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 318
[default6]:[2022-03-04 04:09:33,666] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 174
[default7]:[2022-03-04 04:09:33,735] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 295
[default4]:[2022-03-04 04:09:33,748] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 204
[default0]:[2022-03-04 04:09:33,704] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 208
[default7]:[2022-03-04 04:09:33,720] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 319
[default2]:[2022-03-04 04:09:33,784] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 186
[default3]:[2022-03-04 04:09:33,750] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 131
[default4]:[2022-03-04 04:09:33,795] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 140
[default1]:[2022-03-04 04:09:33,838] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 185
[default2]:[2022-03-04 04:09:33,754] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 306
[default7]:[2022-03-04 04:09:33,828] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 199
[default0]:[2022-03-04 04:09:33,832] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 264
[default4]:[2022-03-04 04:09:33,826] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 172
[default3]:[2022-03-04 04:09:33,767] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 347
[default6]:[2022-03-04 04:09:33,801] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 190
[default6]:[2022-03-04 04:09:33,783] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 118
[default3]:[2022-03-04 04:09:33,830] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 283
[default3]:[2022-03-04 04:09:33,851] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 267
[default1]:[2022-03-04 04:09:33,846] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 337
[default0]:[2022-03-04 04:09:33,775] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 216
[default7]:[2022-03-04 04:09:33,789] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 31
[default6]:[2022-03-04 04:09:33,863] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 54
[default1]:[2022-03-04 04:09:33,929] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 297
[default5]:[2022-03-04 04:09:33,865] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 197
[default4]:[2022-03-04 04:09:33,880] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 132
[default5]:[2022-03-04 04:09:33,900] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 309
[default0]:[2022-03-04 04:09:33,898] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 16
[default1]:[2022-03-04 04:09:33,860] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 49
[default2]:[2022-03-04 04:09:33,860] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 50
[default1]:[2022-03-04 04:09:33,923] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 169
[default4]:[2022-03-04 04:09:33,881] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 380
[default3]:[2022-03-04 04:09:33,893] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 339
[default5]:[2022-03-04 04:09:33,881] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 373
[default3]:[2022-03-04 04:09:34,022] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 67
[default3]:[2022-03-04 04:09:33,957] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 179
[default5]:[2022-03-04 04:09:34,027] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 61
[default2]:[2022-03-04 04:09:33,995] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 282
[default7]:[2022-03-04 04:09:33,991] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 287
[default4]:[2022-03-04 04:09:34,052] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 36
[default6]:[2022-03-04 04:09:33,998] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 214
[default0]:[2022-03-04 04:09:33,979] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 200
[default7]:[2022-03-04 04:09:34,060] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 343
[default0]:[2022-03-04 04:09:34,005] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 376
[default1]:[2022-03-04 04:09:34,117] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 305
[default6]:[2022-03-04 04:09:34,115] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 326
[default0]:[2022-03-04 04:09:34,054] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 96
[default3]:[2022-03-04 04:09:34,049] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 187
[default2]:[2022-03-04 04:09:34,072] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 106
[default1]:[2022-03-04 04:09:34,088] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 129
[default7]:[2022-03-04 04:09:34,098] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 71
[default5]:[2022-03-04 04:09:34,131] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 69
[default6]:[2022-03-04 04:09:34,063] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 286
[default3]:[2022-03-04 04:09:34,138] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 251
[default7]:[2022-03-04 04:09:34,151] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 231
[default5]:[2022-03-04 04:09:34,117] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 21
[default3]:[2022-03-04 04:09:34,117] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 203
[default6]:[2022-03-04 04:09:34,137] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 158
[default5]:[2022-03-04 04:09:34,098] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 77
[default4]:[2022-03-04 04:09:34,108] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 188
[default0]:[2022-03-04 04:09:34,139] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 288
[default7]:[2022-03-04 04:09:34,141] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 207
[default7]:[2022-03-04 04:09:34,072] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 303
[default7]:[2022-03-04 04:09:34,116] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 383
[default7]:[2022-03-04 04:09:34,185] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 55
[default7]:[2022-03-04 04:09:34,185] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 311
[default0]:[2022-03-04 04:09:34,242] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 64
[default3]:[2022-03-04 04:09:34,165] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 307
[default0]:[2022-03-04 04:09:34,168] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 112
[default7]:[2022-03-04 04:09:34,232] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 95
[default5]:[2022-03-04 04:09:34,207] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 269
[default2]:[2022-03-04 04:09:34,236] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 18
[default1]:[2022-03-04 04:09:34,244] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 73
[default4]:[2022-03-04 04:09:34,223] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 212
[default4]:[2022-03-04 04:09:34,226] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 12
[default4]:[2022-03-04 04:09:34,276] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 324
[default0]:[2022-03-04 04:09:34,256] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 296
[default3]:[2022-03-04 04:09:34,247] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 299
[default3]:[2022-03-04 04:09:34,284] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 195
[default2]:[2022-03-04 04:09:34,317] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 66
[default3]:[2022-03-04 04:09:34,288] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 107
[default4]:[2022-03-04 04:09:34,256] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 268
[default1]:[2022-03-04 04:09:34,276] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 113
[default1]:[2022-03-04 04:09:34,273] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 225
[default6]:[2022-03-04 04:09:34,258] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 310
[default5]:[2022-03-04 04:09:34,333] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 285
[default1]:[2022-03-04 04:09:34,316] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 281
[default3]:[2022-03-04 04:09:34,266] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 291
[default6]:[2022-03-04 04:09:34,279] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 22
[default3]:[2022-03-04 04:09:34,293] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 75
[default6]:[2022-03-04 04:09:34,329] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 78
[default3]:[2022-03-04 04:09:34,292] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 379
[default5]:[2022-03-04 04:09:34,320] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 341
[default3]:[2022-03-04 04:09:34,360] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 51
[default5]:[2022-03-04 04:09:34,402] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 301
[default5]:[2022-03-04 04:09:34,348] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 133
[default1]:[2022-03-04 04:09:34,392] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 241
[default5]:[2022-03-04 04:09:34,408] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 245
[default1]:[2022-03-04 04:09:34,419] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 249
[default3]:[2022-03-04 04:09:34,393] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 227
[default2]:[2022-03-04 04:09:34,419] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 170
[default3]:[2022-03-04 04:09:34,355] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 19
[default7]:[2022-03-04 04:09:34,425] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 119
[default5]:[2022-03-04 04:09:34,386] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 293
[default1]:[2022-03-04 04:09:34,447] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 41
[default1]:[2022-03-04 04:09:34,386] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 25
[default2]:[2022-03-04 04:09:34,517] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 250
[default7]:[2022-03-04 04:09:34,466] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 79
[default0]:[2022-03-04 04:09:34,488] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 144
[default5]:[2022-03-04 04:09:34,520] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 205
[default2]:[2022-03-04 04:09:34,508] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 338
[default5]:[2022-03-04 04:09:34,507] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 29
[default6]:[2022-03-04 04:09:34,565] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 342
[default3]:[2022-03-04 04:09:34,587] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 91
[default0]:[2022-03-04 04:09:34,564] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 88
[default6]:[2022-03-04 04:09:34,623] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 110
[default6]:[2022-03-04 04:09:34,615] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 198
[default1]:[2022-03-04 04:09:34,590] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 65
[default0]:[2022-03-04 04:09:34,628] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 136
[default5]:[2022-03-04 04:09:34,571] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 117
[default7]:[2022-03-04 04:09:34,610] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 111
[default4]:[2022-03-04 04:09:34,603] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 84
[default5]:[2022-03-04 04:09:34,629] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 37
[default2]:[2022-03-04 04:09:34,567] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 34
[default5]:[2022-03-04 04:09:34,635] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 173
[default0]:[2022-03-04 04:09:34,606] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 240
[default2]:[2022-03-04 04:09:34,604] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 226
[default1]:[2022-03-04 04:09:34,625] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 289
[default5]:[2022-03-04 04:09:34,619] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 261
[default7]:[2022-03-04 04:09:34,621] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 151
[default6]:[2022-03-04 04:09:34,656] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 150
[default7]:[2022-03-04 04:09:34,637] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 263
[default1]:[2022-03-04 04:09:34,577] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 257
[default2]:[2022-03-04 04:09:34,578] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 370
[default3]:[2022-03-04 04:09:34,567] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 371
[default5]:[2022-03-04 04:09:34,650] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 13
[default5]:[2022-03-04 04:09:34,571] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 221
[default0]:[2022-03-04 04:09:34,708] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 320
[default2]:[2022-03-04 04:09:34,715] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 194
[default2]:[2022-03-04 04:09:34,748] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 298
[default4]:[2022-03-04 04:09:34,701] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 100
[default4]:[2022-03-04 04:09:34,739] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 244
[default5]:[2022-03-04 04:09:34,720] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 85
[default7]:[2022-03-04 04:09:34,673] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 23
[default6]:[2022-03-04 04:09:34,667] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 294
[default6]:[2022-03-04 04:09:34,751] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 350
[default1]:[2022-03-04 04:09:34,691] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 201
[default2]:[2022-03-04 04:09:34,679] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 258
[default0]:[2022-03-04 04:09:34,709] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 256
[default2]:[2022-03-04 04:09:34,672] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 290
[default4]:[2022-03-04 04:09:34,836] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 92
[default2]:[2022-03-04 04:09:34,776] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 186
[default6]:[2022-03-04 04:09:34,822] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 230
[default5]:[2022-03-04 04:09:34,772] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 229
[default7]:[2022-03-04 04:09:34,762] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 271
[default4]:[2022-03-04 04:09:34,798] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 204
[default6]:[2022-03-04 04:09:34,828] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 190
[default3]:[2022-03-04 04:09:34,834] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 283
[default6]:[2022-03-04 04:09:34,846] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 382
[default1]:[2022-03-04 04:09:34,780] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 369
[default0]:[2022-03-04 04:09:34,899] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 80
[default6]:[2022-03-04 04:09:34,895] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 70
[default4]:[2022-03-04 04:09:34,904] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 364
[default0]:[2022-03-04 04:09:34,895] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 224
[default6]:[2022-03-04 04:09:34,899] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 270
[default6]:[2022-03-04 04:09:34,859] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 254
[default4]:[2022-03-04 04:09:34,896] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 356
[default7]:[2022-03-04 04:09:34,945] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 295
[default2]:[2022-03-04 04:09:34,922] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 202
[default2]:[2022-03-04 04:09:34,882] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 378
[default5]:[2022-03-04 04:09:34,926] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 381
[default7]:[2022-03-04 04:09:34,884] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 343
[default4]:[2022-03-04 04:09:34,927] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 260
[default1]:[2022-03-04 04:09:34,912] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 217
[default2]:[2022-03-04 04:09:34,962] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 10
[default1]:[2022-03-04 04:09:35,013] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 89
[default2]:[2022-03-04 04:09:35,041] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 90
[default0]:[2022-03-04 04:09:34,959] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 96
[default7]:[2022-03-04 04:09:35,041] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 71
[default3]:[2022-03-04 04:09:34,967] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 139
[default6]:[2022-03-04 04:09:35,024] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 38
[default6]:[2022-03-04 04:09:35,003] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 214
[default6]:[2022-03-04 04:09:35,020] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 206
[default3]:[2022-03-04 04:09:35,023] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 267
[default1]:[2022-03-04 04:09:34,973] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 265
[default7]:[2022-03-04 04:09:34,995] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 303
[default1]:[2022-03-04 04:09:35,016] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 377
[default5]:[2022-03-04 04:09:35,121] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 93
[default3]:[2022-03-04 04:09:35,139] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 299
[default6]:[2022-03-04 04:09:35,139] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 326
[default3]:[2022-03-04 04:09:35,091] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 67
[default5]:[2022-03-04 04:09:35,112] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 69
[default5]:[2022-03-04 04:09:35,059] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 269
[default3]:[2022-03-04 04:09:35,146] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 35
[default1]:[2022-03-04 04:09:35,125] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 345
[default1]:[2022-03-04 04:09:35,131] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 33
[default3]:[2022-03-04 04:09:35,124] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 347
[default0]:[2022-03-04 04:09:35,138] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 288
[default3]:[2022-03-04 04:09:35,156] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 211
[default5]:[2022-03-04 04:09:35,142] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 149
[default6]:[2022-03-04 04:09:35,158] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 222
[default0]:[2022-03-04 04:09:35,103] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 200
[default5]:[2022-03-04 04:09:35,101] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 373
[default7]:[2022-03-04 04:09:35,164] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 383
[default6]:[2022-03-04 04:09:35,219] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 94
[default2]:[2022-03-04 04:09:35,208] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 322
[default1]:[2022-03-04 04:09:35,202] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 97
[default6]:[2022-03-04 04:09:35,178] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 366
[default3]:[2022-03-04 04:09:35,180] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 363
[default7]:[2022-03-04 04:09:35,246] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 231
[default7]:[2022-03-04 04:09:35,178] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 87
[default3]:[2022-03-04 04:09:35,185] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 83
[default6]:[2022-03-04 04:09:35,201] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 86
[default7]:[2022-03-04 04:09:35,159] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 39
[default0]:[2022-03-04 04:09:35,227] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 16
[default5]:[2022-03-04 04:09:35,189] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 285
[default1]:[2022-03-04 04:09:35,199] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 17
[default1]:[2022-03-04 04:09:35,196] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 81
[default1]:[2022-03-04 04:09:35,173] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 145
[default2]:[2022-03-04 04:09:35,174] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 146
[default7]:[2022-03-04 04:09:35,162] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 207
[default0]:[2022-03-04 04:09:35,187] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 8
[default4]:[2022-03-04 04:09:35,181] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 220
[default3]:[2022-03-04 04:09:35,191] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 11
[default2]:[2022-03-04 04:09:35,323] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 66
[default6]:[2022-03-04 04:09:35,321] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 286
[default3]:[2022-03-04 04:09:35,277] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 227
[default6]:[2022-03-04 04:09:35,297] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 302
[default4]:[2022-03-04 04:09:35,299] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 4
[default7]:[2022-03-04 04:09:35,349] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 351
[default7]:[2022-03-04 04:09:35,333] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 95
[default7]:[2022-03-04 04:09:35,257] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 287
[default1]:[2022-03-04 04:09:35,299] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 281
[default2]:[2022-03-04 04:09:35,304] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 346
[default2]:[2022-03-04 04:09:35,348] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 18
[default3]:[2022-03-04 04:09:35,271] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 203
[default3]:[2022-03-04 04:09:35,294] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 259
[default3]:[2022-03-04 04:09:35,355] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 379
[default7]:[2022-03-04 04:09:35,343] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 15
[default4]:[2022-03-04 04:09:35,362] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 12
[default0]:[2022-03-04 04:09:35,447] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 88
[default2]:[2022-03-04 04:09:35,393] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 82
[default6]:[2022-03-04 04:09:35,444] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 102
[default3]:[2022-03-04 04:09:35,439] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 131
[default3]:[2022-03-04 04:09:35,423] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 243
[default5]:[2022-03-04 04:09:35,360] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 349
[default7]:[2022-03-04 04:09:35,457] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 215
[default7]:[2022-03-04 04:09:35,525] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 327
[default7]:[2022-03-04 04:09:35,494] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 103
[default0]:[2022-03-04 04:09:35,487] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 136
[default1]:[2022-03-04 04:09:35,454] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 225
[default5]:[2022-03-04 04:09:35,478] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 21
[default6]:[2022-03-04 04:09:35,518] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 294
[default6]:[2022-03-04 04:09:35,550] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 22
[default2]:[2022-03-04 04:09:35,503] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 290
[default7]:[2022-03-04 04:09:35,601] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 135
[default5]:[2022-03-04 04:09:35,563] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 325
[default3]:[2022-03-04 04:09:35,581] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 99
[default5]:[2022-03-04 04:09:35,592] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 101
[default5]:[2022-03-04 04:09:35,613] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 245
[default5]:[2022-03-04 04:09:35,619] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 365
[default4]:[2022-03-04 04:09:35,644] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 100
[default6]:[2022-03-04 04:09:35,629] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 246
[default5]:[2022-03-04 04:09:35,633] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 37
[default2]:[2022-03-04 04:09:35,625] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 34
[default3]:[2022-03-04 04:09:35,615] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 19
[default1]:[2022-03-04 04:09:35,589] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 209
[default6]:[2022-03-04 04:09:35,584] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 150
[default2]:[2022-03-04 04:09:35,573] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 218
[default2]:[2022-03-04 04:09:35,615] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 210
[default1]:[2022-03-04 04:09:35,728] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 241
[default1]:[2022-03-04 04:09:35,729] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 129
[default2]:[2022-03-04 04:09:35,658] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 298
[default7]:[2022-03-04 04:09:35,702] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 367
[default2]:[2022-03-04 04:09:35,687] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 242
[default7]:[2022-03-04 04:09:35,692] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 247
[default5]:[2022-03-04 04:09:35,713] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 205
[default1]:[2022-03-04 04:09:35,748] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 257
[default5]:[2022-03-04 04:09:35,763] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 221
[default7]:[2022-03-04 04:09:35,748] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 223
[default4]:[2022-03-04 04:09:35,832] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 92
[default0]:[2022-03-04 04:09:35,770] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 360
[default6]:[2022-03-04 04:09:35,849] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 134
[default2]:[2022-03-04 04:09:35,755] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 98
[default0]:[2022-03-04 04:09:35,786] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 240
[default1]:[2022-03-04 04:09:35,855] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 9
[default6]:[2022-03-04 04:09:35,822] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 14
[default5]:[2022-03-04 04:09:35,774] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 13
[default2]:[2022-03-04 04:09:35,862] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 130
[default1]:[2022-03-04 04:09:35,947] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 321
[default3]:[2022-03-04 04:09:35,880] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 323
[default5]:[2022-03-04 04:09:35,865] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 85
[default3]:[2022-03-04 04:09:35,889] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 139
[default1]:[2022-03-04 04:09:35,945] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 201
[default2]:[2022-03-04 04:09:35,900] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 202
[default5]:[2022-03-04 04:09:35,913] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 213
[default2]:[2022-03-04 04:09:35,918] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 378
[default7]:[2022-03-04 04:09:35,960] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 263
[default3]:[2022-03-04 04:09:35,936] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 219
[default1]:[2022-03-04 04:09:36,013] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 137
[default4]:[2022-03-04 04:09:36,040] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 132
[default5]:[2022-03-04 04:09:36,018] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 141
[default2]:[2022-03-04 04:09:36,024] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 138
[default2]:[2022-03-04 04:09:36,008] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 146
[default6]:[2022-03-04 04:09:36,034] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 358
[default6]:[2022-03-04 04:09:36,044] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 222
[default2]:[2022-03-04 04:09:36,052] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 90
[default2]:[2022-03-04 04:09:36,137] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 322
[default6]:[2022-03-04 04:09:36,110] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 142
[default4]:[2022-03-04 04:09:36,081] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 364
[default6]:[2022-03-04 04:09:36,144] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 302
[default3]:[2022-03-04 04:09:36,061] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 3
[default2]:[2022-03-04 04:09:36,096] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 2
[default4]:[2022-03-04 04:09:36,117] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 84
[default1]:[2022-03-04 04:09:36,070] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 33
[default5]:[2022-03-04 04:09:36,113] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 261
[default1]:[2022-03-04 04:09:36,086] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 145
[default5]:[2022-03-04 04:09:36,077] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 149
[default6]:[2022-03-04 04:09:36,078] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 206
[default1]:[2022-03-04 04:09:36,085] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 217
[default7]:[2022-03-04 04:09:36,115] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 359
[default1]:[2022-03-04 04:09:36,116] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 377
[default5]:[2022-03-04 04:09:36,175] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 133
[default3]:[2022-03-04 04:09:36,186] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 355
[default1]:[2022-03-04 04:09:36,203] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 361
[default2]:[2022-03-04 04:09:36,221] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 362
[default7]:[2022-03-04 04:09:36,214] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 143
[default4]:[2022-03-04 04:09:36,227] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 356
[default1]:[2022-03-04 04:09:36,224] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 81
[default3]:[2022-03-04 04:09:36,189] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 211
[default6]:[2022-03-04 04:09:36,191] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 6
[default6]:[2022-03-04 04:09:36,261] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 94
[default1]:[2022-03-04 04:09:36,289] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 97
[default2]:[2022-03-04 04:09:36,321] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 354
[default3]:[2022-03-04 04:09:36,346] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 363
[default6]:[2022-03-04 04:09:36,316] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 382
[default4]:[2022-03-04 04:09:36,351] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 260
[default0]:[2022-03-04 04:09:36,305] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 8
[default4]:[2022-03-04 04:09:36,311] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 220
[default7]:[2022-03-04 04:09:36,301] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 7
[default2]:[2022-03-04 04:09:36,325] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 10
[default0]:[2022-03-04 04:09:36,412] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 80
[default1]:[2022-03-04 04:09:36,374] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 1
[default5]:[2022-03-04 04:09:36,380] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 381
[default6]:[2022-03-04 04:09:36,547] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 366
[default4]:[2022-03-04 04:09:36,510] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 4
[default3]:[2022-03-04 04:09:36,465] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 35
[default6]:[2022-03-04 04:09:36,532] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 38
[default5]:[2022-03-04 04:09:36,518] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 357
[default7]:[2022-03-04 04:09:36,516] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 215
[default3]:[2022-03-04 04:09:36,543] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 11
[default3]:[2022-03-04 04:09:36,641] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 243
[default5]:[2022-03-04 04:09:36,601] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 5
[default7]:[2022-03-04 04:09:36,584] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 39
[default1]:[2022-03-04 04:09:36,575] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 209
[default2]:[2022-03-04 04:09:36,655] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 210
[default7]:[2022-03-04 04:09:36,644] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 223
[default5]:[2022-03-04 04:09:36,651] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 101
[default1]:[2022-03-04 04:09:36,725] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 353
[default5]:[2022-03-04 04:09:36,740] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 365
[default3]:[2022-03-04 04:09:36,744] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 259
[default7]:[2022-03-04 04:09:36,805] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 135
[default5]:[2022-03-04 04:09:36,838] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 325
[default6]:[2022-03-04 04:09:36,842] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 102
[default7]:[2022-03-04 04:09:36,810] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 327
[default5]:[2022-03-04 04:09:36,844] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 213
[default7]:[2022-03-04 04:09:36,771] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 15
[default6]:[2022-03-04 04:09:36,866] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 14
[default7]:[2022-03-04 04:09:36,919] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 367
[default2]:[2022-03-04 04:09:36,858] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 98
[default3]:[2022-03-04 04:09:36,945] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 83
[default1]:[2022-03-04 04:09:36,943] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 9
[default2]:[2022-03-04 04:09:36,910] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 218
[default2]:[2022-03-04 04:09:36,969] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 130
[default7]:[2022-03-04 04:09:37,030] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 103
[default3]:[2022-03-04 04:09:37,043] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 99
[default2]:[2022-03-04 04:09:36,961] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 242
[default2]:[2022-03-04 04:09:37,030] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 138
[default1]:[2022-03-04 04:09:37,111] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 321
[default3]:[2022-03-04 04:09:37,105] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 323
[default7]:[2022-03-04 04:09:37,136] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 247
[default6]:[2022-03-04 04:09:37,128] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 246
[default5]:[2022-03-04 04:09:37,221] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 141
[default0]:[2022-03-04 04:09:37,158] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 360
[default6]:[2022-03-04 04:09:37,196] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 134
[default2]:[2022-03-04 04:09:37,299] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 82
[default1]:[2022-03-04 04:09:37,302] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 361
[default2]:[2022-03-04 04:09:37,323] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 362
[default7]:[2022-03-04 04:09:37,302] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 87
[default6]:[2022-03-04 04:09:37,279] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 86
[default3]:[2022-03-04 04:09:37,305] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 219
[default1]:[2022-03-04 04:09:37,358] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 137
[default3]:[2022-03-04 04:09:37,372] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 3
[default6]:[2022-03-04 04:09:37,392] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 358
[default3]:[2022-03-04 04:09:37,536] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 355
[default6]:[2022-03-04 04:09:37,486] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 142
[default7]:[2022-03-04 04:09:37,503] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 143
[default1]:[2022-03-04 04:09:37,611] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 1
[default5]:[2022-03-04 04:09:37,572] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 357
[default6]:[2022-03-04 04:09:37,654] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 6
[default7]:[2022-03-04 04:09:37,632] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 359
[default2]:[2022-03-04 04:09:37,746] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 354
[default2]:[2022-03-04 04:09:37,714] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 2
[default1]:[2022-03-04 04:09:37,794] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 353
[default0]:  successfully loaded checkpoint from /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints at iteration 4704
[default0]:estimated model parameters: 191.162474496
[default0]:estimated model parameters without embeddings: 148.003086336
[default0]:[after model, optimizer, and learning rate scheduler are built] datetime: 2022-03-04 04:09:37 
[default0]:> building train, validation, and test datasets ...
[default0]: > datasets target sizes (minimum size):
[default0]:    train:      220000000
[default0]:    validation: 2641920
[default0]:    test:       20480
[default0]:> building train, validation, and test datasets for GPT ...
[default0]: > building dataset index ...
[default5]:[2022-03-04 04:09:37,876] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 5
[default0]:/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/utils.py:280: UserWarning: Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings
[default0]:  warnings.warn("Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings")
[default7]:[2022-03-04 04:09:37,903] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 7
[default7]:time (ms) | load-checkpoint: 24741.96
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.110049 seconds
[default0]:    number of documents: 1276214
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 1211127) total of 1211127 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.147 seconds
[default0]:    total number of samples: 19333818
[default0]:    total number of epochs: 41
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.007279 seconds
[default0]:    number of documents: 2218089
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 2104966) total of 2104966 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.172 seconds
[default0]:    total number of samples: 4602461
[default0]:    total number of epochs: 22
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.013273 seconds
[default0]:    number of documents: 14716427
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 13965889) total of 13965889 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.044 seconds
[default0]:    total number of samples: 35728792
[default0]:    total number of epochs: 4
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.015255 seconds
[default0]:    number of documents: 2767535
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 2626391) total of 2626391 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.056 seconds
[default0]:    total number of samples: 28139393
[default0]:    total number of epochs: 28
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.005037 seconds
[default0]:    number of documents: 786245
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 746147) total of 746147 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.029 seconds
[default0]:    total number of samples: 670404
[default0]:    total number of epochs: 22
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.015366 seconds
[default0]:    number of documents: 1748556
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 1659380) total of 1659380 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.134 seconds
[default0]:    total number of samples: 27952020
[default0]:    total number of epochs: 56
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002278 seconds
[default0]:    number of documents: 29464287
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 27961608) total of 27961608 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.046 seconds
[default0]:    total number of samples: 14638800
[default0]:    total number of epochs: 42
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.005770 seconds
[default0]:    number of documents: 38304059
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 36350552) total of 36350552 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.048 seconds
[default0]:    total number of samples: 27308815
[default0]:    total number of epochs: 46
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.005162 seconds
[default0]:    number of documents: 729667
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 692454) total of 692454 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.100 seconds
[default0]:    total number of samples: 6887421
[default0]:    total number of epochs: 22
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002614 seconds
[default0]:    number of documents: 24265522
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 23027980) total of 23027980 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.053 seconds
[default0]:    total number of samples: 10304343
[default0]:    total number of epochs: 25
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.014178 seconds
[default0]:    number of documents: 9587455
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 9098495) total of 9098495 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.059 seconds
[default0]:    total number of samples: 28924755
[default0]:    total number of epochs: 10
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.005628 seconds
[default0]:    number of documents: 4335929
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 4114797) total of 4114797 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.044 seconds
[default0]:    total number of samples: 29929866
[default0]:    total number of epochs: 11
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001051 seconds
[default0]:    number of documents: 149731
[default0]: > dataset split:
[default0]:    train:
[default0]:     document indices in [0, 142095) total of 142095 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.013 seconds
[default0]:    total number of samples: 127855
[default0]:    total number of epochs: 18
[default0]:> building indices for blendable datasets ...
[default0]: > sample ratios:
[default0]:   dataset 0, input: 0.0870676, achieved: 0.0870676
[default0]:   dataset 1, input: 0.0207314, achieved: 0.0207314
[default0]:   dataset 2, input: 0.1247, achieved: 0.1247
[default0]:   dataset 3, input: 0.124182, achieved: 0.124182
[default0]:   dataset 4, input: 0.0029046, achieved: 0.0029046
[default0]:   dataset 5, input: 0.1247, achieved: 0.1247
[default0]:   dataset 6, input: 0.0659275, achieved: 0.0659275
[default0]:   dataset 7, input: 0.120941, achieved: 0.120941
[default0]:   dataset 8, input: 0.0310665, achieved: 0.0310665
[default0]:   dataset 9, input: 0.0454631, achieved: 0.0454631
[default0]:   dataset 10, input: 0.127064, achieved: 0.127064
[default0]:   dataset 11, input: 0.1247, achieved: 0.1247
[default0]:   dataset 12, input: 0.000554406, achieved: 0.000554405
[default0]:> elapsed time for building blendable dataset indices: 4.26 (sec)
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002351 seconds
[default0]:    number of documents: 1276214
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [1211127, 1274938) total of 63811 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.009 seconds
[default0]:    total number of samples: 241146
[default0]:    total number of epochs: 18
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002175 seconds
[default0]:    number of documents: 2218089
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [2104966, 2215871) total of 110905 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.008 seconds
[default0]:    total number of samples: 55872
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.011774 seconds
[default0]:    number of documents: 14716427
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [13965889, 14701711) total of 735822 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.023 seconds
[default0]:    total number of samples: 1880535
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002515 seconds
[default0]:    number of documents: 2767535
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [2626391, 2764767) total of 138376 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.009 seconds
[default0]:    total number of samples: 480297
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.009166 seconds
[default0]:    number of documents: 786245
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [746147, 785459) total of 39312 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.006 seconds
[default0]:    total number of samples: 8487
[default0]:    total number of epochs: 8
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002463 seconds
[default0]:    number of documents: 1748556
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [1659380, 1746807) total of 87427 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.027 seconds
[default0]:    total number of samples: 907157
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.015043 seconds
[default0]:    number of documents: 29464287
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [27961608, 29434823) total of 1473215 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.019 seconds
[default0]:    total number of samples: 186675
[default0]:    total number of epochs: 12
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.007638 seconds
[default0]:    number of documents: 38304059
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [36350552, 38265755) total of 1915203 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.099 seconds
[default0]:    total number of samples: 333733
[default0]:    total number of epochs: 13
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002091 seconds
[default0]:    number of documents: 729667
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [692454, 728937) total of 36483 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.006 seconds
[default0]:    total number of samples: 98264
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.009394 seconds
[default0]:    number of documents: 24265522
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [23027980, 24241256) total of 1213276 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.020 seconds
[default0]:    total number of samples: 129080
[default0]:    total number of epochs: 6
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.007646 seconds
[default0]:    number of documents: 9587455
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [9098495, 9577868) total of 479373 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.012 seconds
[default0]:    total number of samples: 469042
[default0]:    total number of epochs: 3
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.006787 seconds
[default0]:    number of documents: 4335929
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [4114797, 4331593) total of 216796 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.016 seconds
[default0]:    total number of samples: 398209
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.000600 seconds
[default0]:    number of documents: 149731
[default0]: > dataset split:
[default0]:    valid:
[default0]:     document indices in [142095, 149581) total of 7486 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 1544
[default0]:    total number of epochs: 6
[default0]:> building indices for blendable datasets ...
[default0]: > sample ratios:
[default0]:   dataset 0, input: 0.0870676, achieved: 0.0870675
[default0]:   dataset 1, input: 0.0207314, achieved: 0.0207315
[default0]:   dataset 2, input: 0.1247, achieved: 0.1247
[default0]:   dataset 3, input: 0.124182, achieved: 0.124182
[default0]:   dataset 4, input: 0.0029046, achieved: 0.00290461
[default0]:   dataset 5, input: 0.1247, achieved: 0.1247
[default0]:   dataset 6, input: 0.0659275, achieved: 0.0659274
[default0]:   dataset 7, input: 0.120941, achieved: 0.120941
[default0]:   dataset 8, input: 0.0310665, achieved: 0.0310665
[default0]:   dataset 9, input: 0.0454631, achieved: 0.0454631
[default0]:   dataset 10, input: 0.127064, achieved: 0.127064
[default0]:   dataset 11, input: 0.1247, achieved: 0.1247
[default0]:   dataset 12, input: 0.000554406, achieved: 0.000554525
[default0]:> elapsed time for building blendable dataset indices: 0.09 (sec)
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002387 seconds
[default0]:    number of documents: 1276214
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [1274938, 1276214) total of 1276 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.010 seconds
[default0]:    total number of samples: 202915
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002280 seconds
[default0]:    number of documents: 2218089
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [2215871, 2218089) total of 2218 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 459
[default0]:    total number of epochs: 13
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002061 seconds
[default0]:    number of documents: 14716427
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [14701711, 14716427) total of 14716 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 37487
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001854 seconds
[default0]:    number of documents: 2767535
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [2764767, 2767535) total of 2768 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.003 seconds
[default0]:    total number of samples: 9926
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.006601 seconds
[default0]:    number of documents: 786245
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [785459, 786245) total of 786 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 79
[default0]:    total number of epochs: 4
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001729 seconds
[default0]:    number of documents: 1748556
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [1746807, 1748556) total of 1749 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 34096
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001585 seconds
[default0]:    number of documents: 29464287
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [29434823, 29464287) total of 29464 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 1645
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.007080 seconds
[default0]:    number of documents: 38304059
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [38265755, 38304059) total of 38304 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 2778
[default0]:    total number of epochs: 5
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002180 seconds
[default0]:    number of documents: 729667
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [728937, 729667) total of 730 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.008 seconds
[default0]:    total number of samples: 716
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001711 seconds
[default0]:    number of documents: 24265522
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [24241256, 24265522) total of 24266 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.003 seconds
[default0]:    total number of samples: 1312
[default0]:    total number of epochs: 3
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001763 seconds
[default0]:    number of documents: 9587455
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [9577868, 9587455) total of 9587 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 3324
[default0]:    total number of epochs: 2
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.001737 seconds
[default0]:    number of documents: 4335929
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [4331593, 4335929) total of 4336 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.004 seconds
[default0]:    total number of samples: 3964
[default0]:    total number of epochs: 1
[default0]: > building dataset index ...
[default0]:    reading sizes...
[default0]:    reading pointers...
[default0]:    reading document index...
[default0]:    creating numpy buffer of mmap...
[default0]:    creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.000679 seconds
[default0]:    number of documents: 149731
[default0]: > dataset split:
[default0]:    test:
[default0]:     document indices in [149581, 149731) total of 150 documents
[default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_doc_idx.npy
[default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_sample_idx.npy
[default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_shuffle_idx.npy
[default0]:    loaded indexed file in 0.002 seconds
[default0]:    total number of samples: 15
[default0]:    total number of epochs: 2
[default0]:> building indices for blendable datasets ...
[default0]: > sample ratios:
[default0]:   dataset 0, input: 0.0870676, achieved: 0.0870664
[default0]:   dataset 1, input: 0.0207314, achieved: 0.020733
[default0]:   dataset 2, input: 0.1247, achieved: 0.124699
[default0]:   dataset 3, input: 0.124182, achieved: 0.12418
[default0]:   dataset 4, input: 0.0029046, achieved: 0.0029059
[default0]:   dataset 5, input: 0.1247, achieved: 0.124699
[default0]:   dataset 6, input: 0.0659275, achieved: 0.0659284
[default0]:   dataset 7, input: 0.120941, achieved: 0.12094
[default0]:   dataset 8, input: 0.0310665, achieved: 0.0310676
[default0]:   dataset 9, input: 0.0454631, achieved: 0.0454632
[default0]:   dataset 10, input: 0.127064, achieved: 0.127063
[default0]:   dataset 11, input: 0.1247, achieved: 0.124699
[default0]:   dataset 12, input: 0.000554406, achieved: 0.000555736
[default0]:> elapsed time for building blendable dataset indices: 0.01 (sec)
[default0]:> finished creating GPT datasets ...
[default3]:[003-005] 177.6021B / 177.6021B
[default3]:[003-006] 177.6021B / 177.6021B
[default1]:[001-006] 177.6021B / 177.6021B
[default2]:[002-009] 177.6021B / 177.6021B
[default1]:[001-002] 177.6021B / 177.6021B
[default7]:time (ms) | model-and-optimizer-setup: 32159.25 | train/valid/test-data-iterators-setup: 12990.59
[default2]:[002-010] 177.6021B / 177.6021B
[default1]:[001-010] 177.6021B / 177.6021B
[default3]:[003-003] 177.6021B / 177.6021B
[default2]:[002-004] 177.6021B / 177.6021B
[default0]:[000-002] 177.6021B / 177.6021B
[default3]:[003-011] 191.1639B / 148.0045B
[default2]:[002-011] 191.1639B / 148.0045B
[default0]:[000-011] 191.1639B / 148.0045B
[default0]:[000-007] 177.6021B / 177.6021B
[default3]:[003-007] 177.6021B / 177.6021B
[default1]:[001-011] 191.1639B / 148.0045B
[default3]:[003-002] 177.6021B / 177.6021B
[default0]:[000-010] 177.6021B / 177.6021B
[default3]:[003-010] 177.6021B / 177.6021B
[default1]:[001-007] 177.6021B / 177.6021B
[default2]:[002-002] 177.6021B / 177.6021B
[default3]:[003-009] 177.6021B / 177.6021B
[default2]:[002-007] 177.6021B / 177.6021B
[default2]:[002-003] 177.6021B / 177.6021B
[default0]:[after dataloaders are built] datetime: 2022-03-04 04:09:51 
[default0]:done with setup ...
[default0]:training ...
[default0]:Number of parameters: [tensor rank - pipeline rank] w/ and w/o embeddings:
[default0]:[000-000] 191.1625B / 148.0031B
[default0]:[before the start of training step] datetime: 2022-03-04 04:09:51 
[default0]:[2022-03-04 04:09:51,644] [INFO] [checkpointing.py:547:forward] Activation Checkpointing Information
[default0]:[2022-03-04 04:09:51,644] [INFO] [checkpointing.py:548:forward] ----Partition Activations False, CPU CHECKPOINTING False
[default0]:[2022-03-04 04:09:51,644] [INFO] [checkpointing.py:551:forward] ----contiguous Memory Checkpointing False with 70 total layers
[default0]:[2022-03-04 04:09:51,644] [INFO] [checkpointing.py:554:forward] ----Synchronization False
[default0]:[2022-03-04 04:09:51,644] [INFO] [checkpointing.py:555:forward] ----Profiling time in checkpointing False
[default3]:[003-001] 177.6021B / 177.6021B
[default1]:[001-001] 177.6021B / 177.6021B
[default0]:[000-004] 177.6021B / 177.6021B
[default3]:[003-004] 177.6021B / 177.6021B
[default3]:[003-000] 191.1625B / 148.0031B
[default1]:[001-000] 191.1625B / 148.0031B
[default2]:[002-000] 191.1625B / 148.0031B
[default1]:[001-005] 177.6021B / 177.6021B
[default2]:[002-001] 177.6021B / 177.6021B
[default0]:[000-001] 177.6021B / 177.6021B
[default0]:[000-009] 177.6021B / 177.6021B
[default2]:[002-006] 177.6021B / 177.6021B
[default0]:[000-006] 177.6021B / 177.6021B
[default2]:[002-008] 177.6021B / 177.6021B
[default1]:[001-008] 177.6021B / 177.6021B
[default1]:[001-003] 177.6021B / 177.6021B
[default0]:[000-003] 177.6021B / 177.6021B
[default0]:[000-008] 177.6021B / 177.6021B
[default2]:[002-005] 177.6021B / 177.6021B
[default1]:[001-004] 177.6021B / 177.6021B
[default3]:[003-008] 177.6021B / 177.6021B
[default1]:[001-009] 177.6021B / 177.6021B
[default0]:[000-005] 177.6021B / 177.6021B
[default3]:[Rank 163] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 259] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 195] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default7]: iteration     4705/  128728 | consumed samples:        75280 | consumed tokens:    154173440 | elapsed time per iteration (s): 40.42 | learning rate: 2.467E-05 | global batch size:    16 | lm loss: 8.390673E+00 | grad norm: 1.463 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.396 | TFLOPs: 3.03 |
[default3]:[Rank 99] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 355] (after 4705 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0
[default3]:[Rank 227] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 67] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 291] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 323] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 35] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 131] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default3]:[Rank 3] (after 4705 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0
[default1]:[Rank 321] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 64] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 353] (after 4705 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0
[default0]:[Rank 352] (after 4705 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0
[default0]:[Rank 224] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 225] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 320] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 0] (after 4705 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0
[default0]:[Rank 128] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 1] (after 4705 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0
[default1]:[Rank 33] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 32] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 161] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 288] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 192] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 97] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 96] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 162] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 289] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 258] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 129] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 257] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 194] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 160] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default0]:[Rank 256] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 290] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 193] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default1]:[Rank 65] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 322] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 130] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 354] (after 4705 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0
[default2]:[Rank 66] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 2] (after 4705 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0
[default2]:[Rank 98] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 226] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default2]:[Rank 34] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0
[default7]: iteration     4706/  128728 | consumed samples:        75296 | consumed tokens:    154206208 | elapsed time per iteration (s): 13.98 | learning rate: 2.467E-05 | global batch size:    16 | lm loss: 5.303584E+00 | grad norm: 1.153 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.145 | TFLOPs: 8.76 |
[default7]: iteration     4707/  128728 | consumed samples:        75312 | consumed tokens:    154238976 | elapsed time per iteration (s): 13.74 | learning rate: 2.468E-05 | global batch size:    16 | lm loss: 5.203705E+00 | grad norm: 1.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.165 | TFLOPs: 8.92 |
[default7]: iteration     4708/  128728 | consumed samples:        75328 | consumed tokens:    154271744 | elapsed time per iteration (s): 13.70 | learning rate: 2.468E-05 | global batch size:    16 | lm loss: 5.036973E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.168 | TFLOPs: 8.94 |
[default7]: iteration     4709/  128728 | consumed samples:        75344 | consumed tokens:    154304512 | elapsed time per iteration (s): 13.87 | learning rate: 2.469E-05 | global batch size:    16 | lm loss: 5.276271E+00 | grad norm: 1.210 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.153 | TFLOPs: 8.83 |
[default7]: iteration     4710/  128728 | consumed samples:        75360 | consumed tokens:    154337280 | elapsed time per iteration (s): 13.75 | learning rate: 2.469E-05 | global batch size:    16 | lm loss: 5.234168E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.164 | TFLOPs: 8.91 |
[default7]: iteration     4711/  128728 | consumed samples:        75376 | consumed tokens:    154370048 | elapsed time per iteration (s): 13.73 | learning rate: 2.470E-05 | global batch size:    16 | lm loss: 5.284269E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.165 | TFLOPs: 8.92 |
[default7]: iteration     4712/  128728 | consumed samples:        75392 | consumed tokens:    154402816 | elapsed time per iteration (s): 13.68 | learning rate: 2.470E-05 | global batch size:    16 | lm loss: 5.290073E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.169 | TFLOPs: 8.95 |
[default7]: iteration     4713/  128728 | consumed samples:        75408 | consumed tokens:    154435584 | elapsed time per iteration (s): 13.82 | learning rate: 2.471E-05 | global batch size:    16 | lm loss: 5.294506E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.158 | TFLOPs: 8.87 |
[default7]: iteration     4714/  128728 | consumed samples:        75424 | consumed tokens:    154468352 | elapsed time per iteration (s): 13.68 | learning rate: 2.471E-05 | global batch size:    16 | lm loss: 5.210737E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.169 | TFLOPs: 8.95 |
[default7]: iteration     4715/  128728 | consumed samples:        75440 | consumed tokens:    154501120 | elapsed time per iteration (s): 13.72 | learning rate: 2.472E-05 | global batch size:    16 | lm loss: 4.925090E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.166 | TFLOPs: 8.93 |
[default7]: iteration     4716/  128728 | consumed samples:        75456 | consumed tokens:    154533888 | elapsed time per iteration (s): 13.80 | learning rate: 2.473E-05 | global batch size:    16 | lm loss: 5.171408E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.159 | TFLOPs: 8.88 |
[default7]: iteration     4717/  128728 | consumed samples:        75472 | consumed tokens:    154566656 | elapsed time per iteration (s): 13.74 | learning rate: 2.473E-05 | global batch size:    16 | lm loss: 5.223558E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.164 | TFLOPs: 8.91 |
[default7]: iteration     4718/  128728 | consumed samples:        75488 | consumed tokens:    154599424 | elapsed time per iteration (s): 13.72 | learning rate: 2.474E-05 | global batch size:    16 | lm loss: 5.274587E+00 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.166 | TFLOPs: 8.93 |
[default7]: iteration     4719/  128728 | consumed samples:        75504 | consumed tokens:    154632192 | elapsed time per iteration (s): 13.72 | learning rate: 2.474E-05 | global batch size:    16 | lm loss: 5.199393E+00 | grad norm: 1.240 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.166 | TFLOPs: 8.93 |
[default7]: iteration     4720/  128728 | consumed samples:        75520 | consumed tokens:    154664960 | elapsed time per iteration (s): 13.72 | learning rate: 2.475E-05 | global batch size:    16 | lm loss: 5.032928E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.167 | TFLOPs: 8.93 |
[default7]: iteration     4721/  128728 | consumed samples:        75536 | consumed tokens:    154697728 | elapsed time per iteration (s): 13.84 | learning rate: 2.475E-05 | global batch size:    16 | lm loss: 5.543484E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.156 | TFLOPs: 8.85 |
[default7]: iteration     4722/  128728 | consumed samples:        75552 | consumed tokens:    154730496 | elapsed time per iteration (s): 13.82 | learning rate: 2.476E-05 | global batch size:    16 | lm loss: 5.203832E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.158 | TFLOPs: 8.87 |
[default7]: iteration     4723/  128728 | consumed samples:        75568 | consumed tokens:    154763264 | elapsed time per iteration (s): 13.75 | learning rate: 2.476E-05 | global batch size:    16 | lm loss: 5.214847E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.164 | TFLOPs: 8.91 |
[default7]: iteration     4724/  128728 | consumed samples:        75584 | consumed tokens:    154796032 | elapsed time per iteration (s): 13.73 | learning rate: 2.477E-05 | global batch size:    16 | lm loss: 5.272194E+00 | grad norm: 3.048 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.166 | TFLOPs: 8.92 |
[default7]: iteration     4725/  128728 | consumed samples:        75600 | consumed tokens:    154828800 | elapsed time per iteration (s): 13.66 | learning rate: 2.477E-05 | global batch size:    16 | lm loss: 5.209924E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.171 | TFLOPs: 8.97 |
[default7]: iteration     4726/  128728 | consumed samples:        75616 | consumed tokens:    154861568 | elapsed time per iteration (s): 13.64 | learning rate: 2.478E-05 | global batch size:    16 | lm loss: 5.252506E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.173 | TFLOPs: 8.98 |
[default7]: iteration     4727/  128728 | consumed samples:        75632 | consumed tokens:    154894336 | elapsed time per iteration (s): 13.73 | learning rate: 2.478E-05 | global batch size:    16 | lm loss: 5.076056E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.166 | TFLOPs: 8.92 |
[default7]: iteration     4728/  128728 | consumed samples:        75648 | consumed tokens:    154927104 | elapsed time per iteration (s): 13.73 | learning rate: 2.479E-05 | global batch size:    16 | lm loss: 5.213652E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.165 | TFLOPs: 8.92 |
[default7]: iteration     4729/  128728 | consumed samples:        75664 | consumed tokens:    154959872 | elapsed time per iteration (s): 13.83 | learning rate: 2.479E-05 | global batch size:    16 | lm loss: 5.241081E+00 | grad norm: 1.233 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.157 | TFLOPs: 8.86 |
[default7]: iteration     4730/  128728 | consumed samples:        75680 | consumed tokens:    154992640 | elapsed time per iteration (s): 13.70 | learning rate: 2.480E-05 | global batch size:    16 | lm loss: 5.206524E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.168 | TFLOPs: 8.94 |
[default7]: iteration     4731/  128728 | consumed samples:        75696 | consumed tokens:    155025408 | elapsed time per iteration (s): 13.83 | learning rate: 2.480E-05 | global batch size:    16 | lm loss: 5.311900E+00 | grad norm: 1.464 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.157 | TFLOPs: 8.86 |
[default7]: iteration     4732/  128728 | consumed samples:        75712 | consumed tokens:    155058176 | elapsed time per iteration (s): 13.76 | learning rate: 2.481E-05 | global batch size:    16 | lm loss: 5.097121E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.163 | TFLOPs: 8.90 |
[default7]: iteration     4733/  128728 | consumed samples:        75728 | consumed tokens:    155090944 | elapsed time per iteration (s): 13.71 | learning rate: 2.481E-05 | global batch size:    16 | lm loss: 5.149732E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.167 | TFLOPs: 8.93 |
[default7]: iteration     4734/  128728 | consumed samples:        75744 | consumed tokens:    155123712 | elapsed time per iteration (s): 13.65 | learning rate: 2.482E-05 | global batch size:    16 | lm loss: 5.032346E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.172 | TFLOPs: 8.97 |
[default7]: iteration     4735/  128728 | consumed samples:        75760 | consumed tokens:    155156480 | elapsed time per iteration (s): 13.76 | learning rate: 2.483E-05 | global batch size:    16 | lm loss: 4.994672E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.163 | TFLOPs: 8.90 |
[default7]: iteration     4736/  128728 | consumed samples:        75776 | consumed tokens:    155189248 | elapsed time per iteration (s): 13.84 | learning rate: 2.483E-05 | global batch size:    16 | lm loss: 5.258005E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.156 | TFLOPs: 8.85 |
[default7]: iteration     4737/  128728 | consumed samples:        75792 | consumed tokens:    155222016 | elapsed time per iteration (s): 13.88 | learning rate: 2.484E-05 | global batch size:    16 | lm loss: 5.300239E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.153 | TFLOPs: 8.83 |
[default7]: iteration     4738/  128728 | consumed samples:        75808 | consumed tokens:    155254784 | elapsed time per iteration (s): 13.75 | learning rate: 2.484E-05 | global batch size:    16 | lm loss: 5.183598E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.164 | TFLOPs: 8.91 |
[default7]: iteration     4739/  128728 | consumed samples:        75824 | consumed tokens:    155287552 | elapsed time per iteration (s): 13.87 | learning rate: 2.485E-05 | global batch size:    16 | lm loss: 5.146806E+00 | grad norm: 1.094 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.154 | TFLOPs: 8.84 |
[default7]: iteration     4740/  128728 | consumed samples:        75840 | consumed tokens:    155320320 | elapsed time per iteration (s): 13.65 | learning rate: 2.485E-05 | global batch size:    16 | lm loss: 5.352815E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.172 | TFLOPs: 8.97 |
[default7]: iteration     4741/  128728 | consumed samples:        75856 | consumed tokens:    155353088 | elapsed time per iteration (s): 13.71 | learning rate: 2.486E-05 | global batch size:    16 | lm loss: 5.348001E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.167 | TFLOPs: 8.94 |
[default7]: iteration     4742/  128728 | consumed samples:        75872 | consumed tokens:    155385856 | elapsed time per iteration (s): 13.68 | learning rate: 2.486E-05 | global batch size:    16 | lm loss: 4.845537E+00 | grad norm: 1.299 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.169 | TFLOPs: 8.95 |
[default7]: iteration     4743/  128728 | consumed samples:        75888 | consumed tokens:    155418624 | elapsed time per iteration (s): 13.83 | learning rate: 2.487E-05 | global batch size:    16 | lm loss: 5.267847E+00 | grad norm: 1.306 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.157 | TFLOPs: 8.86 |
[default7]: iteration     4744/  128728 | consumed samples:        75904 | consumed tokens:    155451392 | elapsed time per iteration (s): 13.76 | learning rate: 2.487E-05 | global batch size:    16 | lm loss: 5.161267E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.163 | TFLOPs: 8.90 |
[default7]: iteration     4745/  128728 | consumed samples:        75920 | consumed tokens:    155484160 | elapsed time per iteration (s): 13.85 | learning rate: 2.488E-05 | global batch size:    16 | lm loss: 5.323788E+00 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.155 | TFLOPs: 8.85 |
[default7]: iteration     4746/  128728 | consumed samples:        75936 | consumed tokens:    155516928 | elapsed time per iteration (s): 13.75 | learning rate: 2.488E-05 | global batch size:    16 | lm loss: 5.108951E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.164 | TFLOPs: 8.91 |
[default7]: iteration     4747/  128728 | consumed samples:        75952 | consumed tokens:    155549696 | elapsed time per iteration (s): 13.84 | learning rate: 2.489E-05 | global batch size:    16 | lm loss: 5.174131E+00 | grad norm: 1.248 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.156 | TFLOPs: 8.85 |
[default7]: iteration     4748/  128728 | consumed samples:        75968 | consumed tokens:    155582464 | elapsed time per iteration (s): 13.83 | learning rate: 2.489E-05 | global batch size:    16 | lm loss: 5.362530E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.157 | TFLOPs: 8.86 |
[default7]: iteration     4749/  128728 | consumed samples:        75984 | consumed tokens:    155615232 | elapsed time per iteration (s): 13.82 | learning rate: 2.490E-05 | global batch size:    16 | lm loss: 5.456128E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.158 | TFLOPs: 8.87 |
[default7]: iteration     4750/  128728 | consumed samples:        76000 | consumed tokens:    155648000 | elapsed time per iteration (s): 13.82 | learning rate: 2.490E-05 | global batch size:    16 | lm loss: 5.163225E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.157 | TFLOPs: 8.86 |
[default7]: iteration     4751/  128728 | consumed samples:        76016 | consumed tokens:    155680768 | elapsed time per iteration (s): 13.78 | learning rate: 2.491E-05 | global batch size:    16 | lm loss: 5.049766E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.161 | TFLOPs: 8.89 |
[default7]: iteration     4752/  128728 | consumed samples:        76032 | consumed tokens:    155713536 | elapsed time per iteration (s): 13.81 | learning rate: 2.491E-05 | global batch size:    16 | lm loss: 5.226779E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.158 | TFLOPs: 8.87 |
[default7]: iteration     4753/  128728 | consumed samples:        76048 | consumed tokens:    155746304 | elapsed time per iteration (s): 13.84 | learning rate: 2.492E-05 | global batch size:    16 | lm loss: 4.977962E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.156 | TFLOPs: 8.85 |
[default7]: iteration     4754/  128728 | consumed samples:        76064 | consumed tokens:    155779072 | elapsed time per iteration (s): 13.73 | learning rate: 2.492E-05 | global batch size:    16 | lm loss: 5.137729E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.166 | TFLOPs: 8.92 |
[default7]: iteration     4755/  128728 | consumed samples:        76080 | consumed tokens:    155811840 | elapsed time per iteration (s): 13.66 | learning rate: 2.493E-05 | global batch size:    16 | lm loss: 5.145767E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.171 | TFLOPs: 8.97 |
[default7]: iteration     4756/  128728 | consumed samples:        76096 | consumed tokens:    155844608 | elapsed time per iteration (s): 13.77 | learning rate: 2.494E-05 | global batch size:    16 | lm loss: 5.172428E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.162 | TFLOPs: 8.90 |
[default7]: iteration     4757/  128728 | consumed samples:        76112 | consumed tokens:    155877376 | elapsed time per iteration (s): 13.81 | learning rate: 2.494E-05 | global batch size:    16 | lm loss: 5.208878E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.158 | TFLOPs: 8.87 |
[default7]: iteration     4758/  128728 | consumed samples:        76128 | consumed tokens:    155910144 | elapsed time per iteration (s): 13.84 | learning rate: 2.495E-05 | global batch size:    16 | lm loss: 5.108291E+00 | grad norm: 2.017 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.156 | TFLOPs: 8.85 |
[default7]: iteration     4759/  128728 | consumed samples:        76144 | consumed tokens:    155942912 | elapsed time per iteration (s): 13.79 | learning rate: 2.495E-05 | global batch size:    16 | lm loss: 5.342599E+00 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.160 | TFLOPs: 8.88 |
[default7]: iteration     4760/  128728 | consumed samples:        76160 | consumed tokens:    155975680 | elapsed time per iteration (s): 13.66 | learning rate: 2.496E-05 | global batch size:    16 | lm loss: 5.177962E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.172 | TFLOPs: 8.97 |
[default7]: iteration     4761/  128728 | consumed samples:        76176 | consumed tokens:    156008448 | elapsed time per iteration (s): 13.83 | learning rate: 2.496E-05 | global batch size:    16 | lm loss: 5.397847E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.157 | TFLOPs: 8.86 |
[default7]: iteration     4762/  128728 | consumed samples:        76192 | consumed tokens:    156041216 | elapsed time per iteration (s): 13.65 | learning rate: 2.497E-05 | global batch size:    16 | lm loss: 5.027542E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.172 | TFLOPs: 8.97 |
[default7]: iteration     4763/  128728 | consumed samples:        76208 | consumed tokens:    156073984 | elapsed time per iteration (s): 13.85 | learning rate: 2.497E-05 | global batch size:    16 | lm loss: 4.952395E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.156 | TFLOPs: 8.85 |
[default7]: iteration     4764/  128728 | consumed samples:        76224 | consumed tokens:    156106752 | elapsed time per iteration (s): 13.74 | learning rate: 2.498E-05 | global batch size:    16 | lm loss: 5.375393E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.165 | TFLOPs: 8.92 |
[default7]: iteration     4765/  128728 | consumed samples:        76240 | consumed tokens:    156139520 | elapsed time per iteration (s): 13.85 | learning rate: 2.498E-05 | global batch size:    16 | lm loss: 5.178174E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.155 | TFLOPs: 8.84 |
[default7]: iteration     4766/  128728 | consumed samples:        76256 | consumed tokens:    156172288 | elapsed time per iteration (s): 13.84 | learning rate: 2.499E-05 | global batch size:    16 | lm loss: 5.174879E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.156 | TFLOPs: 8.85 |
[default7]: iteration     4767/  128728 | consumed samples:        76272 | consumed tokens:    156205056 | elapsed time per iteration (s): 13.82 | learning rate: 2.499E-05 | global batch size:    16 | lm loss: 5.215740E+00 | grad norm: 1.283 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.158 | TFLOPs: 8.86 |
[default7]: iteration     4768/  128728 | consumed samples:        76288 | consumed tokens:    156237824 | elapsed time per iteration (s): 13.67 | learning rate: 2.500E-05 | global batch size:    16 | lm loss: 5.455339E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.171 | TFLOPs: 8.96 |
[default7]: iteration     4769/  128728 | consumed samples:        76304 | consumed tokens:    156270592 | elapsed time per iteration (s): 13.79 | learning rate: 2.500E-05 | global batch size:    16 | lm loss: 4.930388E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.160 | TFLOPs: 8.88 |
[default7]: iteration     4770/  128728 | consumed samples:        76320 | consumed tokens:    156303360 | elapsed time per iteration (s): 13.84 | learning rate: 2.501E-05 | global batch size:    16 | lm loss: 4.997752E+00 | grad norm: 1.218 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.156 | TFLOPs: 8.85 |
[default7]: iteration     4771/  128728 | consumed samples:        76336 | consumed tokens:    156336128 | elapsed time per iteration (s): 13.83 | learning rate: 2.501E-05 | global batch size:    16 | lm loss: 5.173059E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.157 | TFLOPs: 8.85 |
[default7]: iteration     4772/  128728 | consumed samples:        76352 | consumed tokens:    156368896 | elapsed time per iteration (s): 14.24 | learning rate: 2.502E-05 | global batch size:    16 | lm loss: 5.054476E+00 | grad norm: 0.613 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.124 | TFLOPs: 8.60 |
[default7]: iteration     4773/  128728 | consumed samples:        76368 | consumed tokens:    156401664 | elapsed time per iteration (s): 13.73 | learning rate: 2.502E-05 | global batch size:    16 | lm loss: 5.099241E+00 | grad norm: 0.969 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.165 | TFLOPs: 8.92 |
[default7]: iteration     4774/  128728 | consumed samples:        76384 | consumed tokens:    156434432 | elapsed time per iteration (s): 13.79 | learning rate: 2.503E-05 | global batch size:    16 | lm loss: 5.027586E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.160 | TFLOPs: 8.88 |
[default7]: iteration     4775/  128728 | consumed samples:        76400 | consumed tokens:    156467200 | elapsed time per iteration (s): 13.80 | learning rate: 2.503E-05 | global batch size:    16 | lm loss: 5.055077E+00 | grad norm: 1.007 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.159 | TFLOPs: 8.88 |
[default7]: iteration     4776/  128728 | consumed samples:        76416 | consumed tokens:    156499968 | elapsed time per iteration (s): 13.86 | learning rate: 2.504E-05 | global batch size:    16 | lm loss: 4.901511E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.154 | TFLOPs: 8.84 |
[default7]: iteration     4777/  128728 | consumed samples:        76432 | consumed tokens:    156532736 | elapsed time per iteration (s): 13.96 | learning rate: 2.505E-05 | global batch size:    16 | lm loss: 5.218966E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.146 | TFLOPs: 8.78 |
[default7]: iteration     4778/  128728 | consumed samples:        76448 | consumed tokens:    156565504 | elapsed time per iteration (s): 13.89 | learning rate: 2.505E-05 | global batch size:    16 | lm loss: 5.255514E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.152 | TFLOPs: 8.82 |
[default7]: iteration     4779/  128728 | consumed samples:        76464 | consumed tokens:    156598272 | elapsed time per iteration (s): 13.82 | learning rate: 2.506E-05 | global batch size:    16 | lm loss: 4.949065E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.158 | TFLOPs: 8.87 |
[default7]: iteration     4780/  128728 | consumed samples:        76480 | consumed tokens:    156631040 | elapsed time per iteration (s): 13.70 | learning rate: 2.506E-05 | global batch size:    16 | lm loss: 4.956588E+00 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.168 | TFLOPs: 8.94 |
[default7]: iteration     4781/  128728 | consumed samples:        76496 | consumed tokens:    156663808 | elapsed time per iteration (s): 13.71 | learning rate: 2.507E-05 | global batch size:    16 | lm loss: 5.024817E+00 | grad norm: 1.694 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.167 | TFLOPs: 8.94 |
[default7]: iteration     4782/  128728 | consumed samples:        76512 | consumed tokens:    156696576 | elapsed time per iteration (s): 13.68 | learning rate: 2.507E-05 | global batch size:    16 | lm loss: 5.319356E+00 | grad norm: 1.751 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.169 | TFLOPs: 8.95 |
[default7]: iteration     4783/  128728 | consumed samples:        76528 | consumed tokens:    156729344 | elapsed time per iteration (s): 13.80 | learning rate: 2.508E-05 | global batch size:    16 | lm loss: 5.366149E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.159 | TFLOPs: 8.87 |
[default7]: iteration     4784/  128728 | consumed samples:        76544 | consumed tokens:    156762112 | elapsed time per iteration (s): 13.68 | learning rate: 2.508E-05 | global batch size:    16 | lm loss: 5.334771E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.170 | TFLOPs: 8.96 |
[default7]: iteration     4785/  128728 | consumed samples:        76560 | consumed tokens:    156794880 | elapsed time per iteration (s): 13.81 | learning rate: 2.509E-05 | global batch size:    16 | lm loss: 5.220145E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.159 | TFLOPs: 8.87 |
[default7]: iteration     4786/  128728 | consumed samples:        76576 | consumed tokens:    156827648 | elapsed time per iteration (s): 13.84 | learning rate: 2.509E-05 | global batch size:    16 | lm loss: 5.085683E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.156 | TFLOPs: 8.85 |
[default7]: iteration     4787/  128728 | consumed samples:        76592 | consumed tokens:    156860416 | elapsed time per iteration (s): 13.79 | learning rate: 2.510E-05 | global batch size:    16 | lm loss: 5.058179E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.160 | TFLOPs: 8.88 |
[default7]: iteration     4788/  128728 | consumed samples:        76608 | consumed tokens:    156893184 | elapsed time per iteration (s): 13.82 | learning rate: 2.510E-05 | global batch size:    16 | lm loss: 5.208087E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.158 | TFLOPs: 8.86 |
[default7]: iteration     4789/  128728 | consumed samples:        76624 | consumed tokens:    156925952 | elapsed time per iteration (s): 13.73 | learning rate: 2.511E-05 | global batch size:    16 | lm loss: 5.153974E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.165 | TFLOPs: 8.92 |
[default7]: iteration     4790/  128728 | consumed samples:        76640 | consumed tokens:    156958720 | elapsed time per iteration (s): 13.81 | learning rate: 2.511E-05 | global batch size:    16 | lm loss: 5.186059E+00 | grad norm: 0.659 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.158 | TFLOPs: 8.87 |
[default7]: iteration     4791/  128728 | consumed samples:        76656 | consumed tokens:    156991488 | elapsed time per iteration (s): 13.73 | learning rate: 2.512E-05 | global batch size:    16 | lm loss: 5.013607E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.165 | TFLOPs: 8.92 |
[default7]: iteration     4792/  128728 | consumed samples:        76672 | consumed tokens:    157024256 | elapsed time per iteration (s): 13.79 | learning rate: 2.512E-05 | global batch size:    16 | lm loss: 5.210199E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.160 | TFLOPs: 8.88 |
[default7]: iteration     4793/  128728 | consumed samples:        76688 | consumed tokens:    157057024 | elapsed time per iteration (s): 13.66 | learning rate: 2.513E-05 | global batch size:    16 | lm loss: 5.175740E+00 | grad norm: 0.947 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.171 | TFLOPs: 8.97 |
[default7]: iteration     4794/  128728 | consumed samples:        76704 | consumed tokens:    157089792 | elapsed time per iteration (s): 13.70 | learning rate: 2.513E-05 | global batch size:    16 | lm loss: 5.095262E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.168 | TFLOPs: 8.94 |
[default7]: iteration     4795/  128728 | consumed samples:        76720 | consumed tokens:    157122560 | elapsed time per iteration (s): 13.79 | learning rate: 2.514E-05 | global batch size:    16 | lm loss: 4.972818E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.160 | TFLOPs: 8.88 |
[default7]: iteration     4796/  128728 | consumed samples:        76736 | consumed tokens:    157155328 | elapsed time per iteration (s): 13.74 | learning rate: 2.514E-05 | global batch size:    16 | lm loss: 5.033150E+00 | grad norm: 1.104 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.165 | TFLOPs: 8.92 |
[default7]: iteration     4797/  128728 | consumed samples:        76752 | consumed tokens:    157188096 | elapsed time per iteration (s): 13.87 | learning rate: 2.515E-05 | global batch size:    16 | lm loss: 5.181136E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.154 | TFLOPs: 8.83 |
[default7]: iteration     4798/  128728 | consumed samples:        76768 | consumed tokens:    157220864 | elapsed time per iteration (s): 13.75 | learning rate: 2.516E-05 | global batch size:    16 | lm loss: 5.075924E+00 | grad norm: 1.012 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.163 | TFLOPs: 8.91 |
[default7]: iteration     4799/  128728 | consumed samples:        76784 | consumed tokens:    157253632 | elapsed time per iteration (s): 13.72 | learning rate: 2.516E-05 | global batch size:    16 | lm loss: 4.798205E+00 | grad norm: 2.782 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.166 | TFLOPs: 8.93 |
[default7]: iteration     4800/  128728 | consumed samples:        76800 | consumed tokens:    157286400 | elapsed time per iteration (s): 13.86 | learning rate: 2.517E-05 | global batch size:    16 | lm loss: 5.076591E+00 | grad norm: 2.086 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.155 | TFLOPs: 8.84 |
[default7]: iteration     4801/  128728 | consumed samples:        76816 | consumed tokens:    157319168 | elapsed time per iteration (s): 13.85 | learning rate: 2.517E-05 | global batch size:    16 | lm loss: 5.293148E+00 | grad norm: 3.280 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.155 | TFLOPs: 8.84 |
[default7]: iteration     4802/  128728 | consumed samples:        76832 | consumed tokens:    157351936 | elapsed time per iteration (s): 13.82 | learning rate: 2.518E-05 | global batch size:    16 | lm loss: 5.133687E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.157 | TFLOPs: 8.86 |
[default7]: iteration     4803/  128728 | consumed samples:        76848 | consumed tokens:    157384704 | elapsed time per iteration (s): 13.65 | learning rate: 2.518E-05 | global batch size:    16 | lm loss: 5.139082E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.172 | TFLOPs: 8.97 |
[default7]: iteration     4804/  128728 | consumed samples:        76864 | consumed tokens:    157417472 | elapsed time per iteration (s): 13.81 | learning rate: 2.519E-05 | global batch size:    16 | lm loss: 5.191136E+00 | grad norm: 1.038 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.159 | TFLOPs: 8.87 |
[default7]: iteration     4805/  128728 | consumed samples:        76880 | consumed tokens:    157450240 | elapsed time per iteration (s): 14.27 | learning rate: 2.519E-05 | global batch size:    16 | lm loss: 5.444860E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.121 | TFLOPs: 8.59 |
[default7]: iteration     4806/  128728 | consumed samples:        76896 | consumed tokens:    157483008 | elapsed time per iteration (s): 13.63 | learning rate: 2.520E-05 | global batch size:    16 | lm loss: 5.277452E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 1.174 | TFLOPs: 8.99 |
[default7]: iteration     4807/  128728 | consumed samples:        76928 | consumed tokens:    157548544 | elapsed time per iteration (s): 14.41 | learning rate: 2.521E-05 | global batch size:    32 | lm loss: 5.110476E+00 | grad norm: 0.556 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.220 | TFLOPs: 17.00 |
[default7]: iteration     4808/  128728 | consumed samples:        76960 | consumed tokens:    157614080 | elapsed time per iteration (s): 14.44 | learning rate: 2.522E-05 | global batch size:    32 | lm loss: 5.159946E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.216 | TFLOPs: 16.97 |
[default7]: iteration     4809/  128728 | consumed samples:        76992 | consumed tokens:    157679616 | elapsed time per iteration (s): 14.37 | learning rate: 2.523E-05 | global batch size:    32 | lm loss: 5.098501E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.227 | TFLOPs: 17.05 |
[default7]: iteration     4810/  128728 | consumed samples:        77024 | consumed tokens:    157745152 | elapsed time per iteration (s): 14.41 | learning rate: 2.524E-05 | global batch size:    32 | lm loss: 5.236533E+00 | grad norm: 0.541 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.221 | TFLOPs: 17.00 |
[default7]: iteration     4811/  128728 | consumed samples:        77056 | consumed tokens:    157810688 | elapsed time per iteration (s): 14.48 | learning rate: 2.525E-05 | global batch size:    32 | lm loss: 5.184154E+00 | grad norm: 0.506 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.210 | TFLOPs: 16.92 |
[default7]: iteration     4812/  128728 | consumed samples:        77088 | consumed tokens:    157876224 | elapsed time per iteration (s): 14.52 | learning rate: 2.526E-05 | global batch size:    32 | lm loss: 5.250757E+00 | grad norm: 0.555 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.203 | TFLOPs: 16.87 |
[default7]: iteration     4813/  128728 | consumed samples:        77120 | consumed tokens:    157941760 | elapsed time per iteration (s): 14.41 | learning rate: 2.527E-05 | global batch size:    32 | lm loss: 5.150150E+00 | grad norm: 0.669 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.221 | TFLOPs: 17.01 |
[default7]: iteration     4814/  128728 | consumed samples:        77152 | consumed tokens:    158007296 | elapsed time per iteration (s): 14.37 | learning rate: 2.528E-05 | global batch size:    32 | lm loss: 4.867079E+00 | grad norm: 0.546 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.226 | TFLOPs: 17.05 |
[default7]: iteration     4815/  128728 | consumed samples:        77184 | consumed tokens:    158072832 | elapsed time per iteration (s): 14.35 | learning rate: 2.529E-05 | global batch size:    32 | lm loss: 5.103984E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.231 | TFLOPs: 17.08 |
[default7]: iteration     4816/  128728 | consumed samples:        77216 | consumed tokens:    158138368 | elapsed time per iteration (s): 14.41 | learning rate: 2.530E-05 | global batch size:    32 | lm loss: 5.172581E+00 | grad norm: 0.523 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.220 | TFLOPs: 17.00 |
[default7]: iteration     4817/  128728 | consumed samples:        77248 | consumed tokens:    158203904 | elapsed time per iteration (s): 14.34 | learning rate: 2.531E-05 | global batch size:    32 | lm loss: 5.039461E+00 | grad norm: 0.612 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.232 | TFLOPs: 17.09 |
[default7]: iteration     4818/  128728 | consumed samples:        77280 | consumed tokens:    158269440 | elapsed time per iteration (s): 14.45 | learning rate: 2.532E-05 | global batch size:    32 | lm loss: 5.033366E+00 | grad norm: 0.591 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.214 | TFLOPs: 16.95 |
[default7]: iteration     4819/  128728 | consumed samples:        77312 | consumed tokens:    158334976 | elapsed time per iteration (s): 14.52 | learning rate: 2.533E-05 | global batch size:    32 | lm loss: 5.019548E+00 | grad norm: 0.507 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.203 | TFLOPs: 16.87 |
[default7]: iteration     4820/  128728 | consumed samples:        77344 | consumed tokens:    158400512 | elapsed time per iteration (s): 14.47 | learning rate: 2.534E-05 | global batch size:    32 | lm loss: 5.029814E+00 | grad norm: 0.528 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.212 | TFLOPs: 16.94 |
[default7]: iteration     4821/  128728 | consumed samples:        77376 | consumed tokens:    158466048 | elapsed time per iteration (s): 14.48 | learning rate: 2.535E-05 | global batch size:    32 | lm loss: 5.075526E+00 | grad norm: 0.533 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.210 | TFLOPs: 16.92 |
[default7]: iteration     4822/  128728 | consumed samples:        77408 | consumed tokens:    158531584 | elapsed time per iteration (s): 14.49 | learning rate: 2.537E-05 | global batch size:    32 | lm loss: 5.179887E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.209 | TFLOPs: 16.91 |
[default7]: iteration     4823/  128728 | consumed samples:        77440 | consumed tokens:    158597120 | elapsed time per iteration (s): 14.46 | learning rate: 2.538E-05 | global batch size:    32 | lm loss: 4.963607E+00 | grad norm: 0.576 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.213 | TFLOPs: 16.94 |
[default7]: iteration     4824/  128728 | consumed samples:        77472 | consumed tokens:    158662656 | elapsed time per iteration (s): 14.29 | learning rate: 2.539E-05 | global batch size:    32 | lm loss: 5.011718E+00 | grad norm: 0.528 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.239 | TFLOPs: 17.14 |
[default7]: iteration     4825/  128728 | consumed samples:        77504 | consumed tokens:    158728192 | elapsed time per iteration (s): 14.49 | learning rate: 2.540E-05 | global batch size:    32 | lm loss: 4.995124E+00 | grad norm: 0.552 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.91 |
[default7]: iteration     4826/  128728 | consumed samples:        77536 | consumed tokens:    158793728 | elapsed time per iteration (s): 14.44 | learning rate: 2.541E-05 | global batch size:    32 | lm loss: 5.081669E+00 | grad norm: 0.486 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.217 | TFLOPs: 16.97 |
[default7]: iteration     4827/  128728 | consumed samples:        77568 | consumed tokens:    158859264 | elapsed time per iteration (s): 14.53 | learning rate: 2.542E-05 | global batch size:    32 | lm loss: 5.067815E+00 | grad norm: 0.541 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.203 | TFLOPs: 16.87 |
[default7]: iteration     4828/  128728 | consumed samples:        77600 | consumed tokens:    158924800 | elapsed time per iteration (s): 14.45 | learning rate: 2.543E-05 | global batch size:    32 | lm loss: 4.991805E+00 | grad norm: 0.560 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.215 | TFLOPs: 16.96 |
[default7]: iteration     4829/  128728 | consumed samples:        77632 | consumed tokens:    158990336 | elapsed time per iteration (s): 14.37 | learning rate: 2.544E-05 | global batch size:    32 | lm loss: 5.154213E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.227 | TFLOPs: 17.05 |
[default7]: iteration     4830/  128728 | consumed samples:        77664 | consumed tokens:    159055872 | elapsed time per iteration (s): 14.53 | learning rate: 2.545E-05 | global batch size:    32 | lm loss: 4.978602E+00 | grad norm: 0.477 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.202 | TFLOPs: 16.86 |
[default7]: iteration     4831/  128728 | consumed samples:        77696 | consumed tokens:    159121408 | elapsed time per iteration (s): 14.47 | learning rate: 2.546E-05 | global batch size:    32 | lm loss: 4.966110E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.212 | TFLOPs: 16.93 |
[default7]: iteration     4832/  128728 | consumed samples:        77728 | consumed tokens:    159186944 | elapsed time per iteration (s): 14.36 | learning rate: 2.547E-05 | global batch size:    32 | lm loss: 4.906848E+00 | grad norm: 0.501 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.228 | TFLOPs: 17.06 |
[default7]: iteration     4833/  128728 | consumed samples:        77760 | consumed tokens:    159252480 | elapsed time per iteration (s): 14.40 | learning rate: 2.548E-05 | global batch size:    32 | lm loss: 4.992458E+00 | grad norm: 0.605 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.222 | TFLOPs: 17.01 |
[default7]: iteration     4834/  128728 | consumed samples:        77792 | consumed tokens:    159318016 | elapsed time per iteration (s): 14.37 | learning rate: 2.549E-05 | global batch size:    32 | lm loss: 4.984800E+00 | grad norm: 0.570 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.227 | TFLOPs: 17.05 |
[default7]: iteration     4835/  128728 | consumed samples:        77824 | consumed tokens:    159383552 | elapsed time per iteration (s): 14.50 | learning rate: 2.550E-05 | global batch size:    32 | lm loss: 5.221433E+00 | grad norm: 0.495 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.90 |
[default7]: iteration     4836/  128728 | consumed samples:        77856 | consumed tokens:    159449088 | elapsed time per iteration (s): 14.65 | learning rate: 2.551E-05 | global batch size:    32 | lm loss: 4.936250E+00 | grad norm: 0.615 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.185 | TFLOPs: 16.73 |
[default7]: iteration     4837/  128728 | consumed samples:        77888 | consumed tokens:    159514624 | elapsed time per iteration (s): 14.55 | learning rate: 2.552E-05 | global batch size:    32 | lm loss: 4.874154E+00 | grad norm: 0.537 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.200 | TFLOPs: 16.84 |
[default7]: iteration     4838/  128728 | consumed samples:        77920 | consumed tokens:    159580160 | elapsed time per iteration (s): 14.53 | learning rate: 2.553E-05 | global batch size:    32 | lm loss: 5.190948E+00 | grad norm: 0.435 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.203 | TFLOPs: 16.87 |
[default7]: iteration     4839/  128728 | consumed samples:        77952 | consumed tokens:    159645696 | elapsed time per iteration (s): 14.42 | learning rate: 2.554E-05 | global batch size:    32 | lm loss: 5.015795E+00 | grad norm: 4.281 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.220 | TFLOPs: 16.99 |
[default7]: iteration     4840/  128728 | consumed samples:        77984 | consumed tokens:    159711232 | elapsed time per iteration (s): 14.44 | learning rate: 2.555E-05 | global batch size:    32 | lm loss: 5.077456E+00 | grad norm: 0.479 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.216 | TFLOPs: 16.97 |
[default7]: iteration     4841/  128728 | consumed samples:        78016 | consumed tokens:    159776768 | elapsed time per iteration (s): 14.35 | learning rate: 2.556E-05 | global batch size:    32 | lm loss: 5.229739E+00 | grad norm: 0.519 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.229 | TFLOPs: 17.07 |
[default7]: iteration     4842/  128728 | consumed samples:        78048 | consumed tokens:    159842304 | elapsed time per iteration (s): 14.39 | learning rate: 2.557E-05 | global batch size:    32 | lm loss: 5.039967E+00 | grad norm: 0.505 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.224 | TFLOPs: 17.03 |
[default7]: iteration     4843/  128728 | consumed samples:        78080 | consumed tokens:    159907840 | elapsed time per iteration (s): 14.51 | learning rate: 2.559E-05 | global batch size:    32 | lm loss: 5.084831E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.205 | TFLOPs: 16.88 |
[default7]: iteration     4844/  128728 | consumed samples:        78112 | consumed tokens:    159973376 | elapsed time per iteration (s): 14.48 | learning rate: 2.560E-05 | global batch size:    32 | lm loss: 4.989566E+00 | grad norm: 0.514 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.210 | TFLOPs: 16.92 |
[default7]: iteration     4845/  128728 | consumed samples:        78144 | consumed tokens:    160038912 | elapsed time per iteration (s): 14.33 | learning rate: 2.561E-05 | global batch size:    32 | lm loss: 4.973344E+00 | grad norm: 0.531 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.233 | TFLOPs: 17.09 |
[default7]: iteration     4846/  128728 | consumed samples:        78176 | consumed tokens:    160104448 | elapsed time per iteration (s): 14.37 | learning rate: 2.562E-05 | global batch size:    32 | lm loss: 5.007797E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.227 | TFLOPs: 17.05 |
[default7]: iteration     4847/  128728 | consumed samples:        78208 | consumed tokens:    160169984 | elapsed time per iteration (s): 14.49 | learning rate: 2.563E-05 | global batch size:    32 | lm loss: 5.095990E+00 | grad norm: 0.499 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.91 |
[default7]: iteration     4848/  128728 | consumed samples:        78240 | consumed tokens:    160235520 | elapsed time per iteration (s): 14.37 | learning rate: 2.564E-05 | global batch size:    32 | lm loss: 5.174461E+00 | grad norm: 0.525 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.227 | TFLOPs: 17.05 |
[default7]: iteration     4849/  128728 | consumed samples:        78272 | consumed tokens:    160301056 | elapsed time per iteration (s): 14.49 | learning rate: 2.565E-05 | global batch size:    32 | lm loss: 5.072275E+00 | grad norm: 0.447 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.209 | TFLOPs: 16.91 |
[default7]: iteration     4850/  128728 | consumed samples:        78304 | consumed tokens:    160366592 | elapsed time per iteration (s): 14.40 | learning rate: 2.566E-05 | global batch size:    32 | lm loss: 4.968595E+00 | grad norm: 0.489 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.223 | TFLOPs: 17.02 |
[default7]: iteration     4851/  128728 | consumed samples:        78336 | consumed tokens:    160432128 | elapsed time per iteration (s): 14.42 | learning rate: 2.567E-05 | global batch size:    32 | lm loss: 5.029985E+00 | grad norm: 0.944 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.219 | TFLOPs: 16.99 |
[default7]: iteration     4852/  128728 | consumed samples:        78368 | consumed tokens:    160497664 | elapsed time per iteration (s): 14.38 | learning rate: 2.568E-05 | global batch size:    32 | lm loss: 4.903277E+00 | grad norm: 0.552 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.225 | TFLOPs: 17.04 |
[default7]: iteration     4853/  128728 | consumed samples:        78400 | consumed tokens:    160563200 | elapsed time per iteration (s): 14.40 | learning rate: 2.569E-05 | global batch size:    32 | lm loss: 5.001978E+00 | grad norm: 0.441 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.222 | TFLOPs: 17.01 |
[default7]: iteration     4854/  128728 | consumed samples:        78432 | consumed tokens:    160628736 | elapsed time per iteration (s): 14.35 | learning rate: 2.570E-05 | global batch size:    32 | lm loss: 4.934483E+00 | grad norm: 0.468 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.229 | TFLOPs: 17.07 |
[default7]: iteration     4855/  128728 | consumed samples:        78464 | consumed tokens:    160694272 | elapsed time per iteration (s): 14.33 | learning rate: 2.571E-05 | global batch size:    32 | lm loss: 4.979787E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.234 | TFLOPs: 17.10 |
[default7]: iteration     4856/  128728 | consumed samples:        78496 | consumed tokens:    160759808 | elapsed time per iteration (s): 14.42 | learning rate: 2.572E-05 | global batch size:    32 | lm loss: 4.876790E+00 | grad norm: 0.576 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.219 | TFLOPs: 16.99 |
[default7]: iteration     4857/  128728 | consumed samples:        78528 | consumed tokens:    160825344 | elapsed time per iteration (s): 14.50 | learning rate: 2.573E-05 | global batch size:    32 | lm loss: 5.032100E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.207 | TFLOPs: 16.90 |
[default7]: iteration     4858/  128728 | consumed samples:        78560 | consumed tokens:    160890880 | elapsed time per iteration (s): 14.39 | learning rate: 2.574E-05 | global batch size:    32 | lm loss: 4.934223E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.225 | TFLOPs: 17.03 |
[default7]: iteration     4859/  128728 | consumed samples:        78592 | consumed tokens:    160956416 | elapsed time per iteration (s): 14.98 | learning rate: 2.575E-05 | global batch size:    32 | lm loss: 4.762866E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.136 | TFLOPs: 16.35 |
[default7]: iteration     4860/  128728 | consumed samples:        78624 | consumed tokens:    161021952 | elapsed time per iteration (s): 14.42 | learning rate: 2.576E-05 | global batch size:    32 | lm loss: 5.198421E+00 | grad norm: 0.506 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.219 | TFLOPs: 16.99 |
[default7]: iteration     4861/  128728 | consumed samples:        78656 | consumed tokens:    161087488 | elapsed time per iteration (s): 14.40 | learning rate: 2.577E-05 | global batch size:    32 | lm loss: 4.902623E+00 | grad norm: 0.532 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.223 | TFLOPs: 17.02 |
[default7]: iteration     4862/  128728 | consumed samples:        78688 | consumed tokens:    161153024 | elapsed time per iteration (s): 14.43 | learning rate: 2.578E-05 | global batch size:    32 | lm loss: 4.889926E+00 | grad norm: 1.259 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.218 | TFLOPs: 16.98 |
[default7]: iteration     4863/  128728 | consumed samples:        78720 | consumed tokens:    161218560 | elapsed time per iteration (s): 14.52 | learning rate: 2.580E-05 | global batch size:    32 | lm loss: 4.984774E+00 | grad norm: 0.465 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.203 | TFLOPs: 16.87 |
[default7]: iteration     4864/  128728 | consumed samples:        78752 | consumed tokens:    161284096 | elapsed time per iteration (s): 14.36 | learning rate: 2.581E-05 | global batch size:    32 | lm loss: 4.985258E+00 | grad norm: 0.614 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.228 | TFLOPs: 17.06 |
[default7]: iteration     4865/  128728 | consumed samples:        78784 | consumed tokens:    161349632 | elapsed time per iteration (s): 14.49 | learning rate: 2.582E-05 | global batch size:    32 | lm loss: 5.015450E+00 | grad norm: 0.651 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.91 |
[default7]: iteration     4866/  128728 | consumed samples:        78816 | consumed tokens:    161415168 | elapsed time per iteration (s): 14.46 | learning rate: 2.583E-05 | global batch size:    32 | lm loss: 4.877583E+00 | grad norm: 0.511 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.213 | TFLOPs: 16.94 |
[default7]: iteration     4867/  128728 | consumed samples:        78848 | consumed tokens:    161480704 | elapsed time per iteration (s): 14.46 | learning rate: 2.584E-05 | global batch size:    32 | lm loss: 5.241075E+00 | grad norm: 0.617 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.213 | TFLOPs: 16.94 |
[default7]: iteration     4868/  128728 | consumed samples:        78880 | consumed tokens:    161546240 | elapsed time per iteration (s): 14.48 | learning rate: 2.585E-05 | global batch size:    32 | lm loss: 4.912823E+00 | grad norm: 0.481 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.211 | TFLOPs: 16.92 |
[default7]: iteration     4869/  128728 | consumed samples:        78912 | consumed tokens:    161611776 | elapsed time per iteration (s): 14.97 | learning rate: 2.586E-05 | global batch size:    32 | lm loss: 4.992380E+00 | grad norm: 0.570 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.138 | TFLOPs: 16.37 |
[default7]: iteration     4870/  128728 | consumed samples:        78944 | consumed tokens:    161677312 | elapsed time per iteration (s): 14.45 | learning rate: 2.587E-05 | global batch size:    32 | lm loss: 5.035939E+00 | grad norm: 0.463 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.214 | TFLOPs: 16.95 |
[default7]: iteration     4871/  128728 | consumed samples:        78976 | consumed tokens:    161742848 | elapsed time per iteration (s): 14.36 | learning rate: 2.588E-05 | global batch size:    32 | lm loss: 4.827978E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.229 | TFLOPs: 17.07 |
[default7]: iteration     4872/  128728 | consumed samples:        79008 | consumed tokens:    161808384 | elapsed time per iteration (s): 14.41 | learning rate: 2.589E-05 | global batch size:    32 | lm loss: 4.985816E+00 | grad norm: 0.479 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.220 | TFLOPs: 17.00 |
[default7]: iteration     4873/  128728 | consumed samples:        79040 | consumed tokens:    161873920 | elapsed time per iteration (s): 14.40 | learning rate: 2.590E-05 | global batch size:    32 | lm loss: 4.936251E+00 | grad norm: 0.564 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.223 | TFLOPs: 17.02 |
[default7]: iteration     4874/  128728 | consumed samples:        79072 | consumed tokens:    161939456 | elapsed time per iteration (s): 14.52 | learning rate: 2.591E-05 | global batch size:    32 | lm loss: 4.892041E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.203 | TFLOPs: 16.87 |
[default7]: iteration     4875/  128728 | consumed samples:        79104 | consumed tokens:    162004992 | elapsed time per iteration (s): 14.65 | learning rate: 2.592E-05 | global batch size:    32 | lm loss: 4.844186E+00 | grad norm: 0.546 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.184 | TFLOPs: 16.72 |
[default7]: iteration     4876/  128728 | consumed samples:        79136 | consumed tokens:    162070528 | elapsed time per iteration (s): 14.39 | learning rate: 2.593E-05 | global batch size:    32 | lm loss: 5.113724E+00 | grad norm: 0.485 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.223 | TFLOPs: 17.02 |
[default7]: iteration     4877/  128728 | consumed samples:        79168 | consumed tokens:    162136064 | elapsed time per iteration (s): 14.43 | learning rate: 2.594E-05 | global batch size:    32 | lm loss: 5.039042E+00 | grad norm: 0.543 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.217 | TFLOPs: 16.97 |
[default7]: iteration     4878/  128728 | consumed samples:        79200 | consumed tokens:    162201600 | elapsed time per iteration (s): 14.45 | learning rate: 2.595E-05 | global batch size:    32 | lm loss: 5.142283E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.215 | TFLOPs: 16.96 |
[default7]: iteration     4879/  128728 | consumed samples:        79232 | consumed tokens:    162267136 | elapsed time per iteration (s): 14.42 | learning rate: 2.596E-05 | global batch size:    32 | lm loss: 4.902722E+00 | grad norm: 0.561 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.220 | TFLOPs: 17.00 |
[default7]: iteration     4880/  128728 | consumed samples:        79264 | consumed tokens:    162332672 | elapsed time per iteration (s): 14.45 | learning rate: 2.597E-05 | global batch size:    32 | lm loss: 4.755108E+00 | grad norm: 1.026 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.215 | TFLOPs: 16.96 |
[default7]: iteration     4881/  128728 | consumed samples:        79296 | consumed tokens:    162398208 | elapsed time per iteration (s): 14.46 | learning rate: 2.598E-05 | global batch size:    32 | lm loss: 4.935410E+00 | grad norm: 0.542 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.213 | TFLOPs: 16.94 |
[default7]: iteration     4882/  128728 | consumed samples:        79328 | consumed tokens:    162463744 | elapsed time per iteration (s): 14.34 | learning rate: 2.599E-05 | global batch size:    32 | lm loss: 5.047359E+00 | grad norm: 0.530 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.231 | TFLOPs: 17.08 |
[default7]: iteration     4883/  128728 | consumed samples:        79360 | consumed tokens:    162529280 | elapsed time per iteration (s): 14.42 | learning rate: 2.600E-05 | global batch size:    32 | lm loss: 4.720992E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.219 | TFLOPs: 16.99 |
[default7]: iteration     4884/  128728 | consumed samples:        79392 | consumed tokens:    162594816 | elapsed time per iteration (s): 14.49 | learning rate: 2.602E-05 | global batch size:    32 | lm loss: 4.991364E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.209 | TFLOPs: 16.91 |
[default7]: iteration     4885/  128728 | consumed samples:        79424 | consumed tokens:    162660352 | elapsed time per iteration (s): 14.50 | learning rate: 2.603E-05 | global batch size:    32 | lm loss: 4.920027E+00 | grad norm: 0.495 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.207 | TFLOPs: 16.90 |
[default7]: iteration     4886/  128728 | consumed samples:        79456 | consumed tokens:    162725888 | elapsed time per iteration (s): 14.48 | learning rate: 2.604E-05 | global batch size:    32 | lm loss: 4.976588E+00 | grad norm: 0.508 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.210 | TFLOPs: 16.92 |
[default7]: iteration     4887/  128728 | consumed samples:        79488 | consumed tokens:    162791424 | elapsed time per iteration (s): 14.48 | learning rate: 2.605E-05 | global batch size:    32 | lm loss: 4.887921E+00 | grad norm: 0.499 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.210 | TFLOPs: 16.92 |
[default7]: iteration     4888/  128728 | consumed samples:        79520 | consumed tokens:    162856960 | elapsed time per iteration (s): 14.47 | learning rate: 2.606E-05 | global batch size:    32 | lm loss: 4.977568E+00 | grad norm: 0.608 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.211 | TFLOPs: 16.93 |
[default7]: iteration     4889/  128728 | consumed samples:        79552 | consumed tokens:    162922496 | elapsed time per iteration (s): 14.85 | learning rate: 2.607E-05 | global batch size:    32 | lm loss: 5.021329E+00 | grad norm: 0.459 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.154 | TFLOPs: 16.49 |
[default7]: iteration     4890/  128728 | consumed samples:        79584 | consumed tokens:    162988032 | elapsed time per iteration (s): 14.50 | learning rate: 2.608E-05 | global batch size:    32 | lm loss: 5.071374E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.90 |
[default7]: iteration     4891/  128728 | consumed samples:        79616 | consumed tokens:    163053568 | elapsed time per iteration (s): 14.71 | learning rate: 2.609E-05 | global batch size:    32 | lm loss: 5.032025E+00 | grad norm: 1.138 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.175 | TFLOPs: 16.66 |
[default7]: iteration     4892/  128728 | consumed samples:        79648 | consumed tokens:    163119104 | elapsed time per iteration (s): 14.49 | learning rate: 2.610E-05 | global batch size:    32 | lm loss: 5.014086E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.90 |
[default7]: iteration     4893/  128728 | consumed samples:        79680 | consumed tokens:    163184640 | elapsed time per iteration (s): 14.49 | learning rate: 2.611E-05 | global batch size:    32 | lm loss: 4.995523E+00 | grad norm: 0.495 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.90 |
[default7]: iteration     4894/  128728 | consumed samples:        79712 | consumed tokens:    163250176 | elapsed time per iteration (s): 14.42 | learning rate: 2.612E-05 | global batch size:    32 | lm loss: 4.978586E+00 | grad norm: 0.593 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.219 | TFLOPs: 16.99 |
[default7]: iteration     4895/  128728 | consumed samples:        79744 | consumed tokens:    163315712 | elapsed time per iteration (s): 14.41 | learning rate: 2.613E-05 | global batch size:    32 | lm loss: 4.924498E+00 | grad norm: 0.585 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.220 | TFLOPs: 17.00 |
[default7]: iteration     4896/  128728 | consumed samples:        79776 | consumed tokens:    163381248 | elapsed time per iteration (s): 14.66 | learning rate: 2.614E-05 | global batch size:    32 | lm loss: 4.915054E+00 | grad norm: 0.567 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.182 | TFLOPs: 16.71 |
[default7]: iteration     4897/  128728 | consumed samples:        79808 | consumed tokens:    163446784 | elapsed time per iteration (s): 14.46 | learning rate: 2.615E-05 | global batch size:    32 | lm loss: 5.211232E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.213 | TFLOPs: 16.94 |
[default7]: iteration     4898/  128728 | consumed samples:        79840 | consumed tokens:    163512320 | elapsed time per iteration (s): 14.91 | learning rate: 2.616E-05 | global batch size:    32 | lm loss: 4.795869E+00 | grad norm: 0.541 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.146 | TFLOPs: 16.43 |
[default7]: iteration     4899/  128728 | consumed samples:        79872 | consumed tokens:    163577856 | elapsed time per iteration (s): 14.43 | learning rate: 2.617E-05 | global batch size:    32 | lm loss: 4.986282E+00 | grad norm: 0.561 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.218 | TFLOPs: 16.98 |
[default7]: iteration     4900/  128728 | consumed samples:        79904 | consumed tokens:    163643392 | elapsed time per iteration (s): 14.44 | learning rate: 2.618E-05 | global batch size:    32 | lm loss: 5.010670E+00 | grad norm: 0.482 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.216 | TFLOPs: 16.96 |
[default7]: iteration     4901/  128728 | consumed samples:        79936 | consumed tokens:    163708928 | elapsed time per iteration (s): 14.42 | learning rate: 2.619E-05 | global batch size:    32 | lm loss: 4.981537E+00 | grad norm: 0.545 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.219 | TFLOPs: 16.99 |
[default7]: iteration     4902/  128728 | consumed samples:        79968 | consumed tokens:    163774464 | elapsed time per iteration (s): 14.41 | learning rate: 2.620E-05 | global batch size:    32 | lm loss: 5.037811E+00 | grad norm: 0.544 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.220 | TFLOPs: 17.00 |
[default7]: iteration     4903/  128728 | consumed samples:        80000 | consumed tokens:    163840000 | elapsed time per iteration (s): 14.50 | learning rate: 2.621E-05 | global batch size:    32 | lm loss: 5.033264E+00 | grad norm: 0.535 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.207 | TFLOPs: 16.90 |
[default7]: iteration     4904/  128728 | consumed samples:        80032 | consumed tokens:    163905536 | elapsed time per iteration (s): 14.46 | learning rate: 2.622E-05 | global batch size:    32 | lm loss: 4.824915E+00 | grad norm: 0.552 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.213 | TFLOPs: 16.94 |
[default7]: iteration     4905/  128728 | consumed samples:        80064 | consumed tokens:    163971072 | elapsed time per iteration (s): 14.31 | learning rate: 2.624E-05 | global batch size:    32 | lm loss: 5.107170E+00 | grad norm: 0.627 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.237 | TFLOPs: 17.13 |
[default7]: iteration     4906/  128728 | consumed samples:        80096 | consumed tokens:    164036608 | elapsed time per iteration (s): 14.43 | learning rate: 2.625E-05 | global batch size:    32 | lm loss: 5.018471E+00 | grad norm: 0.586 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.217 | TFLOPs: 16.97 |
[default7]: iteration     4907/  128728 | consumed samples:        80128 | consumed tokens:    164102144 | elapsed time per iteration (s): 14.34 | learning rate: 2.626E-05 | global batch size:    32 | lm loss: 4.920955E+00 | grad norm: 0.483 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.232 | TFLOPs: 17.09 |
[default7]: iteration     4908/  128728 | consumed samples:        80160 | consumed tokens:    164167680 | elapsed time per iteration (s): 14.40 | learning rate: 2.627E-05 | global batch size:    32 | lm loss: 4.959438E+00 | grad norm: 0.531 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.222 | TFLOPs: 17.01 |
[default7]: iteration     4909/  128728 | consumed samples:        80192 | consumed tokens:    164233216 | elapsed time per iteration (s): 14.44 | learning rate: 2.628E-05 | global batch size:    32 | lm loss: 4.835641E+00 | grad norm: 1.725 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.216 | TFLOPs: 16.96 |
[default7]: iteration     4910/  128728 | consumed samples:        80224 | consumed tokens:    164298752 | elapsed time per iteration (s): 14.33 | learning rate: 2.629E-05 | global batch size:    32 | lm loss: 5.024042E+00 | grad norm: 0.484 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.233 | TFLOPs: 17.10 |
[default7]: iteration     4911/  128728 | consumed samples:        80256 | consumed tokens:    164364288 | elapsed time per iteration (s): 14.50 | learning rate: 2.630E-05 | global batch size:    32 | lm loss: 4.906248E+00 | grad norm: 0.459 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.90 |
[default7]: iteration     4912/  128728 | consumed samples:        80288 | consumed tokens:    164429824 | elapsed time per iteration (s): 14.49 | learning rate: 2.631E-05 | global batch size:    32 | lm loss: 5.058978E+00 | grad norm: 0.466 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.91 |
[default7]: iteration     4913/  128728 | consumed samples:        80320 | consumed tokens:    164495360 | elapsed time per iteration (s): 14.67 | learning rate: 2.632E-05 | global batch size:    32 | lm loss: 4.788593E+00 | grad norm: 0.613 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.182 | TFLOPs: 16.70 |
[default7]: iteration     4914/  128728 | consumed samples:        80352 | consumed tokens:    164560896 | elapsed time per iteration (s): 14.45 | learning rate: 2.633E-05 | global batch size:    32 | lm loss: 5.040935E+00 | grad norm: 0.546 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.215 | TFLOPs: 16.96 |
[default7]: iteration     4915/  128728 | consumed samples:        80384 | consumed tokens:    164626432 | elapsed time per iteration (s): 14.40 | learning rate: 2.634E-05 | global batch size:    32 | lm loss: 4.802517E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.223 | TFLOPs: 17.02 |
[default7]: iteration     4916/  128728 | consumed samples:        80416 | consumed tokens:    164691968 | elapsed time per iteration (s): 14.37 | learning rate: 2.635E-05 | global batch size:    32 | lm loss: 4.939359E+00 | grad norm: 0.618 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.226 | TFLOPs: 17.04 |
[default7]: iteration     4917/  128728 | consumed samples:        80448 | consumed tokens:    164757504 | elapsed time per iteration (s): 14.40 | learning rate: 2.636E-05 | global batch size:    32 | lm loss: 4.907125E+00 | grad norm: 0.481 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.222 | TFLOPs: 17.01 |
[default7]: iteration     4918/  128728 | consumed samples:        80480 | consumed tokens:    164823040 | elapsed time per iteration (s): 14.47 | learning rate: 2.637E-05 | global batch size:    32 | lm loss: 4.967450E+00 | grad norm: 0.447 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.211 | TFLOPs: 16.93 |
[default7]: iteration     4919/  128728 | consumed samples:        80512 | consumed tokens:    164888576 | elapsed time per iteration (s): 14.49 | learning rate: 2.638E-05 | global batch size:    32 | lm loss: 4.866137E+00 | grad norm: 0.494 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.91 |
[default7]: iteration     4920/  128728 | consumed samples:        80544 | consumed tokens:    164954112 | elapsed time per iteration (s): 14.42 | learning rate: 2.639E-05 | global batch size:    32 | lm loss: 4.921837E+00 | grad norm: 0.535 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.219 | TFLOPs: 16.99 |
[default7]: iteration     4921/  128728 | consumed samples:        80576 | consumed tokens:    165019648 | elapsed time per iteration (s): 14.42 | learning rate: 2.640E-05 | global batch size:    32 | lm loss: 4.895585E+00 | grad norm: 0.542 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.220 | TFLOPs: 17.00 |
[default7]: iteration     4922/  128728 | consumed samples:        80608 | consumed tokens:    165085184 | elapsed time per iteration (s): 14.48 | learning rate: 2.641E-05 | global batch size:    32 | lm loss: 5.026771E+00 | grad norm: 0.449 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.209 | TFLOPs: 16.91 |
[default7]: iteration     4923/  128728 | consumed samples:        80640 | consumed tokens:    165150720 | elapsed time per iteration (s): 14.50 | learning rate: 2.642E-05 | global batch size:    32 | lm loss: 4.946136E+00 | grad norm: 0.500 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.207 | TFLOPs: 16.90 |
[default7]: iteration     4924/  128728 | consumed samples:        80672 | consumed tokens:    165216256 | elapsed time per iteration (s): 14.49 | learning rate: 2.643E-05 | global batch size:    32 | lm loss: 5.127237E+00 | grad norm: 0.516 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.91 |
[default7]: iteration     4925/  128728 | consumed samples:        80704 | consumed tokens:    165281792 | elapsed time per iteration (s): 14.44 | learning rate: 2.645E-05 | global batch size:    32 | lm loss: 4.908696E+00 | grad norm: 0.496 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.216 | TFLOPs: 16.97 |
[default7]: iteration     4926/  128728 | consumed samples:        80736 | consumed tokens:    165347328 | elapsed time per iteration (s): 14.53 | learning rate: 2.646E-05 | global batch size:    32 | lm loss: 4.818577E+00 | grad norm: 0.490 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.202 | TFLOPs: 16.86 |
[default7]: iteration     4927/  128728 | consumed samples:        80768 | consumed tokens:    165412864 | elapsed time per iteration (s): 14.36 | learning rate: 2.647E-05 | global batch size:    32 | lm loss: 5.022501E+00 | grad norm: 0.487 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.229 | TFLOPs: 17.07 |
[default7]: iteration     4928/  128728 | consumed samples:        80800 | consumed tokens:    165478400 | elapsed time per iteration (s): 14.63 | learning rate: 2.648E-05 | global batch size:    32 | lm loss: 5.095439E+00 | grad norm: 0.474 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.188 | TFLOPs: 16.75 |
[default7]: iteration     4929/  128728 | consumed samples:        80832 | consumed tokens:    165543936 | elapsed time per iteration (s): 14.56 | learning rate: 2.649E-05 | global batch size:    32 | lm loss: 4.867841E+00 | grad norm: 0.532 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.198 | TFLOPs: 16.83 |
[default7]: iteration     4930/  128728 | consumed samples:        80864 | consumed tokens:    165609472 | elapsed time per iteration (s): 14.47 | learning rate: 2.650E-05 | global batch size:    32 | lm loss: 4.895494E+00 | grad norm: 0.597 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.211 | TFLOPs: 16.93 |
[default7]: iteration     4931/  128728 | consumed samples:        80896 | consumed tokens:    165675008 | elapsed time per iteration (s): 14.39 | learning rate: 2.651E-05 | global batch size:    32 | lm loss: 4.990711E+00 | grad norm: 0.545 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.224 | TFLOPs: 17.03 |
[default7]: iteration     4932/  128728 | consumed samples:        80928 | consumed tokens:    165740544 | elapsed time per iteration (s): 14.48 | learning rate: 2.652E-05 | global batch size:    32 | lm loss: 5.046138E+00 | grad norm: 0.447 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.211 | TFLOPs: 16.93 |
[default7]: iteration     4933/  128728 | consumed samples:        80960 | consumed tokens:    165806080 | elapsed time per iteration (s): 14.43 | learning rate: 2.653E-05 | global batch size:    32 | lm loss: 4.994790E+00 | grad norm: 0.660 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.218 | TFLOPs: 16.98 |
[default7]: iteration     4934/  128728 | consumed samples:        80992 | consumed tokens:    165871616 | elapsed time per iteration (s): 14.52 | learning rate: 2.654E-05 | global batch size:    32 | lm loss: 4.952404E+00 | grad norm: 0.516 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.204 | TFLOPs: 16.87 |
[default7]: iteration     4935/  128728 | consumed samples:        81024 | consumed tokens:    165937152 | elapsed time per iteration (s): 14.44 | learning rate: 2.655E-05 | global batch size:    32 | lm loss: 5.107005E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.216 | TFLOPs: 16.96 |
[default7]: iteration     4936/  128728 | consumed samples:        81056 | consumed tokens:    166002688 | elapsed time per iteration (s): 14.76 | learning rate: 2.656E-05 | global batch size:    32 | lm loss: 4.917874E+00 | grad norm: 0.472 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.168 | TFLOPs: 16.60 |
[default7]: iteration     4937/  128728 | consumed samples:        81088 | consumed tokens:    166068224 | elapsed time per iteration (s): 14.49 | learning rate: 2.657E-05 | global batch size:    32 | lm loss: 4.967998E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.209 | TFLOPs: 16.91 |
[default7]: iteration     4938/  128728 | consumed samples:        81120 | consumed tokens:    166133760 | elapsed time per iteration (s): 14.43 | learning rate: 2.658E-05 | global batch size:    32 | lm loss: 4.925600E+00 | grad norm: 0.492 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.218 | TFLOPs: 16.98 |
[default7]: iteration     4939/  128728 | consumed samples:        81152 | consumed tokens:    166199296 | elapsed time per iteration (s): 14.57 | learning rate: 2.659E-05 | global batch size:    32 | lm loss: 4.884789E+00 | grad norm: 0.462 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.197 | TFLOPs: 16.82 |
[default7]: iteration     4940/  128728 | consumed samples:        81184 | consumed tokens:    166264832 | elapsed time per iteration (s): 14.37 | learning rate: 2.660E-05 | global batch size:    32 | lm loss: 4.857765E+00 | grad norm: 0.567 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.227 | TFLOPs: 17.05 |
[default7]: iteration     4941/  128728 | consumed samples:        81216 | consumed tokens:    166330368 | elapsed time per iteration (s): 14.75 | learning rate: 2.661E-05 | global batch size:    32 | lm loss: 4.846112E+00 | grad norm: 0.543 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.170 | TFLOPs: 16.61 |
[default7]: iteration     4942/  128728 | consumed samples:        81248 | consumed tokens:    166395904 | elapsed time per iteration (s): 14.46 | learning rate: 2.662E-05 | global batch size:    32 | lm loss: 5.160878E+00 | grad norm: 0.499 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.214 | TFLOPs: 16.95 |
[default7]: iteration     4943/  128728 | consumed samples:        81280 | consumed tokens:    166461440 | elapsed time per iteration (s): 14.51 | learning rate: 2.663E-05 | global batch size:    32 | lm loss: 5.023970E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.205 | TFLOPs: 16.88 |
[default7]: iteration     4944/  128728 | consumed samples:        81312 | consumed tokens:    166526976 | elapsed time per iteration (s): 14.48 | learning rate: 2.664E-05 | global batch size:    32 | lm loss: 4.885333E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.209 | TFLOPs: 16.91 |
[default7]: iteration     4945/  128728 | consumed samples:        81344 | consumed tokens:    166592512 | elapsed time per iteration (s): 14.35 | learning rate: 2.665E-05 | global batch size:    32 | lm loss: 4.871268E+00 | grad norm: 0.612 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.230 | TFLOPs: 17.08 |
[default7]: iteration     4946/  128728 | consumed samples:        81376 | consumed tokens:    166658048 | elapsed time per iteration (s): 14.47 | learning rate: 2.667E-05 | global batch size:    32 | lm loss: 5.091561E+00 | grad norm: 0.596 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.211 | TFLOPs: 16.93 |
[default7]: iteration     4947/  128728 | consumed samples:        81408 | consumed tokens:    166723584 | elapsed time per iteration (s): 14.41 | learning rate: 2.668E-05 | global batch size:    32 | lm loss: 5.002282E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.220 | TFLOPs: 17.00 |
[default7]: iteration     4948/  128728 | consumed samples:        81440 | consumed tokens:    166789120 | elapsed time per iteration (s): 14.34 | learning rate: 2.669E-05 | global batch size:    32 | lm loss: 4.788313E+00 | grad norm: 0.660 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.232 | TFLOPs: 17.09 |
[default7]: iteration     4949/  128728 | consumed samples:        81472 | consumed tokens:    166854656 | elapsed time per iteration (s): 14.49 | learning rate: 2.670E-05 | global batch size:    32 | lm loss: 4.989025E+00 | grad norm: 0.515 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.91 |
[default7]: iteration     4950/  128728 | consumed samples:        81504 | consumed tokens:    166920192 | elapsed time per iteration (s): 14.41 | learning rate: 2.671E-05 | global batch size:    32 | lm loss: 4.957456E+00 | grad norm: 0.542 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.221 | TFLOPs: 17.00 |
[default7]: iteration     4951/  128728 | consumed samples:        81536 | consumed tokens:    166985728 | elapsed time per iteration (s): 14.53 | learning rate: 2.672E-05 | global batch size:    32 | lm loss: 4.846943E+00 | grad norm: 0.478 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.203 | TFLOPs: 16.87 |
[default7]: iteration     4952/  128728 | consumed samples:        81568 | consumed tokens:    167051264 | elapsed time per iteration (s): 14.39 | learning rate: 2.673E-05 | global batch size:    32 | lm loss: 5.056949E+00 | grad norm: 0.601 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.224 | TFLOPs: 17.03 |
[default7]: iteration     4953/  128728 | consumed samples:        81600 | consumed tokens:    167116800 | elapsed time per iteration (s): 14.53 | learning rate: 2.674E-05 | global batch size:    32 | lm loss: 5.134397E+00 | grad norm: 0.462 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.203 | TFLOPs: 16.87 |
[default7]: iteration     4954/  128728 | consumed samples:        81632 | consumed tokens:    167182336 | elapsed time per iteration (s): 14.46 | learning rate: 2.675E-05 | global batch size:    32 | lm loss: 5.089641E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.214 | TFLOPs: 16.95 |
[default7]: iteration     4955/  128728 | consumed samples:        81664 | consumed tokens:    167247872 | elapsed time per iteration (s): 14.40 | learning rate: 2.676E-05 | global batch size:    32 | lm loss: 4.803981E+00 | grad norm: 0.487 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.223 | TFLOPs: 17.02 |
[default7]: iteration     4956/  128728 | consumed samples:        81696 | consumed tokens:    167313408 | elapsed time per iteration (s): 14.46 | learning rate: 2.677E-05 | global batch size:    32 | lm loss: 4.882299E+00 | grad norm: 0.565 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.214 | TFLOPs: 16.95 |
[default7]: iteration     4957/  128728 | consumed samples:        81728 | consumed tokens:    167378944 | elapsed time per iteration (s): 14.52 | learning rate: 2.678E-05 | global batch size:    32 | lm loss: 4.925550E+00 | grad norm: 0.562 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.203 | TFLOPs: 16.87 |
[default7]: iteration     4958/  128728 | consumed samples:        81760 | consumed tokens:    167444480 | elapsed time per iteration (s): 14.47 | learning rate: 2.679E-05 | global batch size:    32 | lm loss: 4.976944E+00 | grad norm: 0.503 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.211 | TFLOPs: 16.93 |
[default7]: iteration     4959/  128728 | consumed samples:        81792 | consumed tokens:    167510016 | elapsed time per iteration (s): 14.44 | learning rate: 2.680E-05 | global batch size:    32 | lm loss: 4.880012E+00 | grad norm: 0.553 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.217 | TFLOPs: 16.97 |
[default7]: iteration     4960/  128728 | consumed samples:        81824 | consumed tokens:    167575552 | elapsed time per iteration (s): 14.31 | learning rate: 2.681E-05 | global batch size:    32 | lm loss: 4.842023E+00 | grad norm: 0.565 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.237 | TFLOPs: 17.13 |
[default7]: iteration     4961/  128728 | consumed samples:        81856 | consumed tokens:    167641088 | elapsed time per iteration (s): 14.45 | learning rate: 2.682E-05 | global batch size:    32 | lm loss: 4.906799E+00 | grad norm: 1.398 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.215 | TFLOPs: 16.96 |
[default7]: iteration     4962/  128728 | consumed samples:        81888 | consumed tokens:    167706624 | elapsed time per iteration (s): 14.37 | learning rate: 2.683E-05 | global batch size:    32 | lm loss: 5.035071E+00 | grad norm: 1.697 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.226 | TFLOPs: 17.04 |
[default7]: iteration     4963/  128728 | consumed samples:        81920 | consumed tokens:    167772160 | elapsed time per iteration (s): 14.49 | learning rate: 2.684E-05 | global batch size:    32 | lm loss: 4.912130E+00 | grad norm: 0.547 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.91 |
[default7]: iteration     4964/  128728 | consumed samples:        81952 | consumed tokens:    167837696 | elapsed time per iteration (s): 14.78 | learning rate: 2.685E-05 | global batch size:    32 | lm loss: 4.826226E+00 | grad norm: 0.514 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.165 | TFLOPs: 16.58 |
[default7]: iteration     4965/  128728 | consumed samples:        81984 | consumed tokens:    167903232 | elapsed time per iteration (s): 14.65 | learning rate: 2.686E-05 | global batch size:    32 | lm loss: 4.870893E+00 | grad norm: 0.498 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.184 | TFLOPs: 16.72 |
[default7]: iteration     4966/  128728 | consumed samples:        82016 | consumed tokens:    167968768 | elapsed time per iteration (s): 14.47 | learning rate: 2.688E-05 | global batch size:    32 | lm loss: 4.855809E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.212 | TFLOPs: 16.93 |
[default7]: iteration     4967/  128728 | consumed samples:        82048 | consumed tokens:    168034304 | elapsed time per iteration (s): 14.40 | learning rate: 2.689E-05 | global batch size:    32 | lm loss: 5.050081E+00 | grad norm: 0.540 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.223 | TFLOPs: 17.02 |
[default7]: iteration     4968/  128728 | consumed samples:        82080 | consumed tokens:    168099840 | elapsed time per iteration (s): 14.46 | learning rate: 2.690E-05 | global batch size:    32 | lm loss: 4.922202E+00 | grad norm: 0.543 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.213 | TFLOPs: 16.95 |
[default7]: iteration     4969/  128728 | consumed samples:        82112 | consumed tokens:    168165376 | elapsed time per iteration (s): 14.40 | learning rate: 2.691E-05 | global batch size:    32 | lm loss: 4.749779E+00 | grad norm: 0.614 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.222 | TFLOPs: 17.01 |
[default7]: iteration     4970/  128728 | consumed samples:        82144 | consumed tokens:    168230912 | elapsed time per iteration (s): 14.47 | learning rate: 2.692E-05 | global batch size:    32 | lm loss: 4.917465E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.211 | TFLOPs: 16.93 |
[default7]: iteration     4971/  128728 | consumed samples:        82176 | consumed tokens:    168296448 | elapsed time per iteration (s): 14.66 | learning rate: 2.693E-05 | global batch size:    32 | lm loss: 4.817117E+00 | grad norm: 0.604 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.183 | TFLOPs: 16.72 |
[default7]: iteration     4972/  128728 | consumed samples:        82208 | consumed tokens:    168361984 | elapsed time per iteration (s): 14.46 | learning rate: 2.694E-05 | global batch size:    32 | lm loss: 4.984338E+00 | grad norm: 0.438 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.213 | TFLOPs: 16.94 |
[default7]: iteration     4973/  128728 | consumed samples:        82240 | consumed tokens:    168427520 | elapsed time per iteration (s): 14.69 | learning rate: 2.695E-05 | global batch size:    32 | lm loss: 4.920941E+00 | grad norm: 0.513 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.178 | TFLOPs: 16.67 |
[default7]: iteration     4974/  128728 | consumed samples:        82272 | consumed tokens:    168493056 | elapsed time per iteration (s): 14.38 | learning rate: 2.696E-05 | global batch size:    32 | lm loss: 4.996977E+00 | grad norm: 0.565 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.225 | TFLOPs: 17.04 |
[default7]: iteration     4975/  128728 | consumed samples:        82304 | consumed tokens:    168558592 | elapsed time per iteration (s): 14.49 | learning rate: 2.697E-05 | global batch size:    32 | lm loss: 4.976236E+00 | grad norm: 0.542 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.209 | TFLOPs: 16.91 |
[default7]: iteration     4976/  128728 | consumed samples:        82336 | consumed tokens:    168624128 | elapsed time per iteration (s): 14.49 | learning rate: 2.698E-05 | global batch size:    32 | lm loss: 4.962553E+00 | grad norm: 0.533 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.91 |
[default7]: iteration     4977/  128728 | consumed samples:        82368 | consumed tokens:    168689664 | elapsed time per iteration (s): 14.66 | learning rate: 2.699E-05 | global batch size:    32 | lm loss: 4.922136E+00 | grad norm: 0.471 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.182 | TFLOPs: 16.71 |
[default7]: iteration     4978/  128728 | consumed samples:        82400 | consumed tokens:    168755200 | elapsed time per iteration (s): 14.49 | learning rate: 2.700E-05 | global batch size:    32 | lm loss: 4.775953E+00 | grad norm: 0.470 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.208 | TFLOPs: 16.91 |
[default7]: iteration     4979/  128728 | consumed samples:        82432 | consumed tokens:    168820736 | elapsed time per iteration (s): 14.36 | learning rate: 2.701E-05 | global batch size:    32 | lm loss: 4.841665E+00 | grad norm: 0.487 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.228 | TFLOPs: 17.06 |
[default7]: iteration     4980/  128728 | consumed samples:        82464 | consumed tokens:    168886272 | elapsed time per iteration (s): 14.41 | learning rate: 2.702E-05 | global batch size:    32 | lm loss: 4.885078E+00 | grad norm: 0.477 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.221 | TFLOPs: 17.01 |
[default7]: iteration     4981/  128728 | consumed samples:        82496 | consumed tokens:    168951808 | elapsed time per iteration (s): 14.67 | learning rate: 2.703E-05 | global batch size:    32 | lm loss: 4.872721E+00 | grad norm: 0.550 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.181 | TFLOPs: 16.70 |
[default7]: iteration     4982/  128728 | consumed samples:        82528 | consumed tokens:    169017344 | elapsed time per iteration (s): 14.43 | learning rate: 2.704E-05 | global batch size:    32 | lm loss: 4.986514E+00 | grad norm: 0.547 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.218 | TFLOPs: 16.98 |
[default7]: iteration     4983/  128728 | consumed samples:        82560 | consumed tokens:    169082880 | elapsed time per iteration (s): 14.35 | learning rate: 2.705E-05 | global batch size:    32 | lm loss: 4.904243E+00 | grad norm: 0.485 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.230 | TFLOPs: 17.07 |