lm1-misc/146m174b100m/3319491.out
Add
f9fc05c
Model parameters: d_model 768 ffw_size 3072 kv_size 64 n_heads 12 n_layers 15
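The "146m" in the run name can be sanity-checked against the model parameters above. The sketch below is a rough estimate only. It assumes a standard GPT-2-style block (fused QKV projection, biases on every linear layer, two LayerNorms per block), tied input/output embeddings, learned absolute position embeddings, and the padded vocabulary size of 50304 reported further down in this log. The function name `estimate_gpt_params` is hypothetical and is not part of Megatron-DeepSpeed.

```python
def estimate_gpt_params(d_model, ffw_size, n_layers, vocab_size, max_positions):
    """Rough parameter count for a GPT-2-style decoder with tied embeddings."""
    # Attention: fused QKV projection plus output projection, weights and biases.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: up-projection to ffw_size, then down-projection to d_model.
    feed_forward = (d_model * ffw_size + ffw_size) + (ffw_size * d_model + d_model)
    # Two LayerNorms per block, each with a scale vector and a bias vector.
    layer_norms = 2 * (2 * d_model)
    per_layer = attention + feed_forward + layer_norms
    # Token embeddings (assumed shared with the output head) plus positions.
    embeddings = vocab_size * d_model + max_positions * d_model
    final_layer_norm = 2 * d_model
    return n_layers * per_layer + embeddings + final_layer_norm


# Values from the "Model parameters" line; 50304 is the padded vocab size.
total = estimate_gpt_params(768, 3072, 15, 50304, 2048)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate lands at about 146.5M parameters, which is consistent with the run name.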
Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 15 --hidden-size 768 --num-attention-heads 12 --kv-channels 64 --ffn-hidden-size 3072 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 4 --global-batch-size 256 --train-samples 84_762_549 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --loss-scale 12 --clip-grad 1.0 --kill-switch-path kill-switch-146m174b100m --bf16 --checkpoint-activations --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 84_762_549 --lr-warmup-samples 847_625 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 100 --save-interval 10000 --eval-interval 10000 --eval-iters 1 --tensorboard-dir tensorboard_146m174b100m --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_146m174b100m --load checkpoints_146m174b100m --train-weighted-split-paths-path train100m.txt --valid-weighted-split-paths-path val.txt --data-impl mmap --deepspeed --deepspeed_config ds_configs/3319491.json --zero-stage 0
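The sample-based flags in the command above imply a token budget and a learning-rate schedule. The sketch below derives both from the flag values. The `learning_rate` function is an illustrative reimplementation of linear warmup followed by cosine decay, assuming the schedule is driven by consumed samples. It is not the scheduler code from the repository, and its name is hypothetical.

```python
import math

# Values copied from the command line above.
SEQ_LENGTH = 2048
GLOBAL_BATCH_SIZE = 256
TRAIN_SAMPLES = 84_762_549
LR_DECAY_SAMPLES = 84_762_549
LR_WARMUP_SAMPLES = 847_625
MAX_LR = 2e-4
MIN_LR = 2e-5

total_tokens = TRAIN_SAMPLES * SEQ_LENGTH
full_iterations = TRAIN_SAMPLES // GLOBAL_BATCH_SIZE
warmup_fraction = LR_WARMUP_SAMPLES / LR_DECAY_SAMPLES


def learning_rate(consumed_samples):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR (assumed form)."""
    if consumed_samples < LR_WARMUP_SAMPLES:
        return MAX_LR * consumed_samples / LR_WARMUP_SAMPLES
    if consumed_samples >= LR_DECAY_SAMPLES:
        return MIN_LR
    progress = (consumed_samples - LR_WARMUP_SAMPLES) / (
        LR_DECAY_SAMPLES - LR_WARMUP_SAMPLES
    )
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (MAX_LR - MIN_LR)


print(f"tokens: {total_tokens:,}")
print(f"full iterations: {full_iterations:,}")
print(f"warmup fraction: {warmup_fraction:.4f}")
print(f"lr halfway through training: {learning_rate(TRAIN_SAMPLES // 2):.3e}")
```

The token total is about 173.6 billion, which matches the "174b" in the run name, and the warmup covers roughly 1% of the samples.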
START 3319491: Fri 17 Mar 2023 01:50:53 PM EET
0:
0:
0: ======================= ROCm System Management Interface =======================
0: ================================= Concise Info =================================
0: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0: 0 46.0c 96.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 1 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: 2 40.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 3 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: 4 45.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 5 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: 6 38.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
0: 7 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
0: ================================================================================
0: ============================= End of ROCm SMI Log ==============================
7:
7:
7: ======================= ROCm System Management Interface =======================
7: ================================= Concise Info =================================
7: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
7: 0 45.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 1 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: 2 38.0c 83.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 3 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: 4 43.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 5 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: 6 42.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
7: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
7: ================================================================================
7: ============================= End of ROCm SMI Log ==============================
1:
1:
1: ======================= ROCm System Management Interface =======================
1: ================================= Concise Info =================================
1: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
1: 0 45.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 1 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: 2 42.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 3 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: 4 49.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 5 52.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: 6 42.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
1: 7 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
1: ================================================================================
1: ============================= End of ROCm SMI Log ==============================
4:
4:
4: ======================= ROCm System Management Interface =======================
4: ================================= Concise Info =================================
4: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
4: 0 49.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: 2 41.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 3 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: 4 42.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 5 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: 6 43.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
4: 7 39.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
4: ================================================================================
4: ============================= End of ROCm SMI Log ==============================
5:
5:
5: ======================= ROCm System Management Interface =======================
5: ================================= Concise Info =================================
5: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
5: 0 46.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 1 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: 2 42.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 3 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: 4 44.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 5 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: 6 35.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
5: 7 40.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
5: ================================================================================
5: ============================= End of ROCm SMI Log ==============================
3:
3:
3: ======================= ROCm System Management Interface =======================
3: ================================= Concise Info =================================
3: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
3: 0 46.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 1 52.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: 2 46.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 3 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: 4 42.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 5 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: 6 47.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
3: 7 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
3: ================================================================================
3: ============================= End of ROCm SMI Log ==============================
2:
2:
2: ======================= ROCm System Management Interface =======================
2: ================================= Concise Info =================================
2: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
2: 0 45.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: 2 40.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 3 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: 4 45.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: 6 40.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
2: 7 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
2: ================================================================================
2: ============================= End of ROCm SMI Log ==============================
6:
6:
6: ======================= ROCm System Management Interface =======================
6: ================================= Concise Info =================================
6: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
6: 0 48.0c 83.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: 2 41.0c 96.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: 4 42.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 5 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: 6 39.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
6: 7 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
6: ================================================================================
6: ============================= End of ROCm SMI Log ==============================
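The per-node ROCm SMI tables above can be turned into records for quick checks, for example to confirm that every device is idle before training starts. The parser below is a minimal sketch that assumes the exact ten-column "Concise Info" layout shown in this log, prefixed by the node index. The function name `parse_rocm_smi_row` is hypothetical. In these tables the odd-numbered devices report `N/A` for average power and a 0.0W cap. That pattern would be consistent with dual-die accelerators that report power on one die per package, but the log itself does not state this.

```python
def parse_rocm_smi_row(line):
    """Parse one data row such as '0: 0 46.0c 96.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%'.

    Returns None for banners, headers, and any other line.
    """
    tokens = line.split()
    # A data row has the node prefix plus ten columns, and a numeric GPU index.
    if len(tokens) != 11 or not tokens[1].isdigit():
        return None
    node, gpu, temp, avg_pwr, _sclk, _mclk, _fan, _perf, pwr_cap, vram, util = tokens
    return {
        "node": int(node.rstrip(":")),
        "gpu": int(gpu),
        "temp_c": float(temp.rstrip("c")),
        # Some devices report N/A instead of a wattage.
        "avg_power_w": None if avg_pwr == "N/A" else float(avg_pwr.rstrip("W")),
        "power_cap_w": float(pwr_cap.rstrip("W")),
        "vram_pct": int(vram.rstrip("%")),
        "gpu_pct": int(util.rstrip("%")),
    }


sample = [
    "0: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%",
    "0: 0 46.0c 96.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%",
    "0: 1 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%",
]
rows = [r for r in map(parse_rocm_smi_row, sample) if r is not None]
print(rows)
```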
7: Launching on nid006946 (7/8), master nid006939 port 9999, GPUs 8, CUDA: True
4: Launching on nid006943 (4/8), master nid006939 port 9999, GPUs 8, CUDA: True
6: Launching on nid006945 (6/8), master nid006939 port 9999, GPUs 8, CUDA: True
3: Launching on nid006942 (3/8), master nid006939 port 9999, GPUs 8, CUDA: True
0: Launching on nid006939 (0/8), master nid006939 port 9999, GPUs 8, CUDA: True
5: Launching on nid006944 (5/8), master nid006939 port 9999, GPUs 8, CUDA: True
1: Launching on nid006940 (1/8), master nid006939 port 9999, GPUs 8, CUDA: True
2: Launching on nid006941 (2/8), master nid006939 port 9999, GPUs 8, CUDA: True
7: > setting tensorboard ...
0: using world size: 64, data-parallel-size: 64, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
0: accumulate and all-reduce gradients in fp32 for bfloat16 data type.
0: using torch.bfloat16 for parameters ...
0: ------------------------ arguments ------------------------
0: abort_on_unmet_fused_kernel_constraints ......... False
0: accumulate_allreduce_grads_in_fp32 .............. True
0: adam_beta1 ...................................... 0.9
0: adam_beta2 ...................................... 0.999
0: adam_eps ........................................ 1e-08
0: adlr_autoresume ................................. False
0: adlr_autoresume_interval ........................ 1000
0: apply_query_key_layer_scaling ................... True
0: apply_residual_connection_post_layernorm ........ False
0: attention_dropout ............................... 0.1
0: attention_softmax_in_fp32 ....................... False
0: bert_binary_head ................................ True
0: bert_load ....................................... None
0: bf16 ............................................ True
0: bias_dropout_fusion ............................. True
0: bias_gelu_fusion ................................ True
0: biencoder_projection_dim ........................ 0
0: biencoder_shared_query_context_model ............ False
0: block_data_path ................................. None
0: checkpoint_activations .......................... True
0: checkpoint_in_cpu ............................... False
0: checkpoint_num_layers ........................... 1
0: clip_grad ....................................... 1.0
0: codecarbon_dir .................................. None
0: consumed_train_samples .......................... 0
0: consumed_train_tokens ........................... 0
0: consumed_valid_samples .......................... 0
0: contigious_checkpointing ........................ False
0: cpu_optimizer ................................... False
0: cpu_torch_adam .................................. False
0: curriculum_learning ............................. False
0: data_impl ....................................... mmap
0: data_parallel_size .............................. 64
0: data_path ....................................... None
0: dataloader_type ................................. single
0: DDP_impl ........................................ local
0: decoder_seq_length .............................. None
0: deepscale ....................................... False
0: deepscale_config ................................ None
0: deepspeed ....................................... True
0: deepspeed_activation_checkpointing .............. False
0: deepspeed_config ................................ ds_configs/3319491.json
0: deepspeed_mpi ................................... False
0: distribute_checkpointed_activations ............. False
0: distributed_backend ............................. nccl
0: embed_layernorm ................................. False
0: embedding_path .................................. None
0: encoder_seq_length .............................. 2048
0: eod_mask_loss ................................... False
0: eval_interval ................................... 10000
0: eval_iters ...................................... 1
0: eval_only ....................................... None
0: evidence_data_path .............................. None
0: exit_duration_in_mins ........................... None
0: exit_interval ................................... None
0: ffn_hidden_size ................................. 3072
0: finetune ........................................ False
0: fp16 ............................................ False
0: fp16_lm_cross_entropy ........................... False
0: fp32_residual_connection ........................ False
0: gigaflos_no_embeds .............................. 0
0: global_batch_size ............................... 256
0: glu_activation .................................. None
0: hidden_dropout .................................. 0.1
0: hidden_size ..................................... 768
0: hysteresis ...................................... 2
0: ict_head_size ................................... None
0: ict_load ........................................ None
0: img_dim ......................................... 224
0: indexer_batch_size .............................. 128
0: indexer_log_interval ............................ 1000
0: inference ....................................... False
0: init_method_std ................................. 0.02
0: init_method_xavier_uniform ...................... False
0: initial_loss_scale .............................. 4294967296
0: kill_switch_path ................................ kill-switch-146m174b100m
0: kv_channels ..................................... 64
0: layer_norm_fusion ............................... True
0: layernorm_epsilon ............................... 1e-05
0: lazy_mpu_init ................................... None
0: load ............................................ checkpoints_146m174b100m
0: local_rank ...................................... None
0: log_batch_size_to_tensorboard ................... True
0: log_interval .................................... 100
0: log_learning_rate_to_tensorboard ................ True
0: log_level ....................................... None
0: log_level_replica ............................... None
0: log_loss_scale_to_tensorboard ................... True
0: log_num_zeros_in_grad ........................... False
0: log_params_norm ................................. False
0: log_path ........................................ None
0: log_timers_to_tensorboard ....................... True
0: log_validation_ppl_to_tensorboard ............... True
0: loss_on_targets_only ............................ False
0: loss_scale ...................................... 12.0
0: loss_scale_window ............................... 1000
0: lr .............................................. 0.0002
0: lr_decay_iters .................................. None
0: lr_decay_samples ................................ 84762549
0: lr_decay_style .................................. cosine
0: lr_decay_tokens ................................. None
0: lr_warmup_fraction .............................. None
0: lr_warmup_iters ................................. 0
0: lr_warmup_samples ............................... 847625
0: make_vocab_size_divisible_by .................... 128
0: mask_prob ....................................... 0.15
0: masked_softmax_fusion ........................... True
0: max_position_embeddings ......................... 2048
0: mean_noise_span_length .......................... None
0: memory_centric_tiled_linear ..................... False
0: merge_file ...................................... gpt2/merges.txt
0: micro_batch_size ................................ 4
0: min_loss_scale .................................. 1.0
0: min_lr .......................................... 2e-05
0: mmap_warmup ..................................... False
0: no_load_optim ................................... None
0: no_load_rng ..................................... None
0: no_save_optim ................................... None
0: no_save_rng ..................................... None
0: noise_density ................................... None
0: num_attention_heads ............................. 12
0: num_channels .................................... 3
0: num_classes ..................................... 1000
0: num_layers ...................................... 15
0: num_layers_per_virtual_pipeline_stage ........... None
0: num_workers ..................................... 2
0: onnx_safe ....................................... None
0: openai_gelu ..................................... False
0: optimizer ....................................... adam
0: optimizer_fusion ................................ True
0: override_lr_scheduler ........................... False
0: pad_vocab_size_to ............................... None
0: params_dtype .................................... torch.bfloat16
0: partition_activations ........................... False
0: patch_dim ....................................... 16
0: pipeline_model_parallel_size .................... 1
0: position_embedding_type ......................... PositionEmbeddingType.absolute
0: pp_partition_method ............................. None
0: profile_backward ................................ False
0: query_in_block_prob ............................. 0.1
0: rampup_batch_size ............................... None
0: rank ............................................ 0
0: remote_device ................................... none
0: reset_attention_mask ............................ False
0: reset_position_ids .............................. False
0: reset_progress .................................. None
0: retriever_report_topk_accuracies ................ []
0: retriever_score_scaling ......................... False
0: retriever_seq_length ............................ 256
0: reweight_loss_based_on_position_frequency ....... False
0: sample_rate ..................................... 1.0
0: save ............................................ checkpoints_146m174b100m
0: save_interval ................................... 10000
0: scatter_gather_tensors_in_pipeline .............. True
0: scattered_embeddings ............................ False
0: seed ............................................ 1234
0: seq_length ...................................... 2048
0: sgd_momentum .................................... 0.9
0: short_seq_prob .................................. 0.1
0: skip_train_iteration_range ...................... None
0: split ........................................... None
0: split_transformers .............................. False
0: sync_tp_duplicated_parameters ................... False
0: synchronize_each_layer .......................... False
0: tensor_model_parallel_size ...................... 1
0: tensorboard_dir ................................. tensorboard_146m174b100m
0: tensorboard_log_interval ........................ 1
0: tensorboard_queue_size .......................... 5
0: test_weighted_split_paths ....................... None
0: test_weighted_split_paths_path .................. None
0: tile_factor ..................................... 1
0: titles_data_path ................................ None
0: tokenizer_name_or_path .......................... None
0: tokenizer_type .................................. GPT2BPETokenizer
0: train_iters ..................................... None
0: train_samples ................................... 84762549
0: train_tokens .................................... None
0: train_weighted_split_names ...................... ['train']
0: train_weighted_split_paths ...................... [['/scratch/project_462000119/data/c4_subsampled/gpt2tok_c4_en_100M_text_document']]
0: train_weighted_split_paths_path ................. None
0: train_weighted_split_splits ..................... [['0:1']]
0: train_weighted_split_weights .................... [['1.0']]
0: universal_checkpoint ............................ False
0: use_bnb_optimizer ............................... False
0: use_checkpoint_lr_scheduler ..................... False
0: use_contiguous_buffers_in_ddp ................... True
0: use_cpu_initialization .......................... None
0: use_one_sent_docs ............................... False
0: use_pin_memory .................................. False
0: valid_num_workers ............................... 2
0: valid_weighted_split_names ...................... ['validation']
0: valid_weighted_split_paths ...................... [['/scratch/project_462000119/data/c4_validation/gpt2tok_c4validation_rerun_text_document']]
0: valid_weighted_split_paths_path ................. None
0: valid_weighted_split_splits ..................... [['0:1']]
0: valid_weighted_split_weights .................... [['1.0']]
0: virtual_pipeline_model_parallel_size ............ None
0: vocab_extra_ids ................................. 0
0: vocab_file ...................................... gpt2/vocab.json
0: weight_decay .................................... 0.1
0: world_size ...................................... 64
0: zero_allgather_bucket_size ...................... 0.0
0: zero_contigious_gradients ....................... False
0: zero_reduce_bucket_size ......................... 0.0
0: zero_reduce_scatter ............................. False
0: zero_stage ...................................... 0
0: -------------------- end of arguments ---------------------
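The arguments dump above follows a regular `rank: name ..... value` layout, so it can be loaded into a dictionary when comparing runs. The snippet below is a minimal sketch that assumes that layout. The helper name `parse_arguments` is hypothetical, and values are kept as raw strings rather than converted to Python types.

```python
import re

# "<rank>: <name> <two or more dots> <value>", as printed in the dump above.
ARG_LINE = re.compile(r"^\d+:\s+(\S+)\s+\.{2,}\s+(.*)$")


def parse_arguments(lines):
    """Collect name/value pairs from an arguments dump; other lines are skipped."""
    args = {}
    for line in lines:
        match = ARG_LINE.match(line)
        if match:
            args[match.group(1)] = match.group(2).strip()
    return args


sample = [
    "0: ------------------------ arguments ------------------------",
    "0: adam_eps ........................................ 1e-08",
    "0: global_batch_size ............................... 256",
    "0: train_weighted_split_splits ..................... [['0:1']]",
    "0: -------------------- end of arguments ---------------------",
]
parsed = parse_arguments(sample)
print(parsed)
```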
0: setting number of micro-batches to constant 1
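The constant of 1 micro-batch follows from the batch settings logged above: 8 nodes with 8 GPUs each give a world size of 64, and with tensor and pipeline parallel sizes of 1 all 64 ranks are data-parallel. A short sketch of the arithmetic, with a hypothetical helper name:

```python
def num_micro_batches(global_batch_size, micro_batch_size, data_parallel_size):
    """Micro-batches each rank processes per optimizer step."""
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global batch size must be divisible by "
            "micro batch size times data-parallel size"
        )
    return global_batch_size // samples_per_pass


# Values from the launch lines and the arguments dump in this log.
nodes, gpus_per_node = 8, 8
tensor_parallel, pipeline_parallel = 1, 1
world_size = nodes * gpus_per_node
data_parallel = world_size // (tensor_parallel * pipeline_parallel)

print(num_micro_batches(256, 4, data_parallel))
```

With a global batch of 256 and a micro-batch of 4 across 64 ranks, each rank runs a single forward/backward pass per step, so no gradient accumulation takes place.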
0: > building GPT2BPETokenizer tokenizer ...
0: > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
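The padded size can be reproduced from two values in the arguments dump: `make_vocab_size_divisible_by` (128) and `tensor_model_parallel_size` (1). The sketch below assumes the vocabulary is rounded up to the next multiple of their product. The function name is hypothetical.

```python
def pad_vocab_size(orig_vocab_size, divisible_by, tensor_parallel_size):
    """Round the vocabulary size up to the next multiple of divisible_by * TP size."""
    multiple = divisible_by * tensor_parallel_size
    # Ceiling division written with integer arithmetic only.
    return -(-orig_vocab_size // multiple) * multiple


padded = pad_vocab_size(50257, 128, 1)
print(padded, padded - 50257)
```

For the GPT-2 vocabulary of 50257 entries this gives 50304, i.e. 47 dummy tokens, matching the log line above.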
0: DeepSpeed general environment info:
0: torch install path ............... ['/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch']
0: torch version .................... 1.13.0+rocm5.2
0: torch cuda version ............... None
0: torch hip version ................ 5.2.21151-afdc89f8
0: nvcc version ..................... None
0: deepspeed install path ........... ['/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed']
0: deepspeed info ................... 0.7.5, unknown, unknown
0: deepspeed wheel compiled w. ...... torch 1.13, hip 5.1
0: **** Git info for Megatron: git_hash=unknown git_branch=unknown ****
0: > initializing torch distributed ...
0: [2023-03-17 13:53:41,482] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
0: > initializing tensor model parallel with size 1
0: > initializing pipeline model parallel with size 1
0: > setting random seeds to 1234 ...
0: > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
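The two seeds in the line above are related by a fixed offset. The sketch below assumes the convention "model parallel seed = base seed + 2718 + tensor-parallel rank, data parallel seed = base seed", which reproduces the logged values for rank 0. The offset constant is taken on that assumption rather than read from this log, and the function name is hypothetical.

```python
MODEL_PARALLEL_SEED_OFFSET = 2718  # assumed constant; reproduces the logged 3952


def model_parallel_seeds(seed, tensor_model_parallel_rank):
    """Derive the per-rank RNG seeds from the base seed (assumed convention)."""
    return {
        # Differs across tensor-parallel ranks so dropout inside sharded
        # regions is not identical on every shard.
        "model_parallel_seed": seed + MODEL_PARALLEL_SEED_OFFSET + tensor_model_parallel_rank,
        # Shared within a tensor-parallel group.
        "data_parallel_seed": seed,
    }


print(model_parallel_seeds(1234, 0))
```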
0: > compiling dataset index builder ...
0: make: Entering directory '/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/data'
0: make: Nothing to be done for 'default'.
0: make: Leaving directory '/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/data'
0: >>> done with dataset index builder. Compilation time: 0.065 seconds
0: > compiling and loading fused kernels ...