|
[2023-04-19 16:55:10,332] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. |
|
[2023-04-19 16:55:10,380] [INFO] [runner.py:540:main] cmd = /home/ubuntu/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path iamketan25/alpaca-instructions-dataset iamketan25/dolly-instructions-15k iamketan25/gsm-general-qa-instructions --model_name_or_path iamketan25/gpt-neo-1.3b-sft --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 1e-3 --weight_decay 0.1 --num_train_epochs 1 --gradient_accumulation_steps 2 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 3 --lora_dim 16 --lora_module_name h. --only_optimize_lora --deepspeed --output_dir ./gpt_neo_1.3b_sft_lora_dim_32_zero_stage3_epoch2 |
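
As a quick sanity check on the flags above (a sketch, not part of the original run): with 4 local GPUs, the effective global batch size follows from the per-device batch size and the gradient-accumulation steps, and it matches the train_batch_size=32 that DeepSpeed reports later in this log.

    # Hedged sketch: effective global batch size implied by the launcher flags above.
    per_device_train_batch_size = 4
    gradient_accumulation_steps = 2
    world_size = 4  # localhost GPUs 0-3

    train_batch_size = per_device_train_batch_size * gradient_accumulation_steps * world_size
    print(train_batch_size)  # 32, matching the DeepSpeed config printed further down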
|
[2023-04-19 16:55:12,124] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]} |
|
[2023-04-19 16:55:12,124] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0 |
|
[2023-04-19 16:55:12,124] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]}) |
|
[2023-04-19 16:55:12,124] [INFO] [launch.py:247:main] dist_world_size=4 |
|
[2023-04-19 16:55:12,124] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3 |
|
[2023-04-19 16:55:15,363] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl |
|
[2023-04-19 16:55:20,249] [INFO] [partition_parameters.py:436:__exit__] finished initializing model with 1.42B parameters |
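
The 1.42B figure above is DeepSpeed's count of the parameters it partitioned while the model was being constructed. A rough way to cross-check it (a sketch assuming a plain Hugging Face load outside of ZeRO-3; tied embeddings and the LoRA layers added later can shift the exact number):

    # Hypothetical cross-check of the reported parameter count; not taken from main.py.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("iamketan25/gpt-neo-1.3b-sft")
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params / 1e9:.2f}B parameters")  # compare with the 1.42B reported above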
|
Some weights of GPTNeoForCausalLM were not initialized from the model checkpoint at iamketan25/gpt-neo-1.3b-sft and are newly initialized: ['transformer.h.10.attn.attention.masked_bias', 'transformer.h.9.attn.attention.masked_bias', 'transformer.h.17.attn.attention.masked_bias', 'transformer.h.1.attn.attention.masked_bias', 'transformer.h.12.attn.attention.masked_bias', 'transformer.h.16.attn.attention.masked_bias', 'transformer.h.11.attn.attention.masked_bias', 'transformer.h.6.attn.attention.masked_bias', 'transformer.h.4.attn.attention.masked_bias', 'transformer.h.20.attn.attention.masked_bias', 'transformer.h.14.attn.attention.masked_bias', 'transformer.h.13.attn.attention.masked_bias', 'transformer.h.8.attn.attention.masked_bias', 'transformer.h.7.attn.attention.masked_bias', 'transformer.h.3.attn.attention.masked_bias', 'transformer.h.23.attn.attention.masked_bias', 'transformer.h.18.attn.attention.masked_bias', 'transformer.h.2.attn.attention.masked_bias', 'transformer.h.0.attn.attention.masked_bias', 'transformer.h.5.attn.attention.masked_bias', 'transformer.h.22.attn.attention.masked_bias', 'transformer.h.19.attn.attention.masked_bias', 'transformer.h.21.attn.attention.masked_bias', 'transformer.h.15.attn.attention.masked_bias'] |
|
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. |
|
|
Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/iamketan25___parquet/iamketan25--alpaca-instructions-dataset-57eb880093a82a29/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec) |
|
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 744.40it/s] |
|
|
Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/iamketan25___parquet/iamketan25--dolly-instructions-15k-4899fa89efa5cbf6/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec) |
|
100%|█████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1080.87it/s] |
|
|
Downloading and preparing dataset parquet/iamketan25--gsm-general-qa-instructions to /home/ubuntu/.cache/huggingface/datasets/iamketan25___parquet/iamketan25--gsm-general-qa-instructions-013277d5a826dcd4/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec... |
|
Downloading data: 100%|██████████████████████████████████████████████████████████████| 8.56M/8.56M [00:00<00:00, 90.5MB/s] |

Downloading data: 100%|████████████████████████████████████████████████████████████████| 967k/967k [00:00<00:00, 33.7MB/s] |

Downloading data files: 100%|███████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 4.51it/s] |

Extracting data files: 100%|██████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2247.15it/s] |
|
Generating train split: 0 examples [00:00, ? examples/s]
Generating test split: 0 examples [00:00, ? examples/s]
Dataset parquet downloaded and prepared to /home/ubuntu/.cache/huggingface/datasets/iamketan25___parquet/iamketan25--gsm-general-qa-instructions-013277d5a826dcd4/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data. |
|
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 969.45it/s] |
|
Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/iamketan25___parquet/iamketan25--gsm-general-qa-instructions-013277d5a826dcd4/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec) |
|
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 622.16it/s] |
|
|
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... |
|
To disable this warning, you can either: |
|
- Avoid using `tokenizers` before the fork if possible |
|
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) |
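
The warning above is emitted by every process that forks after the tokenizer has already been used. One way to follow its advice (a minimal sketch; the variable can equally be exported in the shell before launching) is to set the environment variable before any dataloader workers are spawned:

    # Set before tokenizers are used in a process that will later fork (e.g. DataLoader workers).
    import os
    os.environ["TOKENIZERS_PARALLELISM"] = "false"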
|
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
|
Detected CUDA files, patching ldflags |
|
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py39_cu117/fused_adam/build.ninja... |
|
Building extension module fused_adam... |
|
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
|
|
ninja: no work to do. |
|
Loading extension module fused_adam... |
|
Time to load fused_adam op: 0.5939269065856934 seconds |
|
Loading extension module fused_adam... |
|
Time to load fused_adam op: 0.6027216911315918 seconds |
|
Loading extension module fused_adam... |
|
Time to load fused_adam op: 0.6026906967163086 seconds |
|
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
|
Detected CUDA files, patching ldflags |
|
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py39_cu117/fused_adam/build.ninja... |
|
Building extension module fused_adam... |
|
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
|
|
ninja: no work to do. |
|
Loading extension module fused_adam... |
|
Time to load fused_adam op: 1.034358024597168 seconds |
|
[2023-04-19 16:57:05,669] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.0, git-hash=unknown, git-branch=unknown |
|
[2023-04-19 16:57:05,675] [INFO] [comm.py:580:init_distributed] Distributed backend already initialized |
|
[2023-04-19 16:57:05,871] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False |
|
[2023-04-19 16:57:05,873] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer |
|
[2023-04-19 16:57:05,873] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer |
|
[2023-04-19 16:57:05,893] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam |
|
[2023-04-19 16:57:05,893] [INFO] [utils.py:51:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'> |
|
[2023-04-19 16:57:05,894] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer |
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
|
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py39_cu117/utils/build.ninja... |
|
Building extension module utils... |
|
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
|
[2023-04-19 16:57:06,328] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning |
|
[2023-04-19 16:57:06,329] [INFO] [utils.py:786:see_memory_usage] MA 0.78 GB Max_MA 1.28 GB CA 3.41 GB Max_CA 3 GB |
|
[2023-04-19 16:57:06,329] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.22 GB, percent = 18.9% |
|
[2023-04-19 16:57:06,331] [INFO] [stage3.py:113:__init__] Reduce bucket size 500,000,000 |
|
[2023-04-19 16:57:06,331] [INFO] [stage3.py:114:__init__] Prefetch bucket size 30000000 |
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
|
ninja: no work to do. |
|
Loading extension module utils... |
|
Time to load utils op: 0.5918192863464355 seconds |
|
Loading extension module utils... |
|
Loading extension module utils... |
|
Time to load utils op: 0.6023216247558594 seconds |
|
Time to load utils op: 0.6022861003875732 seconds |
|
Loading extension module utils... |
|
Time to load utils op: 0.20152544975280762 seconds |
|
[2023-04-19 16:57:06,875] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] |
|
[2023-04-19 16:57:06,876] [INFO] [utils.py:786:see_memory_usage] MA 0.78 GB Max_MA 0.78 GB CA 3.41 GB Max_CA 3 GB |
|
[2023-04-19 16:57:06,876] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.23 GB, percent = 18.9% |
|
Parameter Offload: Total persistent parameters: 495616 in 170 params |
|
[2023-04-19 16:57:07,289] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end] |
|
[2023-04-19 16:57:07,289] [INFO] [utils.py:786:see_memory_usage] MA 0.76 GB Max_MA 0.78 GB CA 3.41 GB Max_CA 3 GB |
|
[2023-04-19 16:57:07,290] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.23 GB, percent = 18.9% |
|
[2023-04-19 16:57:07,630] [INFO] [utils.py:785:see_memory_usage] Before creating fp16 partitions |
|
[2023-04-19 16:57:07,631] [INFO] [utils.py:786:see_memory_usage] MA 0.76 GB Max_MA 0.76 GB CA 3.41 GB Max_CA 3 GB |
|
[2023-04-19 16:57:07,631] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.23 GB, percent = 18.9% |
|
[2023-04-19 16:57:08,444] [INFO] [utils.py:785:see_memory_usage] After creating fp16 partitions: 1 |
|
[2023-04-19 16:57:08,444] [INFO] [utils.py:786:see_memory_usage] MA 0.76 GB Max_MA 0.76 GB CA 1.67 GB Max_CA 3 GB |
|
[2023-04-19 16:57:08,444] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% |
|
[2023-04-19 16:57:08,786] [INFO] [utils.py:785:see_memory_usage] Before creating fp32 partitions |
|
[2023-04-19 16:57:08,786] [INFO] [utils.py:786:see_memory_usage] MA 0.76 GB Max_MA 0.76 GB CA 1.67 GB Max_CA 2 GB |
|
[2023-04-19 16:57:08,787] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% |
|
[2023-04-19 16:57:09,128] [INFO] [utils.py:785:see_memory_usage] After creating fp32 partitions |
|
[2023-04-19 16:57:09,128] [INFO] [utils.py:786:see_memory_usage] MA 0.77 GB Max_MA 0.78 GB CA 1.67 GB Max_CA 2 GB |
|
[2023-04-19 16:57:09,129] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% |
|
[2023-04-19 16:57:09,469] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states |
|
[2023-04-19 16:57:09,470] [INFO] [utils.py:786:see_memory_usage] MA 0.77 GB Max_MA 0.77 GB CA 1.67 GB Max_CA 2 GB |
|
[2023-04-19 16:57:09,470] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% |
|
[2023-04-19 16:57:09,810] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states |
|
[2023-04-19 16:57:09,811] [INFO] [utils.py:786:see_memory_usage] MA 0.8 GB Max_MA 0.81 GB CA 1.67 GB Max_CA 2 GB |
|
[2023-04-19 16:57:09,811] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% |
|
[2023-04-19 16:57:09,812] [INFO] [stage3.py:366:_setup_for_real_optimizer] optimizer state initialized |
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
No modifications detected for re-loaded extension module utils, skipping build step... |
|
Loading extension module utils... |
|
Time to load utils op: 0.00035953521728515625 seconds |
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
No modifications detected for re-loaded extension module utils, skipping build step... |
|
Loading extension module utils... |
|
Time to load utils op: 0.00038123130798339844 seconds |
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
No modifications detected for re-loaded extension module utils, skipping build step... |
|
Loading extension module utils... |
|
Time to load utils op: 0.0003180503845214844 seconds |
|
[2023-04-19 16:57:10,317] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer |
|
[2023-04-19 16:57:10,317] [INFO] [utils.py:786:see_memory_usage] MA 1.74 GB Max_MA 1.74 GB CA 2.61 GB Max_CA 3 GB |
|
[2023-04-19 16:57:10,318] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 35.25 GB, percent = 18.9% |
|
[2023-04-19 16:57:10,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam |
|
[2023-04-19 16:57:10,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler |
|
[2023-04-19 16:57:10,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f66ae604310> |
|
[2023-04-19 16:57:10,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[(0.9, 0.95)] |
|
[2023-04-19 16:57:10,319] [INFO] [config.py:953:print] DeepSpeedEngine configuration: |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] activation_checkpointing_config { |
|
"partition_activations": false, |
|
"contiguous_memory_optimization": false, |
|
"cpu_checkpointing": false, |
|
"number_checkpoints": null, |
|
"synchronize_checkpoint_boundary": false, |
|
"profile": false |
|
} |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] amp_enabled .................. False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] amp_params ................... False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] autotuning_config ............ { |
|
"enabled": false, |
|
"start_step": null, |
|
"end_step": null, |
|
"metric_path": null, |
|
"arg_mappings": null, |
|
"metric": "throughput", |
|
"model_info": null, |
|
"results_dir": "autotuning_results", |
|
"exps_dir": "autotuning_exps", |
|
"overwrite": true, |
|
"fast": true, |
|
"start_profile_step": 3, |
|
"end_profile_step": 5, |
|
"tuner_type": "gridsearch", |
|
"tuner_early_stopping": 5, |
|
"tuner_num_trials": 50, |
|
"model_info_path": null, |
|
"mp_size": 1, |
|
"max_train_batch_size": null, |
|
"min_train_batch_size": 1, |
|
"max_train_micro_batch_size_per_gpu": 1.024000e+03, |
|
"min_train_micro_batch_size_per_gpu": 1, |
|
"num_tuning_micro_batch_sizes": 3 |
|
} |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] bfloat16_enabled ............. False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f66f708c490> |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] communication_data_type ...... None |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] curriculum_enabled_legacy .... False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] curriculum_params_legacy ..... False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] data_efficiency_enabled ...... False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] dataloader_drop_last ......... False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] disable_allgather ............ False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] dump_state ................... False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1} |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_enabled ........... False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_gas_boundary_resolution 1 |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_layer_name ........ bert.encoder.layer |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_layer_num ......... 0 |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_max_iter .......... 100 |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_stability ......... 1e-06 |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_tol ............... 0.01 |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] eigenvalue_verbose ........... False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] elasticity_enabled ........... False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] flops_profiler_config ........ { |
|
"enabled": false, |
|
"profile_step": 1, |
|
"module_depth": -1, |
|
"top_modules": 1, |
|
"detailed": true, |
|
"output_file": null |
|
} |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] fp16_auto_cast ............... False |
|
[2023-04-19 16:57:10,320] [INFO] [config.py:957:print] fp16_enabled ................. True |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] fp16_master_weights_and_gradients False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] global_rank .................. 0 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] grad_accum_dtype ............. None |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] gradient_accumulation_steps .. 2 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] gradient_clipping ............ 1.0 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] gradient_predivide_factor .... 1.0 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] initial_dynamic_scale ........ 65536 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] load_universal_checkpoint .... False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] loss_scale ................... 0 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] memory_breakdown ............. False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] nebula_config ................ { |
|
"enabled": false, |
|
"persistent_storage_path": null, |
|
"persistent_time_interval": 100, |
|
"num_of_version_in_retention": 2, |
|
"enable_nebula_load": true, |
|
"load_path": null |
|
} |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] optimizer_legacy_fusion ...... False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] optimizer_name ............... None |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] optimizer_params ............. None |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] pld_enabled .................. False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] pld_params ................... False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] prescale_gradients ........... False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] scheduler_name ............... None |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] scheduler_params ............. None |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] sparse_attention ............. None |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] sparse_gradients_enabled ..... False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] steps_per_print .............. 10 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] train_batch_size ............. 32 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] train_micro_batch_size_per_gpu 4 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] use_node_local_storage ....... False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] wall_clock_breakdown ......... False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] world_size ................... 4 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] zero_allow_untested_optimizer False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=False |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] zero_enabled ................. True |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. True |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:957:print] zero_optimization_stage ...... 3 |
|
[2023-04-19 16:57:10,321] [INFO] [config.py:943:print_user_config] json = { |
|
"train_batch_size": 32, |
|
"train_micro_batch_size_per_gpu": 4, |
|
"steps_per_print": 10, |
|
"zero_optimization": { |
|
"stage": 3, |
|
"offload_param": { |
|
"device": "none" |
|
}, |
|
"offload_optimizer": { |
|
"device": "none" |
|
}, |
|
"stage3_param_persistence_threshold": 1.000000e+04, |
|
"stage3_max_live_parameters": 3.000000e+07, |
|
"stage3_prefetch_bucket_size": 3.000000e+07, |
|
"memory_efficient_linear": false |
|
}, |
|
"fp16": { |
|
"enabled": true, |
|
"loss_scale_window": 100 |
|
}, |
|
"gradient_clipping": 1.0, |
|
"prescale_gradients": false, |
|
"wall_clock_breakdown": false, |
|
"hybrid_engine": { |
|
"enabled": false, |
|
"inference_tp_size": 1, |
|
"release_inference_cache": false, |
|
"pin_parameters": true, |
|
"tp_gather_partition_size": 8 |
|
} |
|
} |
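
The JSON above is the user-supplied DeepSpeed configuration for this run. A minimal sketch of how such a config is typically passed to deepspeed.initialize (the dict below just mirrors the printout; model, optimizer and lr_scheduler are assumed to exist and the original script's config helpers are not shown):

    # Hedged sketch: wiring the printed config into deepspeed.initialize.
    import deepspeed

    ds_config = {
        "train_batch_size": 32,
        "train_micro_batch_size_per_gpu": 4,
        "steps_per_print": 10,
        "zero_optimization": {"stage": 3,
                              "offload_param": {"device": "none"},
                              "offload_optimizer": {"device": "none"}},
        "fp16": {"enabled": True, "loss_scale_window": 100},
        "gradient_clipping": 1.0,
    }

    model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=model, optimizer=optimizer, lr_scheduler=lr_scheduler, config=ds_config)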
|
Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root... |
|
No modifications detected for re-loaded extension module utils, skipping build step... |
|
Loading extension module utils... |
|
Time to load utils op: 0.0003190040588378906 seconds |
|
***** Running training ***** |
|
***** Evaluating perplexity, Epoch 0/1 ***** |
|
ppl: 1.932552456855774 |
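
The perplexity printed above is, in the usual convention, the exponential of the mean cross-entropy loss over the evaluation set; a minimal sketch of the computation (names are illustrative, not taken from main.py):

    # Illustrative perplexity computation: exp of the average eval loss.
    import math

    def perplexity(mean_loss: float) -> float:
        return math.exp(mean_loss)

    print(perplexity(0.659))  # a mean eval loss of ~0.659 corresponds to ppl ~1.93, as logged above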
|
Beginning of Epoch 1/1, Total Micro Batches 5116 |
|
Invalidate trace cache @ step 0: expected module 16, but got module 0 |
|
[2023-04-19 17:04:45,205] [WARNING] [stage3.py:1787:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time |
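
The allocator-cache warning above suggests emptying the accelerator cache so that all ranks flush at the same point. A hedged sketch of where such a call could go in a DeepSpeed training loop (the surrounding loop is illustrative, not the actual main.py; train_dataloader and model_engine are assumed):

    # Illustrative placement of the cache flush suggested by the warning above.
    from deepspeed.accelerator import get_accelerator

    for step, batch in enumerate(train_dataloader):
        outputs = model_engine(**batch, use_cache=False)  # batch is assumed to contain labels
        model_engine.backward(outputs.loss)
        model_engine.step()
        get_accelerator().empty_cache()  # flush on every rank at the same point in the loop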
|
[2023-04-19 17:05:06,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[0.000999962292024615], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:05:06,421] [INFO] [timer.py:199:stop] epoch=0/micro_step=20/global_step=10, RunningAvgSamplesPerSec=12.07828969080794, CurrSamplesPerSec=12.495106931233762, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
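
The logged learning rates are consistent with a zero-warmup cosine schedule over the whole run: 5116 micro-batches with 2 accumulation steps give about 2558 optimizer steps, and evaluating the standard cosine decay at step 10 reproduces the value above. A small check (a sketch assuming the transformers-style cosine schedule; the step count is inferred from this log, not read from main.py):

    # Check the logged lr against a cosine decay with no warmup.
    import math

    base_lr = 1e-3
    total_steps = math.ceil(5116 / 2)  # 2558 optimizer steps: micro-batches / accumulation steps

    def cosine_lr(step: int) -> float:
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

    print(cosine_lr(10))  # ~0.000999962, matching step=10 above
    print(cosine_lr(20))  # ~0.000999849, matching step=20 below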
|
[2023-04-19 17:05:32,168] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[0.0009998491737860256], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:05:32,169] [INFO] [timer.py:199:stop] epoch=0/micro_step=40/global_step=20, RunningAvgSamplesPerSec=12.27707734265837, CurrSamplesPerSec=12.316409530577367, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:05:58,637] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[0.0009996606623460709], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:05:58,637] [INFO] [timer.py:199:stop] epoch=0/micro_step=60/global_step=30, RunningAvgSamplesPerSec=12.213891713093837, CurrSamplesPerSec=12.335717709296935, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:06:25,208] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[0.0009993967861382895], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:06:25,208] [INFO] [timer.py:199:stop] epoch=0/micro_step=80/global_step=40, RunningAvgSamplesPerSec=12.171658146585271, CurrSamplesPerSec=12.415960783326993, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:06:50,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[0.0009990575849636322], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:06:50,949] [INFO] [timer.py:199:stop] epoch=0/micro_step=100/global_step=50, RunningAvgSamplesPerSec=12.227460835941864, CurrSamplesPerSec=12.498281061505862, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:07:17,048] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[0.0009986431099844567], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:07:17,049] [INFO] [timer.py:199:stop] epoch=0/micro_step=120/global_step=60, RunningAvgSamplesPerSec=12.235269962686647, CurrSamplesPerSec=12.427871199814515, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:07:42,760] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=0, lr=[0.0009981534237168124], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:07:42,760] [INFO] [timer.py:199:stop] epoch=0/micro_step=140/global_step=70, RunningAvgSamplesPerSec=12.26760719734709, CurrSamplesPerSec=12.447521475862185, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:08:09,216] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=0, lr=[0.0009975886000210103], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:08:09,216] [INFO] [timer.py:199:stop] epoch=0/micro_step=160/global_step=80, RunningAvgSamplesPerSec=12.246875617612654, CurrSamplesPerSec=12.463447132852547, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:08:34,878] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=0, lr=[0.0009969487240904821], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:08:34,878] [INFO] [timer.py:199:stop] epoch=0/micro_step=180/global_step=90, RunningAvgSamplesPerSec=12.273182191155067, CurrSamplesPerSec=12.479842927239387, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:09:01,201] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=0, lr=[0.0009962338924389318], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:09:01,202] [INFO] [timer.py:199:stop] epoch=0/micro_step=200/global_step=100, RunningAvgSamplesPerSec=12.262435671957197, CurrSamplesPerSec=12.444532300839375, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:09:03,712] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:09:06,241] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
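
The two messages above are DeepSpeed's dynamic fp16 loss scaling at work: when gradients overflow, the step is skipped and, once the hysteresis budget (delayed_shift=2 in the config above) is exhausted, the loss scale is halved; after scale_window=100 consecutive clean steps it is doubled again. A simplified sketch of that update rule (an illustration of the mechanism, not the actual loss_scaler.py code):

    # Simplified dynamic loss-scale update, mirroring the overflow messages above.
    class DynamicLossScale:
        def __init__(self, init_scale=65536, scale_window=100, hysteresis=2, min_scale=1):
            self.scale, self.window, self.hysteresis, self.min_scale = init_scale, scale_window, hysteresis, min_scale
            self.good_steps = 0

        def update(self, overflow: bool) -> bool:
            """Return True if the optimizer step should be skipped."""
            if overflow:
                self.good_steps = 0
                if self.hysteresis > 1:
                    self.hysteresis -= 1                              # "Reducing hysteresis to 1"
                else:
                    self.scale = max(self.scale / 2, self.min_scale)  # "reducing to 65536"
                return True
            self.good_steps += 1
            if self.good_steps % self.window == 0:
                self.scale *= 2                                       # grow again after a clean window
            return False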
|
[2023-04-19 17:09:27,458] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=2, lr=[0.0009956081310737383], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:09:27,458] [INFO] [timer.py:199:stop] epoch=0/micro_step=220/global_step=110, RunningAvgSamplesPerSec=12.256601643093727, CurrSamplesPerSec=10.073982804736758, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:09:53,231] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=2, lr=[0.0009947586584163801], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:09:53,232] [INFO] [timer.py:199:stop] epoch=0/micro_step=240/global_step=120, RunningAvgSamplesPerSec=12.270975908562303, CurrSamplesPerSec=12.399814784909182, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:10:19,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=2, lr=[0.0009938345603697695], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:10:19,854] [INFO] [timer.py:199:stop] epoch=0/micro_step=260/global_step=130, RunningAvgSamplesPerSec=12.25194301856043, CurrSamplesPerSec=12.366203557310797, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:10:45,602] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=2, lr=[0.0009928359763173725], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:10:45,603] [INFO] [timer.py:199:stop] epoch=0/micro_step=280/global_step=140, RunningAvgSamplesPerSec=12.26540750325886, CurrSamplesPerSec=12.465680069991771, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:11:12,054] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=2, lr=[0.0009917630568775197], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:11:12,055] [INFO] [timer.py:199:stop] epoch=0/micro_step=300/global_step=150, RunningAvgSamplesPerSec=12.254730873875417, CurrSamplesPerSec=12.451094211070863, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:11:37,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=2, lr=[0.0009906159638806912], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:11:37,805] [INFO] [timer.py:199:stop] epoch=0/micro_step=320/global_step=160, RunningAvgSamplesPerSec=12.266273586942694, CurrSamplesPerSec=12.372864453261226, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:12:04,310] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=2, lr=[0.0009893948703451048], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:12:04,311] [INFO] [timer.py:199:stop] epoch=0/micro_step=340/global_step=170, RunningAvgSamplesPerSec=12.255316195957107, CurrSamplesPerSec=12.35051278684349, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:12:30,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=2, lr=[0.00098809996045062], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:12:30,058] [INFO] [timer.py:199:stop] epoch=0/micro_step=360/global_step=180, RunningAvgSamplesPerSec=12.265586794678816, CurrSamplesPerSec=12.53894458266291, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:12:55,974] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=2, lr=[0.0009867314295109592], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:12:55,974] [INFO] [timer.py:199:stop] epoch=0/micro_step=380/global_step=190, RunningAvgSamplesPerSec=12.270574293164143, CurrSamplesPerSec=12.805881607610917, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:13:21,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=2, lr=[0.0009852894839442454], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:13:21,597] [INFO] [timer.py:199:stop] epoch=0/micro_step=400/global_step=200, RunningAvgSamplesPerSec=12.28206325598975, CurrSamplesPerSec=10.304444444615054, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:13:29,036] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:13:31,505] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 17:13:46,471] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=4, lr=[0.0009840832147423797], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:13:46,471] [INFO] [timer.py:199:stop] epoch=0/micro_step=420/global_step=210, RunningAvgSamplesPerSec=12.309450362425165, CurrSamplesPerSec=12.82604532079847, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:14:12,095] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=4, lr=[0.0009825096783456148], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:14:12,095] [INFO] [timer.py:199:stop] epoch=0/micro_step=440/global_step=220, RunningAvgSamplesPerSec=12.318099158821909, CurrSamplesPerSec=12.881070542377875, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:14:36,998] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=4, lr=[0.000980863364096554], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:14:36,999] [INFO] [timer.py:199:stop] epoch=0/micro_step=460/global_step=230, RunningAvgSamplesPerSec=12.341027356371558, CurrSamplesPerSec=12.877603892322144, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:15:02,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=4, lr=[0.0009791445203119053], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:15:02,518] [INFO] [timer.py:199:stop] epoch=0/micro_step=480/global_step=240, RunningAvgSamplesPerSec=12.349763302828249, CurrSamplesPerSec=12.845641906787987, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:15:27,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=4, lr=[0.0009773534062481454], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:15:27,392] [INFO] [timer.py:199:stop] epoch=0/micro_step=500/global_step=250, RunningAvgSamplesPerSec=12.370237821031711, CurrSamplesPerSec=12.822156205013146, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:15:52,996] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=4, lr=[0.0009754902920624147], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:15:52,996] [INFO] [timer.py:199:stop] epoch=0/micro_step=520/global_step=260, RunningAvgSamplesPerSec=12.375631075615024, CurrSamplesPerSec=12.870813186986009, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:16:17,856] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=4, lr=[0.0009735554587717682], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:16:17,857] [INFO] [timer.py:199:stop] epoch=0/micro_step=540/global_step=270, RunningAvgSamplesPerSec=12.393919730920452, CurrSamplesPerSec=12.862547808911936, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:16:43,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=4, lr=[0.0009715491982107905], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:16:43,461] [INFO] [timer.py:199:stop] epoch=0/micro_step=560/global_step=280, RunningAvgSamplesPerSec=12.398073557379359, CurrSamplesPerSec=12.863134582943513, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:17:08,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=4, lr=[0.0009694718129875771], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:17:08,936] [INFO] [timer.py:199:stop] epoch=0/micro_step=580/global_step=290, RunningAvgSamplesPerSec=12.404108615541837, CurrSamplesPerSec=12.860413192082941, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:17:33,853] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=4, lr=[0.0009673236164380912], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:17:33,853] [INFO] [timer.py:199:stop] epoch=0/micro_step=600/global_step=300, RunningAvgSamplesPerSec=12.418762449265945, CurrSamplesPerSec=12.851047499700309, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:17:46,275] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:17:48,741] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 17:17:59,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=310, skipped=6, lr=[0.0009655542924250932], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:17:59,496] [INFO] [timer.py:199:stop] epoch=0/micro_step=620/global_step=310, RunningAvgSamplesPerSec=12.421115391503212, CurrSamplesPerSec=12.816023451561925, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:18:24,477] [INFO] [logging.py:96:log_dist] [Rank 0] step=320, skipped=6, lr=[0.0009632794591562836], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:18:24,478] [INFO] [timer.py:199:stop] epoch=0/micro_step=640/global_step=320, RunningAvgSamplesPerSec=12.433364398393774, CurrSamplesPerSec=12.829373894544624, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:18:50,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=330, skipped=6, lr=[0.000960934748565705], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:18:50,281] [INFO] [timer.py:199:stop] epoch=0/micro_step=660/global_step=330, RunningAvgSamplesPerSec=12.4327725639688, CurrSamplesPerSec=12.797726234057041, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:19:15,444] [INFO] [logging.py:96:log_dist] [Rank 0] step=340, skipped=6, lr=[0.0009585205143105142], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:19:15,444] [INFO] [timer.py:199:stop] epoch=0/micro_step=680/global_step=340, RunningAvgSamplesPerSec=12.441374838659565, CurrSamplesPerSec=12.701174220816803, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:19:41,502] [INFO] [logging.py:96:log_dist] [Rank 0] step=350, skipped=6, lr=[0.0009560371205342551], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:19:41,502] [INFO] [timer.py:199:stop] epoch=0/micro_step=700/global_step=350, RunningAvgSamplesPerSec=12.437047446225632, CurrSamplesPerSec=12.582267551983822, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:20:07,018] [INFO] [logging.py:96:log_dist] [Rank 0] step=360, skipped=6, lr=[0.0009534849418119328], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:20:07,019] [INFO] [timer.py:199:stop] epoch=0/micro_step=720/global_step=360, RunningAvgSamplesPerSec=12.440282802443646, CurrSamplesPerSec=12.476507663951448, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:20:33,388] [INFO] [logging.py:96:log_dist] [Rank 0] step=370, skipped=6, lr=[0.0009508643630935172], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:20:33,388] [INFO] [timer.py:199:stop] epoch=0/micro_step=740/global_step=370, RunningAvgSamplesPerSec=12.43213407083404, CurrSamplesPerSec=12.501919098490374, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:20:59,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=380, skipped=6, lr=[0.0009481757796458796], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:20:59,805] [INFO] [timer.py:199:stop] epoch=0/micro_step=760/global_step=380, RunningAvgSamplesPerSec=12.423823090381852, CurrSamplesPerSec=12.381460225246393, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:21:25,673] [INFO] [logging.py:96:log_dist] [Rank 0] step=390, skipped=6, lr=[0.0009454195969931738], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:21:25,673] [INFO] [timer.py:199:stop] epoch=0/micro_step=780/global_step=390, RunningAvgSamplesPerSec=12.422769578096686, CurrSamplesPerSec=12.376197022211878, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:21:52,290] [INFO] [logging.py:96:log_dist] [Rank 0] step=400, skipped=6, lr=[0.0009425962308556705], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:21:52,290] [INFO] [timer.py:199:stop] epoch=0/micro_step=800/global_step=400, RunningAvgSamplesPerSec=12.412703792380244, CurrSamplesPerSec=12.393729084304733, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:22:10,308] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:22:12,846] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 17:22:18,036] [INFO] [logging.py:96:log_dist] [Rank 0] step=410, skipped=8, lr=[0.0009402894516714383], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:22:18,036] [INFO] [timer.py:199:stop] epoch=0/micro_step=820/global_step=410, RunningAvgSamplesPerSec=12.41341628187407, CurrSamplesPerSec=12.390996757255785, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:22:44,528] [INFO] [logging.py:96:log_dist] [Rank 0] step=420, skipped=8, lr=[0.0009373462351812672], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:22:44,528] [INFO] [timer.py:199:stop] epoch=0/micro_step=840/global_step=420, RunningAvgSamplesPerSec=12.405510418676558, CurrSamplesPerSec=12.443687744002439, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:23:10,274] [INFO] [logging.py:96:log_dist] [Rank 0] step=430, skipped=8, lr=[0.0009343370529268123], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:23:10,274] [INFO] [timer.py:199:stop] epoch=0/micro_step=860/global_step=430, RunningAvgSamplesPerSec=12.406355781250136, CurrSamplesPerSec=12.453858887280555, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:23:36,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=440, skipped=8, lr=[0.000931262358788755], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:23:36,829] [INFO] [timer.py:199:stop] epoch=0/micro_step=880/global_step=440, RunningAvgSamplesPerSec=12.398296176552364, CurrSamplesPerSec=12.41375136191017, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:24:03,063] [INFO] [logging.py:96:log_dist] [Rank 0] step=450, skipped=8, lr=[0.000928122616529059], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:24:03,064] [INFO] [timer.py:199:stop] epoch=0/micro_step=900/global_step=450, RunningAvgSamplesPerSec=12.394027658246998, CurrSamplesPerSec=10.898765305383344, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:24:29,250] [INFO] [logging.py:96:log_dist] [Rank 0] step=460, skipped=8, lr=[0.0009249182997210198], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:24:29,250] [INFO] [timer.py:199:stop] epoch=0/micro_step=920/global_step=460, RunningAvgSamplesPerSec=12.390452563746633, CurrSamplesPerSec=12.425945131237496, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:24:55,622] [INFO] [logging.py:96:log_dist] [Rank 0] step=470, skipped=8, lr=[0.0009216498916778344], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:24:55,623] [INFO] [timer.py:199:stop] epoch=0/micro_step=940/global_step=470, RunningAvgSamplesPerSec=12.38512612561435, CurrSamplesPerSec=12.463959862063918, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:25:21,314] [INFO] [logging.py:96:log_dist] [Rank 0] step=480, skipped=8, lr=[0.0009183178853797029], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:25:21,314] [INFO] [timer.py:199:stop] epoch=0/micro_step=960/global_step=480, RunningAvgSamplesPerSec=12.386858313326691, CurrSamplesPerSec=12.438629725788537, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:25:47,838] [INFO] [logging.py:96:log_dist] [Rank 0] step=490, skipped=8, lr=[0.0009149227833994717], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:25:47,838] [INFO] [timer.py:199:stop] epoch=0/micro_step=980/global_step=490, RunningAvgSamplesPerSec=12.38034147155676, CurrSamplesPerSec=12.433708277773485, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:26:13,689] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=8, lr=[0.000911465097826828], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:26:13,690] [INFO] [timer.py:199:stop] epoch=0/micro_step=1000/global_step=500, RunningAvgSamplesPerSec=12.380554445613633, CurrSamplesPerSec=12.456449068622906, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:26:37,501] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:26:40,028] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 17:26:40,028] [INFO] [logging.py:96:log_dist] [Rank 0] step=510, skipped=10, lr=[0.0009086542393346895], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:26:40,029] [INFO] [timer.py:199:stop] epoch=0/micro_step=1020/global_step=510, RunningAvgSamplesPerSec=12.376163880112658, CurrSamplesPerSec=12.67544033782103, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:27:05,740] [INFO] [logging.py:96:log_dist] [Rank 0] step=520, skipped=10, lr=[0.0009050852238427441], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:27:05,741] [INFO] [timer.py:199:stop] epoch=0/micro_step=1040/global_step=520, RunningAvgSamplesPerSec=12.377737660987972, CurrSamplesPerSec=12.420433708306238, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:27:32,180] [INFO] [logging.py:96:log_dist] [Rank 0] step=530, skipped=10, lr=[0.0009014551085762004], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:27:32,180] [INFO] [timer.py:199:stop] epoch=0/micro_step=1060/global_step=530, RunningAvgSamplesPerSec=12.372661889023702, CurrSamplesPerSec=12.377309799385882, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:27:58,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=540, skipped=10, lr=[0.0008977644410722474], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:27:58,848] [INFO] [timer.py:199:stop] epoch=0/micro_step=1080/global_step=540, RunningAvgSamplesPerSec=12.3657484083089, CurrSamplesPerSec=10.83456878461236, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:28:24,615] [INFO] [logging.py:96:log_dist] [Rank 0] step=550, skipped=10, lr=[0.0008940137780012825], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:28:24,615] [INFO] [timer.py:199:stop] epoch=0/micro_step=1100/global_step=550, RunningAvgSamplesPerSec=12.366938514326245, CurrSamplesPerSec=12.479740812725497, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:28:50,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=560, skipped=10, lr=[0.0008902036850829485], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:28:50,954] [INFO] [timer.py:199:stop] epoch=0/micro_step=1120/global_step=560, RunningAvgSamplesPerSec=12.363198818138457, CurrSamplesPerSec=12.443068244459347, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:29:16,787] [INFO] [logging.py:96:log_dist] [Rank 0] step=570, skipped=10, lr=[0.0008863347370008057], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:29:16,787] [INFO] [timer.py:199:stop] epoch=0/micro_step=1140/global_step=570, RunningAvgSamplesPerSec=12.3638415350952, CurrSamplesPerSec=12.362678227713483, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:29:43,335] [INFO] [logging.py:96:log_dist] [Rank 0] step=580, skipped=10, lr=[0.0008824075173156499], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:29:43,336] [INFO] [timer.py:199:stop] epoch=0/micro_step=1160/global_step=580, RunningAvgSamplesPerSec=12.35855420493839, CurrSamplesPerSec=12.404735802011816, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:30:09,222] [INFO] [logging.py:96:log_dist] [Rank 0] step=590, skipped=10, lr=[0.0008784226183774943], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:30:09,222] [INFO] [timer.py:199:stop] epoch=0/micro_step=1180/global_step=590, RunningAvgSamplesPerSec=12.358818701228529, CurrSamplesPerSec=12.365673775252885, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:30:35,854] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=10, lr=[0.000874380641236223], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:30:35,854] [INFO] [timer.py:199:stop] epoch=0/micro_step=1200/global_step=600, RunningAvgSamplesPerSec=12.353132909214159, CurrSamplesPerSec=12.35099239814929, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:31:01,682] [INFO] [logging.py:96:log_dist] [Rank 0] step=610, skipped=10, lr=[0.0008702821955509344], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:31:01,682] [INFO] [timer.py:199:stop] epoch=0/micro_step=1220/global_step=610, RunningAvgSamplesPerSec=12.353940832433471, CurrSamplesPerSec=12.39328276766209, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:31:04,560] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:31:07,486] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 17:31:28,208] [INFO] [logging.py:96:log_dist] [Rank 0] step=620, skipped=12, lr=[0.0008669631967817167], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:31:28,208] [INFO] [timer.py:199:stop] epoch=0/micro_step=1240/global_step=620, RunningAvgSamplesPerSec=12.349334304280147, CurrSamplesPerSec=12.38552657235039, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:31:54,781] [INFO] [logging.py:96:log_dist] [Rank 0] step=630, skipped=12, lr=[0.0008627646711857188], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:31:54,782] [INFO] [timer.py:199:stop] epoch=0/micro_step=1260/global_step=630, RunningAvgSamplesPerSec=12.344519090997533, CurrSamplesPerSec=10.799963854672287, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:32:20,670] [INFO] [logging.py:96:log_dist] [Rank 0] step=640, skipped=12, lr=[0.0008585114291045544], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:32:20,671] [INFO] [timer.py:199:stop] epoch=0/micro_step=1280/global_step=640, RunningAvgSamplesPerSec=12.344965329862202, CurrSamplesPerSec=12.400174503397956, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:32:47,133] [INFO] [logging.py:96:log_dist] [Rank 0] step=650, skipped=12, lr=[0.0008542041120628143], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:32:47,133] [INFO] [timer.py:199:stop] epoch=0/micro_step=1300/global_step=650, RunningAvgSamplesPerSec=12.341185303447821, CurrSamplesPerSec=12.410061177588837, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:33:13,000] [INFO] [logging.py:96:log_dist] [Rank 0] step=660, skipped=12, lr=[0.0008498433697413186], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:33:13,001] [INFO] [timer.py:199:stop] epoch=0/micro_step=1320/global_step=660, RunningAvgSamplesPerSec=12.341826339333801, CurrSamplesPerSec=12.325113758079137, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:33:39,578] [INFO] [logging.py:96:log_dist] [Rank 0] step=670, skipped=12, lr=[0.0008454298598791235], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:33:39,579] [INFO] [timer.py:199:stop] epoch=0/micro_step=1340/global_step=670, RunningAvgSamplesPerSec=12.337384792748615, CurrSamplesPerSec=12.432642918761928, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:34:05,425] [INFO] [logging.py:96:log_dist] [Rank 0] step=680, skipped=12, lr=[0.000840964248174314], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:34:05,426] [INFO] [timer.py:199:stop] epoch=0/micro_step=1360/global_step=680, RunningAvgSamplesPerSec=12.338203936574846, CurrSamplesPerSec=12.39753896095559, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:34:32,004] [INFO] [logging.py:96:log_dist] [Rank 0] step=690, skipped=12, lr=[0.0008364472081835954], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:34:32,004] [INFO] [timer.py:199:stop] epoch=0/micro_step=1380/global_step=690, RunningAvgSamplesPerSec=12.333943151295635, CurrSamplesPerSec=12.388178734300265, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:34:58,287] [INFO] [logging.py:96:log_dist] [Rank 0] step=700, skipped=12, lr=[0.0008318794212206986], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:34:58,288] [INFO] [timer.py:199:stop] epoch=0/micro_step=1400/global_step=700, RunningAvgSamplesPerSec=12.331818651748383, CurrSamplesPerSec=10.815059116797121, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:35:24,531] [INFO] [logging.py:96:log_dist] [Rank 0] step=710, skipped=12, lr=[0.0008272615762536171], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:35:24,532] [INFO] [timer.py:199:stop] epoch=0/micro_step=1420/global_step=710, RunningAvgSamplesPerSec=12.330020191929695, CurrSamplesPerSec=12.390247522970165, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:35:32,232] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:35:34,771] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 17:35:50,892] [INFO] [logging.py:96:log_dist] [Rank 0] step=720, skipped=14, lr=[0.0008235317263262469], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:35:50,892] [INFO] [timer.py:199:stop] epoch=0/micro_step=1440/global_step=720, RunningAvgSamplesPerSec=12.327496611969542, CurrSamplesPerSec=12.364812549050821, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:36:16,725] [INFO] [logging.py:96:log_dist] [Rank 0] step=730, skipped=14, lr=[0.0008188255371846346], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:36:16,725] [INFO] [timer.py:199:stop] epoch=0/micro_step=1460/global_step=730, RunningAvgSamplesPerSec=12.328485193644248, CurrSamplesPerSec=12.39404266958273, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:36:43,182] [INFO] [logging.py:96:log_dist] [Rank 0] step=740, skipped=14, lr=[0.0008140712589809891], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:36:43,182] [INFO] [timer.py:199:stop] epoch=0/micro_step=1480/global_step=740, RunningAvgSamplesPerSec=12.32543440337834, CurrSamplesPerSec=12.378232129443344, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:37:09,039] [INFO] [logging.py:96:log_dist] [Rank 0] step=750, skipped=14, lr=[0.0008092696088121323], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:37:09,039] [INFO] [timer.py:199:stop] epoch=0/micro_step=1500/global_step=750, RunningAvgSamplesPerSec=12.326270629881153, CurrSamplesPerSec=12.410741659317063, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:37:35,654] [INFO] [logging.py:96:log_dist] [Rank 0] step=760, skipped=14, lr=[0.0008044213109200901], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:37:35,655] [INFO] [timer.py:199:stop] epoch=0/micro_step=1520/global_step=760, RunningAvgSamplesPerSec=12.322338561705967, CurrSamplesPerSec=12.403199713747565, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:38:01,485] [INFO] [logging.py:96:log_dist] [Rank 0] step=770, skipped=14, lr=[0.0007995270965828522], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:38:01,486] [INFO] [timer.py:199:stop] epoch=0/micro_step=1540/global_step=770, RunningAvgSamplesPerSec=12.32335401862153, CurrSamplesPerSec=12.418311166214071, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:38:27,898] [INFO] [logging.py:96:log_dist] [Rank 0] step=780, skipped=14, lr=[0.0007945877040040741], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:38:27,899] [INFO] [timer.py:199:stop] epoch=0/micro_step=1560/global_step=780, RunningAvgSamplesPerSec=12.320793923911, CurrSamplesPerSec=12.407547599586076, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:38:54,088] [INFO] [logging.py:96:log_dist] [Rank 0] step=790, skipped=14, lr=[0.0007896038782017308], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:38:54,089] [INFO] [timer.py:199:stop] epoch=0/micro_step=1580/global_step=790, RunningAvgSamplesPerSec=12.319644966599517, CurrSamplesPerSec=12.290209522225812, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:39:19,990] [INFO] [logging.py:96:log_dist] [Rank 0] step=800, skipped=14, lr=[0.0007845763708957448], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:39:19,990] [INFO] [timer.py:199:stop] epoch=0/micro_step=1600/global_step=800, RunningAvgSamplesPerSec=12.320236636211249, CurrSamplesPerSec=12.744466835576597, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:39:45,824] [INFO] [logging.py:96:log_dist] [Rank 0] step=810, skipped=14, lr=[0.0007795059403946033], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:39:45,825] [INFO] [timer.py:199:stop] epoch=0/micro_step=1620/global_step=810, RunningAvgSamplesPerSec=12.321208267862733, CurrSamplesPerSec=12.790884182215338, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:39:58,273] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:40:00,745] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 17:40:10,738] [INFO] [logging.py:96:log_dist] [Rank 0] step=820, skipped=16, lr=[0.0007754192050125431], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:40:10,738] [INFO] [timer.py:199:stop] epoch=0/micro_step=1640/global_step=820, RunningAvgSamplesPerSec=12.327498662124329, CurrSamplesPerSec=12.849750726700075, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:40:36,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=830, skipped=16, lr=[0.000770273444289497], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:40:36,307] [INFO] [timer.py:199:stop] epoch=0/micro_step=1660/global_step=830, RunningAvgSamplesPerSec=12.329883167590443, CurrSamplesPerSec=12.826796704158907, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:41:01,233] [INFO] [logging.py:96:log_dist] [Rank 0] step=840, skipped=16, lr=[0.0007650869177089128], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:41:01,234] [INFO] [timer.py:199:stop] epoch=0/micro_step=1680/global_step=840, RunningAvgSamplesPerSec=12.335855108865013, CurrSamplesPerSec=12.83786436378276, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:41:26,908] [INFO] [logging.py:96:log_dist] [Rank 0] step=850, skipped=16, lr=[0.0007598604075644574], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:41:26,909] [INFO] [timer.py:199:stop] epoch=0/micro_step=1700/global_step=850, RunningAvgSamplesPerSec=12.337496481370808, CurrSamplesPerSec=12.849296794964276, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:41:51,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=860, skipped=16, lr=[0.0007545947021805939], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:41:51,839] [INFO] [timer.py:199:stop] epoch=0/micro_step=1720/global_step=860, RunningAvgSamplesPerSec=12.343223968422315, CurrSamplesPerSec=12.837561070790468, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:42:17,574] [INFO] [logging.py:96:log_dist] [Rank 0] step=870, skipped=16, lr=[0.0007492905957936784], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:42:17,575] [INFO] [timer.py:199:stop] epoch=0/micro_step=1740/global_step=870, RunningAvgSamplesPerSec=12.344409548735772, CurrSamplesPerSec=12.804170059275858, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:42:43,355] [INFO] [logging.py:96:log_dist] [Rank 0] step=880, skipped=16, lr=[0.0007439488884321635], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:42:43,355] [INFO] [timer.py:199:stop] epoch=0/micro_step=1760/global_step=880, RunningAvgSamplesPerSec=12.345321702239133, CurrSamplesPerSec=11.084045608043938, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:43:08,447] [INFO] [logging.py:96:log_dist] [Rank 0] step=890, skipped=16, lr=[0.0007385703857959276], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:43:08,447] [INFO] [timer.py:199:stop] epoch=0/micro_step=1780/global_step=890, RunningAvgSamplesPerSec=12.349907530652988, CurrSamplesPerSec=12.766375184074995, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:43:34,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=900, skipped=16, lr=[0.0007331558991347511], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:43:34,363] [INFO] [timer.py:199:stop] epoch=0/micro_step=1800/global_step=900, RunningAvgSamplesPerSec=12.350024217868127, CurrSamplesPerSec=12.711335014884598, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:43:59,715] [INFO] [logging.py:96:log_dist] [Rank 0] step=910, skipped=16, lr=[0.0007277062451259528], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:43:59,716] [INFO] [timer.py:199:stop] epoch=0/micro_step=1820/global_step=910, RunningAvgSamplesPerSec=12.353091868169075, CurrSamplesPerSec=12.581144742324476, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:44:18,143] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:44:20,668] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 17:44:25,811] [INFO] [logging.py:96:log_dist] [Rank 0] step=920, skipped=18, lr=[0.0007233217536252489], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:44:25,811] [INFO] [timer.py:199:stop] epoch=0/micro_step=1840/global_step=920, RunningAvgSamplesPerSec=12.35223511923478, CurrSamplesPerSec=12.48053108339018, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:44:51,480] [INFO] [logging.py:96:log_dist] [Rank 0] step=930, skipped=18, lr=[0.0007178108732699562], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:44:51,480] [INFO] [timer.py:199:stop] epoch=0/micro_step=1860/global_step=930, RunningAvgSamplesPerSec=12.3535888911587, CurrSamplesPerSec=12.434895936355083, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:45:18,016] [INFO] [logging.py:96:log_dist] [Rank 0] step=940, skipped=18, lr=[0.000712267140086472], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:45:18,016] [INFO] [timer.py:199:stop] epoch=0/micro_step=1880/global_step=940, RunningAvgSamplesPerSec=12.350506541077507, CurrSamplesPerSec=12.352032441613526, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:45:44,301] [INFO] [logging.py:96:log_dist] [Rank 0] step=950, skipped=18, lr=[0.0007066913902466141], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:45:44,301] [INFO] [timer.py:199:stop] epoch=0/micro_step=1900/global_step=950, RunningAvgSamplesPerSec=12.34875418553038, CurrSamplesPerSec=10.873309381187074, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:46:10,498] [INFO] [logging.py:96:log_dist] [Rank 0] step=960, skipped=18, lr=[0.0007010844647513335], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:46:10,499] [INFO] [timer.py:199:stop] epoch=0/micro_step=1920/global_step=960, RunningAvgSamplesPerSec=12.347471208831822, CurrSamplesPerSec=12.461529691197287, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:46:37,017] [INFO] [logging.py:96:log_dist] [Rank 0] step=970, skipped=18, lr=[0.000695447209303864], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:46:37,018] [INFO] [timer.py:199:stop] epoch=0/micro_step=1940/global_step=970, RunningAvgSamplesPerSec=12.344635758493899, CurrSamplesPerSec=10.841980905932228, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:47:02,810] [INFO] [logging.py:96:log_dist] [Rank 0] step=980, skipped=18, lr=[0.0006897804741821649], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:47:02,810] [INFO] [timer.py:199:stop] epoch=0/micro_step=1960/global_step=980, RunningAvgSamplesPerSec=12.345394722587116, CurrSamplesPerSec=12.393337697206483, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:47:29,294] [INFO] [logging.py:96:log_dist] [Rank 0] step=990, skipped=18, lr=[0.0006840851141106694], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:47:29,294] [INFO] [timer.py:199:stop] epoch=0/micro_step=1980/global_step=990, RunningAvgSamplesPerSec=12.34280801200078, CurrSamplesPerSec=12.368272990522472, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:47:55,162] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=18, lr=[0.0006783619881313676], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:47:55,162] [INFO] [timer.py:199:stop] epoch=0/micro_step=2000/global_step=1000, RunningAvgSamplesPerSec=12.343210472186934, CurrSamplesPerSec=12.339334311891653, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
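The lr values in the log_dist lines fall smoothly from just under 1e-3, which is the shape of a warmup-free cosine decay. As a rough, illustrative check only (the peak learning rate and total step count below are assumptions for the sketch, not values read from the run's configuration):

```python
import math

def cosine_lr(step, max_lr=1e-3, total_steps=2600, min_lr=0.0):
    """Plain cosine decay with no warmup (illustrative parameters only)."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# cosine_lr(1000) is roughly 6.8e-4, in the same ballpark as the
# lr=[0.0006783...] logged at global_step=1000 just above.
```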
[2023-04-19 17:48:21,596] [INFO] [logging.py:96:log_dist] [Rank 0] step=1010, skipped=18, lr=[0.0006726119594742333], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:48:21,597] [INFO] [timer.py:199:stop] epoch=0/micro_step=2020/global_step=1010, RunningAvgSamplesPerSec=12.340929143415744, CurrSamplesPerSec=12.375618457152315, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:48:44,790] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:48:47,330] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 17:48:47,331] [INFO] [logging.py:96:log_dist] [Rank 0] step=1020, skipped=20, lr=[0.0006679931493048548], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:48:47,331] [INFO] [timer.py:199:stop] epoch=0/micro_step=2040/global_step=1020, RunningAvgSamplesPerSec=12.341967169670106, CurrSamplesPerSec=12.609502682829376, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:49:13,879] [INFO] [logging.py:96:log_dist] [Rank 0] step=1030, skipped=20, lr=[0.000662196884036101], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:49:13,880] [INFO] [timer.py:199:stop] epoch=0/micro_step=2060/global_step=1030, RunningAvgSamplesPerSec=12.339214394571659, CurrSamplesPerSec=12.447448749159538, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:49:40,041] [INFO] [logging.py:96:log_dist] [Rank 0] step=1040, skipped=20, lr=[0.0006563761543029039], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:49:40,042] [INFO] [timer.py:199:stop] epoch=0/micro_step=2080/global_step=1040, RunningAvgSamplesPerSec=12.338287948072862, CurrSamplesPerSec=10.883461229547047, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:50:06,208] [INFO] [logging.py:96:log_dist] [Rank 0] step=1050, skipped=20, lr=[0.000650531838056998], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:50:06,208] [INFO] [timer.py:199:stop] epoch=0/micro_step=2100/global_step=1050, RunningAvgSamplesPerSec=12.337358766530718, CurrSamplesPerSec=12.446037095975667, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:50:32,591] [INFO] [logging.py:96:log_dist] [Rank 0] step=1060, skipped=20, lr=[0.0006446648168077156], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:50:32,592] [INFO] [timer.py:199:stop] epoch=0/micro_step=2120/global_step=1060, RunningAvgSamplesPerSec=12.335472446032648, CurrSamplesPerSec=12.43088691461683, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:50:58,360] [INFO] [logging.py:96:log_dist] [Rank 0] step=1070, skipped=20, lr=[0.000638775975489028], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:50:58,361] [INFO] [timer.py:199:stop] epoch=0/micro_step=2140/global_step=1070, RunningAvgSamplesPerSec=12.336356910225373, CurrSamplesPerSec=12.431173598785648, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:51:24,880] [INFO] [logging.py:96:log_dist] [Rank 0] step=1080, skipped=20, lr=[0.0006328662023260695], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:51:24,880] [INFO] [timer.py:199:stop] epoch=0/micro_step=2160/global_step=1080, RunningAvgSamplesPerSec=12.333914084704624, CurrSamplesPerSec=12.393311376739035, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:51:50,708] [INFO] [logging.py:96:log_dist] [Rank 0] step=1090, skipped=20, lr=[0.0006269363887011636], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:51:50,709] [INFO] [timer.py:199:stop] epoch=0/micro_step=2180/global_step=1090, RunningAvgSamplesPerSec=12.334540626156128, CurrSamplesPerSec=12.313492024901896, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:52:17,372] [INFO] [logging.py:96:log_dist] [Rank 0] step=1100, skipped=20, lr=[0.0006209874290193754], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:52:17,372] [INFO] [timer.py:199:stop] epoch=0/micro_step=2200/global_step=1100, RunningAvgSamplesPerSec=12.33154887382222, CurrSamplesPerSec=12.279735312421261, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:52:43,396] [INFO] [logging.py:96:log_dist] [Rank 0] step=1110, skipped=20, lr=[0.0006150202205736057], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:52:43,397] [INFO] [timer.py:199:stop] epoch=0/micro_step=2220/global_step=1110, RunningAvgSamplesPerSec=12.331350008579633, CurrSamplesPerSec=12.308929838645925, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:53:10,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=1120, skipped=20, lr=[0.0006090356634092513], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:53:10,047] [INFO] [timer.py:199:stop] epoch=0/micro_step=2240/global_step=1120, RunningAvgSamplesPerSec=12.328499180436237, CurrSamplesPerSec=12.302516879811852, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:53:12,580] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:53:15,130] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 17:53:36,361] [INFO] [logging.py:96:log_dist] [Rank 0] step=1130, skipped=22, lr=[0.0006042361331048955], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:53:36,361] [INFO] [timer.py:199:stop] epoch=0/micro_step=2260/global_step=1130, RunningAvgSamplesPerSec=12.327112608764459, CurrSamplesPerSec=12.31348072818655, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:54:02,770] [INFO] [logging.py:96:log_dist] [Rank 0] step=1140, skipped=22, lr=[0.0005982226246272145], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:54:02,771] [INFO] [timer.py:199:stop] epoch=0/micro_step=2280/global_step=1140, RunningAvgSamplesPerSec=12.3253524395009, CurrSamplesPerSec=12.32361769217392, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:54:29,536] [INFO] [logging.py:96:log_dist] [Rank 0] step=1150, skipped=22, lr=[0.0005921943010442869], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:54:29,537] [INFO] [timer.py:199:stop] epoch=0/micro_step=2300/global_step=1150, RunningAvgSamplesPerSec=12.322150881860248, CurrSamplesPerSec=12.302494326659394, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:54:55,579] [INFO] [logging.py:96:log_dist] [Rank 0] step=1160, skipped=22, lr=[0.0005861520716196217], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:54:55,579] [INFO] [timer.py:199:stop] epoch=0/micro_step=2320/global_step=1160, RunningAvgSamplesPerSec=12.321969112440607, CurrSamplesPerSec=12.287758879731456, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:55:22,343] [INFO] [logging.py:96:log_dist] [Rank 0] step=1170, skipped=22, lr=[0.0005800968477141724], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:55:22,344] [INFO] [timer.py:199:stop] epoch=0/micro_step=2340/global_step=1170, RunningAvgSamplesPerSec=12.318860355457225, CurrSamplesPerSec=12.258176434330014, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:55:48,361] [INFO] [logging.py:96:log_dist] [Rank 0] step=1180, skipped=22, lr=[0.000574029542648875], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:55:48,361] [INFO] [timer.py:199:stop] epoch=0/micro_step=2360/global_step=1180, RunningAvgSamplesPerSec=12.318809050479222, CurrSamplesPerSec=12.348782177117398, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:56:14,938] [INFO] [logging.py:96:log_dist] [Rank 0] step=1190, skipped=22, lr=[0.0005679510715668897], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:56:14,939] [INFO] [timer.py:199:stop] epoch=0/micro_step=2380/global_step=1190, RunningAvgSamplesPerSec=12.316525641512154, CurrSamplesPerSec=12.347010026434933, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:56:41,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=1200, skipped=22, lr=[0.0005618623512955685], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:56:41,308] [INFO] [timer.py:199:stop] epoch=0/micro_step=2400/global_step=1200, RunningAvgSamplesPerSec=12.315105038960837, CurrSamplesPerSec=10.827393883169135, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:57:07,704] [INFO] [logging.py:96:log_dist] [Rank 0] step=1210, skipped=22, lr=[0.0005557643002081674], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:57:07,705] [INFO] [timer.py:199:stop] epoch=0/micro_step=2420/global_step=1210, RunningAvgSamplesPerSec=12.313597870894732, CurrSamplesPerSec=12.29607905829239, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:57:34,436] [INFO] [logging.py:96:log_dist] [Rank 0] step=1220, skipped=22, lr=[0.000549657838085328], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:57:34,437] [INFO] [timer.py:199:stop] epoch=0/micro_step=2440/global_step=1220, RunningAvgSamplesPerSec=12.3108149927377, CurrSamplesPerSec=10.727947607851789, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:57:42,160] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 17:57:44,721] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 17:58:00,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=1230, skipped=24, lr=[0.000544767231347586], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:58:00,351] [INFO] [timer.py:199:stop] epoch=0/micro_step=2460/global_step=1230, RunningAvgSamplesPerSec=12.311230504672189, CurrSamplesPerSec=12.343848682564769, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:58:27,105] [INFO] [logging.py:96:log_dist] [Rank 0] step=1240, skipped=24, lr=[0.0005386479511690275], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:58:27,106] [INFO] [timer.py:199:stop] epoch=0/micro_step=2480/global_step=1240, RunningAvgSamplesPerSec=12.3084254813281, CurrSamplesPerSec=12.358059046593949, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:58:53,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=1250, skipped=24, lr=[0.0005325228416465036], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:58:53,096] [INFO] [timer.py:199:stop] epoch=0/micro_step=2500/global_step=1250, RunningAvgSamplesPerSec=12.308565558430116, CurrSamplesPerSec=12.362821707537037, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:59:19,726] [INFO] [logging.py:96:log_dist] [Rank 0] step=1260, skipped=24, lr=[0.0005263928266419306], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:59:19,726] [INFO] [timer.py:199:stop] epoch=0/micro_step=2520/global_step=1260, RunningAvgSamplesPerSec=12.306294872236686, CurrSamplesPerSec=12.285429537813988, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 17:59:45,724] [INFO] [logging.py:96:log_dist] [Rank 0] step=1270, skipped=24, lr=[0.0005202588307571282], mom=[(0.9, 0.95)] |
|
[2023-04-19 17:59:45,725] [INFO] [timer.py:199:stop] epoch=0/micro_step=2540/global_step=1270, RunningAvgSamplesPerSec=12.306418820802854, CurrSamplesPerSec=12.375375407494394, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 18:00:12,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=1280, skipped=24, lr=[0.0005141217791943596], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:00:12,350] [INFO] [timer.py:199:stop] epoch=0/micro_step=2560/global_step=1280, RunningAvgSamplesPerSec=12.304221787202156, CurrSamplesPerSec=12.32757139885218, MemAllocated=1.95GB, MaxMemAllocated=13.61GB |
|
[2023-04-19 18:00:38,706] [INFO] [logging.py:96:log_dist] [Rank 0] step=1290, skipped=24, lr=[0.0005079825976167822], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:00:38,707] [INFO] [timer.py:199:stop] epoch=0/micro_step=2580/global_step=1290, RunningAvgSamplesPerSec=12.303041637793852, CurrSamplesPerSec=10.799450282422233, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:01:05,115] [INFO] [logging.py:96:log_dist] [Rank 0] step=1300, skipped=24, lr=[0.000501842212008827], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:01:05,116] [INFO] [timer.py:199:stop] epoch=0/micro_step=2600/global_step=1300, RunningAvgSamplesPerSec=12.301692628837143, CurrSamplesPerSec=12.321747554864698, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:01:31,885] [INFO] [logging.py:96:log_dist] [Rank 0] step=1310, skipped=24, lr=[0.0004957015485365313], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:01:31,886] [INFO] [timer.py:199:stop] epoch=0/micro_step=2620/global_step=1310, RunningAvgSamplesPerSec=12.299057767730432, CurrSamplesPerSec=10.716507333570151, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:01:57,912] [INFO] [logging.py:96:log_dist] [Rank 0] step=1320, skipped=24, lr=[0.0004895615334078436], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:01:57,913] [INFO] [timer.py:199:stop] epoch=0/micro_step=2640/global_step=1320, RunningAvgSamplesPerSec=12.299131376088003, CurrSamplesPerSec=12.290784630892936, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:02:11,118] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:02:13,676] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:02:24,488] [INFO] [logging.py:96:log_dist] [Rank 0] step=1330, skipped=26, lr=[0.0004846506104651698], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:02:24,489] [INFO] [timer.py:199:stop] epoch=0/micro_step=2660/global_step=1330, RunningAvgSamplesPerSec=12.297245436028547, CurrSamplesPerSec=12.321097156185356, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:02:50,544] [INFO] [logging.py:96:log_dist] [Rank 0] step=1340, skipped=26, lr=[0.00047851409599768043], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:02:50,545] [INFO] [timer.py:199:stop] epoch=0/micro_step=2680/global_step=1340, RunningAvgSamplesPerSec=12.297227368370496, CurrSamplesPerSec=12.2613119643179, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:03:17,310] [INFO] [logging.py:96:log_dist] [Rank 0] step=1350, skipped=26, lr=[0.0004723808222899481], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:03:17,311] [INFO] [timer.py:199:stop] epoch=0/micro_step=2700/global_step=1350, RunningAvgSamplesPerSec=12.294722758257807, CurrSamplesPerSec=12.307697273572408, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:03:43,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=1360, skipped=26, lr=[0.0004662517144353085], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:03:43,363] [INFO] [timer.py:199:stop] epoch=0/micro_step=2720/global_step=1360, RunningAvgSamplesPerSec=12.294736732905998, CurrSamplesPerSec=12.305249806208746, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:04:10,013] [INFO] [logging.py:96:log_dist] [Rank 0] step=1370, skipped=26, lr=[0.0004601276968987546], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:04:10,014] [INFO] [timer.py:199:stop] epoch=0/micro_step=2740/global_step=1370, RunningAvgSamplesPerSec=12.292682779748246, CurrSamplesPerSec=12.342967791868034, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:04:36,396] [INFO] [logging.py:96:log_dist] [Rank 0] step=1380, skipped=26, lr=[0.0004540096933774962], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:04:36,396] [INFO] [timer.py:199:stop] epoch=0/micro_step=2760/global_step=1380, RunningAvgSamplesPerSec=12.291579956405803, CurrSamplesPerSec=12.317193944852322, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:05:02,819] [INFO] [logging.py:96:log_dist] [Rank 0] step=1390, skipped=26, lr=[0.00044789862666163807], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:05:02,819] [INFO] [timer.py:199:stop] epoch=0/micro_step=2780/global_step=1390, RunningAvgSamplesPerSec=12.290354577889605, CurrSamplesPerSec=12.290194891983273, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:05:29,595] [INFO] [logging.py:96:log_dist] [Rank 0] step=1400, skipped=26, lr=[0.0004417954184949932], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:05:29,596] [INFO] [timer.py:199:stop] epoch=0/micro_step=2800/global_step=1400, RunningAvgSamplesPerSec=12.287955655202952, CurrSamplesPerSec=12.235999243692683, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:05:55,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=1410, skipped=26, lr=[0.0004357009894360553], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:05:55,671] [INFO] [timer.py:199:stop] epoch=0/micro_step=2820/global_step=1410, RunningAvgSamplesPerSec=12.287937989215635, CurrSamplesPerSec=12.295943882201728, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:06:22,415] [INFO] [logging.py:96:log_dist] [Rank 0] step=1420, skipped=26, lr=[0.0004296162587191479], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:06:22,416] [INFO] [timer.py:199:stop] epoch=0/micro_step=2840/global_step=1420, RunningAvgSamplesPerSec=12.285696192675177, CurrSamplesPerSec=12.326868307105308, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:06:40,557] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:06:43,112] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:06:48,337] [INFO] [logging.py:96:log_dist] [Rank 0] step=1430, skipped=28, lr=[0.0004247560737470216], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:06:48,338] [INFO] [timer.py:199:stop] epoch=0/micro_step=2860/global_step=1430, RunningAvgSamplesPerSec=12.286200975609571, CurrSamplesPerSec=12.302398476683939, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:07:15,113] [INFO] [logging.py:96:log_dist] [Rank 0] step=1440, skipped=28, lr=[0.00041869111175856633], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:07:15,114] [INFO] [timer.py:199:stop] epoch=0/micro_step=2880/global_step=1440, RunningAvgSamplesPerSec=12.283901533897822, CurrSamplesPerSec=12.290463868931733, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:07:41,149] [INFO] [logging.py:96:log_dist] [Rank 0] step=1450, skipped=28, lr=[0.00041263841374433654], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:07:41,149] [INFO] [timer.py:199:stop] epoch=0/micro_step=2900/global_step=1450, RunningAvgSamplesPerSec=12.284042842085487, CurrSamplesPerSec=12.325609509044535, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:08:07,778] [INFO] [logging.py:96:log_dist] [Rank 0] step=1460, skipped=28, lr=[0.00040659889264428324], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:08:07,779] [INFO] [timer.py:199:stop] epoch=0/micro_step=2920/global_step=1460, RunningAvgSamplesPerSec=12.282261851329789, CurrSamplesPerSec=12.314158567797463, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:08:34,176] [INFO] [logging.py:96:log_dist] [Rank 0] step=1470, skipped=28, lr=[0.0004005734594108583], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:08:34,177] [INFO] [timer.py:199:stop] epoch=0/micro_step=2940/global_step=1470, RunningAvgSamplesPerSec=12.28124751646962, CurrSamplesPerSec=12.284246646807013, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:09:00,567] [INFO] [logging.py:96:log_dist] [Rank 0] step=1480, skipped=28, lr=[0.00039456302287161396], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:09:00,567] [INFO] [timer.py:199:stop] epoch=0/micro_step=2960/global_step=1480, RunningAvgSamplesPerSec=12.280273099429602, CurrSamplesPerSec=12.304348471794587, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:09:27,304] [INFO] [logging.py:96:log_dist] [Rank 0] step=1490, skipped=28, lr=[0.0003885684895921226], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:09:27,304] [INFO] [timer.py:199:stop] epoch=0/micro_step=2980/global_step=1490, RunningAvgSamplesPerSec=12.278213162378892, CurrSamplesPerSec=12.288593653679008, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:09:53,333] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=28, lr=[0.0003825907637392375], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:09:53,334] [INFO] [timer.py:199:stop] epoch=0/micro_step=3000/global_step=1500, RunningAvgSamplesPerSec=12.27840720321145, CurrSamplesPerSec=12.275688721970516, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:10:20,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=1510, skipped=28, lr=[0.0003766307469447161], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:10:20,080] [INFO] [timer.py:199:stop] epoch=0/micro_step=3020/global_step=1510, RunningAvgSamplesPerSec=12.276358165025238, CurrSamplesPerSec=12.341332347628354, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:10:46,100] [INFO] [logging.py:96:log_dist] [Rank 0] step=1520, skipped=28, lr=[0.00037068933816922456], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:10:46,100] [INFO] [timer.py:199:stop] epoch=0/micro_step=3040/global_step=1520, RunningAvgSamplesPerSec=12.276591206833402, CurrSamplesPerSec=12.318812821547843, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:11:10,214] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:11:12,767] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:11:12,767] [INFO] [logging.py:96:log_dist] [Rank 0] step=1530, skipped=30, lr=[0.000365950211235768], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:11:12,768] [INFO] [timer.py:199:stop] epoch=0/micro_step=3060/global_step=1530, RunningAvgSamplesPerSec=12.274823812996173, CurrSamplesPerSec=12.545723730485774, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:11:39,159] [INFO] [logging.py:96:log_dist] [Rank 0] step=1540, skipped=30, lr=[0.00036004455323017474], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:11:39,159] [INFO] [timer.py:199:stop] epoch=0/micro_step=3080/global_step=1540, RunningAvgSamplesPerSec=12.273924370839671, CurrSamplesPerSec=10.841243528321511, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:12:05,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=1550, skipped=30, lr=[0.00035416000497074865], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:12:05,393] [INFO] [timer.py:199:stop] epoch=0/micro_step=3100/global_step=1550, RunningAvgSamplesPerSec=12.273519460152688, CurrSamplesPerSec=12.274678332960356, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:12:31,989] [INFO] [logging.py:96:log_dist] [Rank 0] step=1560, skipped=30, lr=[0.0003482974540350933], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:12:31,990] [INFO] [timer.py:199:stop] epoch=0/micro_step=3120/global_step=1560, RunningAvgSamplesPerSec=12.272017635829767, CurrSamplesPerSec=10.848107117940044, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:12:57,722] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=30, lr=[0.0003424577846829144], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:12:57,722] [INFO] [timer.py:199:stop] epoch=0/micro_step=3140/global_step=1570, RunningAvgSamplesPerSec=12.273125881399084, CurrSamplesPerSec=12.488360282731884, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:13:24,190] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=30, lr=[0.00033664187772264466], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:13:24,190] [INFO] [timer.py:199:stop] epoch=0/micro_step=3160/global_step=1580, RunningAvgSamplesPerSec=12.27202700089158, CurrSamplesPerSec=12.450442790397208, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:13:49,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=1590, skipped=30, lr=[0.00033085061037859], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:13:49,949] [INFO] [timer.py:199:stop] epoch=0/micro_step=3180/global_step=1590, RunningAvgSamplesPerSec=12.27304390587419, CurrSamplesPerSec=12.406213786289506, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:14:16,439] [INFO] [logging.py:96:log_dist] [Rank 0] step=1600, skipped=30, lr=[0.00032508485615861607], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:14:16,439] [INFO] [timer.py:199:stop] epoch=0/micro_step=3200/global_step=1600, RunningAvgSamplesPerSec=12.271896058667723, CurrSamplesPerSec=12.42768362926357, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:14:42,173] [INFO] [logging.py:96:log_dist] [Rank 0] step=1610, skipped=30, lr=[0.0003193454847223962], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:14:42,174] [INFO] [timer.py:199:stop] epoch=0/micro_step=3220/global_step=1610, RunningAvgSamplesPerSec=12.272971178673968, CurrSamplesPerSec=12.442486870831228, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:15:08,528] [INFO] [logging.py:96:log_dist] [Rank 0] step=1620, skipped=30, lr=[0.00031363336175023725], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:15:08,528] [INFO] [timer.py:199:stop] epoch=0/micro_step=3240/global_step=1620, RunningAvgSamplesPerSec=12.27222947378135, CurrSamplesPerSec=12.439632699307893, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:15:34,663] [INFO] [logging.py:96:log_dist] [Rank 0] step=1630, skipped=30, lr=[0.0003079493488125092], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:15:34,664] [INFO] [timer.py:199:stop] epoch=0/micro_step=3260/global_step=1630, RunningAvgSamplesPerSec=12.272130563979738, CurrSamplesPerSec=12.395891314029205, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:15:37,182] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:15:39,970] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:16:00,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=1640, skipped=32, lr=[0.0003034229539589651], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:16:00,635] [INFO] [timer.py:199:stop] epoch=0/micro_step=3280/global_step=1640, RunningAvgSamplesPerSec=12.272505412199054, CurrSamplesPerSec=12.391153478786137, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:16:27,180] [INFO] [logging.py:96:log_dist] [Rank 0] step=1650, skipped=32, lr=[0.00029779169662424564], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:16:27,180] [INFO] [timer.py:199:stop] epoch=0/micro_step=3300/global_step=1650, RunningAvgSamplesPerSec=12.271236433244209, CurrSamplesPerSec=10.890016115419574, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:16:52,970] [INFO] [logging.py:96:log_dist] [Rank 0] step=1660, skipped=32, lr=[0.00029219093875243143], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:16:52,971] [INFO] [timer.py:199:stop] epoch=0/micro_step=3320/global_step=1660, RunningAvgSamplesPerSec=12.27212462539587, CurrSamplesPerSec=12.348976463033916, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:17:19,514] [INFO] [logging.py:96:log_dist] [Rank 0] step=1670, skipped=32, lr=[0.0002866215251164824], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:17:19,515] [INFO] [timer.py:199:stop] epoch=0/micro_step=3340/global_step=1670, RunningAvgSamplesPerSec=12.270876288262468, CurrSamplesPerSec=12.430595638622705, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:17:45,354] [INFO] [logging.py:96:log_dist] [Rank 0] step=1680, skipped=32, lr=[0.0002810842957616477], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:17:45,355] [INFO] [timer.py:199:stop] epoch=0/micro_step=3360/global_step=1680, RunningAvgSamplesPerSec=12.271617283094267, CurrSamplesPerSec=12.406374333326339, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:18:11,907] [INFO] [logging.py:96:log_dist] [Rank 0] step=1690, skipped=32, lr=[0.00027558008587876047], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:18:11,907] [INFO] [timer.py:199:stop] epoch=0/micro_step=3380/global_step=1690, RunningAvgSamplesPerSec=12.270363834350324, CurrSamplesPerSec=12.39498237596127, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:18:37,648] [INFO] [logging.py:96:log_dist] [Rank 0] step=1700, skipped=32, lr=[0.00027010972567826367], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:18:37,649] [INFO] [timer.py:199:stop] epoch=0/micro_step=3400/global_step=1700, RunningAvgSamplesPerSec=12.271373139771677, CurrSamplesPerSec=12.584368640670158, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:19:03,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=1710, skipped=32, lr=[0.000264674040264988], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:19:03,554] [INFO] [timer.py:199:stop] epoch=0/micro_step=3420/global_step=1710, RunningAvgSamplesPerSec=12.271917693197654, CurrSamplesPerSec=12.828423573004878, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:19:28,898] [INFO] [logging.py:96:log_dist] [Rank 0] step=1720, skipped=32, lr=[0.00025927384951370127], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:19:28,899] [INFO] [timer.py:199:stop] epoch=0/micro_step=3440/global_step=1720, RunningAvgSamplesPerSec=12.273993814762228, CurrSamplesPerSec=12.802104838251223, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:19:54,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=1730, skipped=32, lr=[0.0002539099679454425], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:19:54,047] [INFO] [timer.py:199:stop] epoch=0/micro_step=3460/global_step=1730, RunningAvgSamplesPerSec=12.276580675964897, CurrSamplesPerSec=12.900282740307876, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:20:01,466] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:20:03,937] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:20:19,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=1740, skipped=34, lr=[0.00024964554916762446], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:20:19,637] [INFO] [timer.py:199:stop] epoch=0/micro_step=3480/global_step=1740, RunningAvgSamplesPerSec=12.277943442484174, CurrSamplesPerSec=12.880564950258988, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:20:44,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=1750, skipped=34, lr=[0.00024434905916265827], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:20:44,555] [INFO] [timer.py:199:stop] epoch=0/micro_step=3500/global_step=1750, RunningAvgSamplesPerSec=12.281099911613431, CurrSamplesPerSec=12.852529143234488, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:21:10,261] [INFO] [logging.py:96:log_dist] [Rank 0] step=1760, skipped=34, lr=[0.00023909112947522872], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:21:10,262] [INFO] [timer.py:199:stop] epoch=0/micro_step=3520/global_step=1760, RunningAvgSamplesPerSec=12.282106025091206, CurrSamplesPerSec=12.809001687377384, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:21:35,230] [INFO] [logging.py:96:log_dist] [Rank 0] step=1770, skipped=34, lr=[0.00023387255316886947], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:21:35,231] [INFO] [timer.py:199:stop] epoch=0/micro_step=3540/global_step=1770, RunningAvgSamplesPerSec=12.285070635156188, CurrSamplesPerSec=12.802605510726918, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:22:00,969] [INFO] [logging.py:96:log_dist] [Rank 0] step=1780, skipped=34, lr=[0.00022869411737136774], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:22:00,969] [INFO] [timer.py:199:stop] epoch=0/micro_step=3560/global_step=1780, RunningAvgSamplesPerSec=12.285961243756674, CurrSamplesPerSec=12.798800161308426, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:22:26,394] [INFO] [logging.py:96:log_dist] [Rank 0] step=1790, skipped=34, lr=[0.0002235566031560417], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:22:26,395] [INFO] [timer.py:199:stop] epoch=0/micro_step=3580/global_step=1790, RunningAvgSamplesPerSec=12.287666145186579, CurrSamplesPerSec=11.188044943992558, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:22:51,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=1800, skipped=34, lr=[0.00021846078542393004], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:22:51,936] [INFO] [timer.py:199:stop] epoch=0/micro_step=3600/global_step=1800, RunningAvgSamplesPerSec=12.289049568018969, CurrSamplesPerSec=12.733015954391846, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:23:17,617] [INFO] [logging.py:96:log_dist] [Rank 0] step=1810, skipped=34, lr=[0.00021340743278691076], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:23:17,617] [INFO] [timer.py:199:stop] epoch=0/micro_step=3620/global_step=1810, RunningAvgSamplesPerSec=12.290053744673086, CurrSamplesPerSec=12.60355859829003, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:23:43,421] [INFO] [logging.py:96:log_dist] [Rank 0] step=1820, skipped=34, lr=[0.00020839730745177148], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:23:43,421] [INFO] [timer.py:199:stop] epoch=0/micro_step=3640/global_step=1820, RunningAvgSamplesPerSec=12.290727365115005, CurrSamplesPerSec=12.491374039968793, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:24:09,755] [INFO] [logging.py:96:log_dist] [Rank 0] step=1830, skipped=34, lr=[0.00020343116510524367], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:24:09,756] [INFO] [timer.py:199:stop] epoch=0/micro_step=3660/global_step=1830, RunningAvgSamplesPerSec=12.290024273445455, CurrSamplesPerSec=12.415638048515792, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:24:22,647] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:24:25,197] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:24:35,610] [INFO] [logging.py:96:log_dist] [Rank 0] step=1840, skipped=36, lr=[0.00019949042256902537], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:24:35,610] [INFO] [timer.py:199:stop] epoch=0/micro_step=3680/global_step=1840, RunningAvgSamplesPerSec=12.290559956150119, CurrSamplesPerSec=12.350383229989983, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:25:02,172] [INFO] [logging.py:96:log_dist] [Rank 0] step=1850, skipped=36, lr=[0.00019460533268455865], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:25:02,173] [INFO] [timer.py:199:stop] epoch=0/micro_step=3700/global_step=1850, RunningAvgSamplesPerSec=12.289282459116265, CurrSamplesPerSec=12.435905221269186, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:25:27,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=1860, skipped=36, lr=[0.00018976630605848356], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:25:27,920] [INFO] [timer.py:199:stop] epoch=0/micro_step=3720/global_step=1860, RunningAvgSamplesPerSec=12.290090474590235, CurrSamplesPerSec=12.435068747489805, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:25:54,350] [INFO] [logging.py:96:log_dist] [Rank 0] step=1870, skipped=36, lr=[0.00018497407257038722], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:25:54,351] [INFO] [timer.py:199:stop] epoch=0/micro_step=3740/global_step=1870, RunningAvgSamplesPerSec=12.28916153999188, CurrSamplesPerSec=12.468483660129504, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:26:20,422] [INFO] [logging.py:96:log_dist] [Rank 0] step=1880, skipped=36, lr=[0.00018022935504195952], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:26:20,422] [INFO] [timer.py:199:stop] epoch=0/micro_step=3760/global_step=1880, RunningAvgSamplesPerSec=12.289146106288744, CurrSamplesPerSec=10.895669559644977, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:26:46,605] [INFO] [logging.py:96:log_dist] [Rank 0] step=1890, skipped=36, lr=[0.00017553286912796773], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:26:46,605] [INFO] [timer.py:199:stop] epoch=0/micro_step=3780/global_step=1890, RunningAvgSamplesPerSec=12.288853181453334, CurrSamplesPerSec=12.376522274843671, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:27:12,953] [INFO] [logging.py:96:log_dist] [Rank 0] step=1900, skipped=36, lr=[0.00017088532320831245], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:27:12,954] [INFO] [timer.py:199:stop] epoch=0/micro_step=3800/global_step=1900, RunningAvgSamplesPerSec=12.288151161040204, CurrSamplesPerSec=11.314906634422599, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:27:38,340] [INFO] [logging.py:96:log_dist] [Rank 0] step=1910, skipped=36, lr=[0.00016628741828118255], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:27:38,340] [INFO] [timer.py:199:stop] epoch=0/micro_step=3820/global_step=1910, RunningAvgSamplesPerSec=12.289834917827946, CurrSamplesPerSec=12.758622957963134, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:28:04,093] [INFO] [logging.py:96:log_dist] [Rank 0] step=1920, skipped=36, lr=[0.0001617398478573211], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:28:04,094] [INFO] [timer.py:199:stop] epoch=0/micro_step=3840/global_step=1920, RunningAvgSamplesPerSec=12.290598323679253, CurrSamplesPerSec=12.841313334019006, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:28:29,058] [INFO] [logging.py:96:log_dist] [Rank 0] step=1930, skipped=36, lr=[0.0001572432978554223], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:28:29,058] [INFO] [timer.py:199:stop] epoch=0/micro_step=3860/global_step=1930, RunningAvgSamplesPerSec=12.293287579333615, CurrSamplesPerSec=12.792861649178567, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:28:47,200] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:28:49,671] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:28:54,671] [INFO] [logging.py:96:log_dist] [Rank 0] step=1940, skipped=38, lr=[0.0001536832485848859], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:28:54,671] [INFO] [timer.py:199:stop] epoch=0/micro_step=3880/global_step=1940, RunningAvgSamplesPerSec=12.294367966683168, CurrSamplesPerSec=12.86009034984407, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:29:19,558] [INFO] [logging.py:96:log_dist] [Rank 0] step=1950, skipped=38, lr=[0.00014928023922823442], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:29:19,558] [INFO] [timer.py:199:stop] epoch=0/micro_step=3900/global_step=1950, RunningAvgSamplesPerSec=12.29720127164094, CurrSamplesPerSec=12.879749164030441, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:29:45,078] [INFO] [logging.py:96:log_dist] [Rank 0] step=1960, skipped=38, lr=[0.00014493012960000785], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:29:45,079] [INFO] [timer.py:199:stop] epoch=0/micro_step=3920/global_step=1960, RunningAvgSamplesPerSec=12.298476526249685, CurrSamplesPerSec=12.867855376739639, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:30:10,380] [INFO] [logging.py:96:log_dist] [Rank 0] step=1970, skipped=38, lr=[0.0001406335758355134], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:30:10,380] [INFO] [timer.py:199:stop] epoch=0/micro_step=3940/global_step=1970, RunningAvgSamplesPerSec=12.300264143525318, CurrSamplesPerSec=12.820006808415691, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:30:35,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=1980, skipped=38, lr=[0.00013639122599212533], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:30:35,733] [INFO] [timer.py:199:stop] epoch=0/micro_step=3960/global_step=1980, RunningAvgSamplesPerSec=12.301912515264485, CurrSamplesPerSec=12.802970659982275, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:31:01,381] [INFO] [logging.py:96:log_dist] [Rank 0] step=1990, skipped=38, lr=[0.00013220371995153736], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:31:01,382] [INFO] [timer.py:199:stop] epoch=0/micro_step=3980/global_step=1990, RunningAvgSamplesPerSec=12.302840564663645, CurrSamplesPerSec=11.593263562391549, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:31:26,506] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=38, lr=[0.00012807168932324857], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:31:26,506] [INFO] [timer.py:199:stop] epoch=0/micro_step=4000/global_step=2000, RunningAvgSamplesPerSec=12.304999928758503, CurrSamplesPerSec=12.715702879906292, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:31:52,481] [INFO] [logging.py:96:log_dist] [Rank 0] step=2010, skipped=38, lr=[0.0001239957573492957], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:31:52,481] [INFO] [timer.py:199:stop] epoch=0/micro_step=4020/global_step=2010, RunningAvgSamplesPerSec=12.305134776463118, CurrSamplesPerSec=12.599764184596898, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:32:17,970] [INFO] [logging.py:96:log_dist] [Rank 0] step=2020, skipped=38, lr=[0.00011997653881024884], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:32:17,971] [INFO] [timer.py:199:stop] epoch=0/micro_step=4040/global_step=2020, RunningAvgSamplesPerSec=12.306405867559443, CurrSamplesPerSec=12.496697285427834, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:32:44,382] [INFO] [logging.py:96:log_dist] [Rank 0] step=2030, skipped=38, lr=[0.0001160146399324833], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:32:44,383] [INFO] [timer.py:199:stop] epoch=0/micro_step=4060/global_step=2030, RunningAvgSamplesPerSec=12.305514093554024, CurrSamplesPerSec=12.39382407379021, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:33:07,621] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:33:10,518] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:33:10,518] [INFO] [logging.py:96:log_dist] [Rank 0] step=2040, skipped=40, lr=[0.00011288679285345288], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:33:10,518] [INFO] [timer.py:199:stop] epoch=0/micro_step=4080/global_step=2040, RunningAvgSamplesPerSec=12.305270434756762, CurrSamplesPerSec=11.058220762964236, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:33:36,786] [INFO] [logging.py:96:log_dist] [Rank 0] step=2050, skipped=40, lr=[0.0001090295694020207], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:33:36,786] [INFO] [timer.py:199:stop] epoch=0/micro_step=4100/global_step=2050, RunningAvgSamplesPerSec=12.304723993392827, CurrSamplesPerSec=12.338405292495112, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:34:02,975] [INFO] [logging.py:96:log_dist] [Rank 0] step=2060, skipped=40, lr=[0.00010523131676408154], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:34:02,975] [INFO] [timer.py:199:stop] epoch=0/micro_step=4120/global_step=2060, RunningAvgSamplesPerSec=12.304364466002099, CurrSamplesPerSec=12.451194702243312, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:34:29,152] [INFO] [logging.py:96:log_dist] [Rank 0] step=2070, skipped=40, lr=[0.00010149260783730319], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:34:29,153] [INFO] [timer.py:199:stop] epoch=0/micro_step=4140/global_step=2070, RunningAvgSamplesPerSec=12.304034186898377, CurrSamplesPerSec=12.411747026891463, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:34:55,549] [INFO] [logging.py:96:log_dist] [Rank 0] step=2080, skipped=40, lr=[9.781400653826244e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:34:55,550] [INFO] [timer.py:199:stop] epoch=0/micro_step=4160/global_step=2080, RunningAvgSamplesPerSec=12.303206574085898, CurrSamplesPerSec=12.411290230109326, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:35:21,383] [INFO] [logging.py:96:log_dist] [Rank 0] step=2090, skipped=40, lr=[9.419606771738853e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:35:21,383] [INFO] [timer.py:199:stop] epoch=0/micro_step=4180/global_step=2090, RunningAvgSamplesPerSec=12.30366604428486, CurrSamplesPerSec=12.406835355771703, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:35:47,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=2100, skipped=40, lr=[9.063933707527306e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:35:47,968] [INFO] [timer.py:199:stop] epoch=0/micro_step=4200/global_step=2100, RunningAvgSamplesPerSec=12.302425094480283, CurrSamplesPerSec=12.383200007011915, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:36:13,793] [INFO] [logging.py:96:log_dist] [Rank 0] step=2110, skipped=40, lr=[8.714435108036234e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:36:13,794] [INFO] [timer.py:199:stop] epoch=0/micro_step=4220/global_step=2110, RunningAvgSamplesPerSec=12.302901308765746, CurrSamplesPerSec=12.434363708237937, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:36:40,308] [INFO] [logging.py:96:log_dist] [Rank 0] step=2120, skipped=40, lr=[8.371163688803967e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:36:40,308] [INFO] [timer.py:199:stop] epoch=0/micro_step=4240/global_step=2120, RunningAvgSamplesPerSec=12.301832484359094, CurrSamplesPerSec=12.420096949094322, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:37:06,496] [INFO] [logging.py:96:log_dist] [Rank 0] step=2130, skipped=40, lr=[8.034171226111403e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:37:06,497] [INFO] [timer.py:199:stop] epoch=0/micro_step=4260/global_step=2130, RunningAvgSamplesPerSec=12.301499367685778, CurrSamplesPerSec=10.856264810636182, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:37:32,691] [INFO] [logging.py:96:log_dist] [Rank 0] step=2140, skipped=40, lr=[7.703508549172528e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:37:32,691] [INFO] [timer.py:199:stop] epoch=0/micro_step=4280/global_step=2140, RunningAvgSamplesPerSec=12.30115622408466, CurrSamplesPerSec=12.401573479735115, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:37:35,215] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:37:37,759] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:37:58,792] [INFO] [logging.py:96:log_dist] [Rank 0] step=2150, skipped=42, lr=[7.443569401286737e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:37:58,792] [INFO] [timer.py:199:stop] epoch=0/micro_step=4300/global_step=2150, RunningAvgSamplesPerSec=12.30102199734571, CurrSamplesPerSec=12.400159610180129, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:38:24,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=2160, skipped=42, lr=[7.124425376007727e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:38:24,948] [INFO] [timer.py:199:stop] epoch=0/micro_step=4320/global_step=2160, RunningAvgSamplesPerSec=12.300768461536611, CurrSamplesPerSec=12.453067368968957, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:38:51,245] [INFO] [logging.py:96:log_dist] [Rank 0] step=2170, skipped=42, lr=[6.811748355178887e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:38:51,245] [INFO] [timer.py:199:stop] epoch=0/micro_step=4340/global_step=2170, RunningAvgSamplesPerSec=12.300209402649584, CurrSamplesPerSec=12.43781132701424, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:39:17,070] [INFO] [logging.py:96:log_dist] [Rank 0] step=2180, skipped=42, lr=[6.505585500469818e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:39:17,071] [INFO] [timer.py:199:stop] epoch=0/micro_step=4360/global_step=2180, RunningAvgSamplesPerSec=12.300679290112301, CurrSamplesPerSec=12.365838971239125, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:39:43,643] [INFO] [logging.py:96:log_dist] [Rank 0] step=2190, skipped=42, lr=[6.205982991006093e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:39:43,643] [INFO] [timer.py:199:stop] epoch=0/micro_step=4380/global_step=2190, RunningAvgSamplesPerSec=12.299530650271372, CurrSamplesPerSec=12.405328558682902, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:40:09,493] [INFO] [logging.py:96:log_dist] [Rank 0] step=2200, skipped=42, lr=[5.912986016403909e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:40:09,494] [INFO] [timer.py:199:stop] epoch=0/micro_step=4400/global_step=2200, RunningAvgSamplesPerSec=12.299945811894498, CurrSamplesPerSec=12.441138616114486, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:40:36,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=2210, skipped=42, lr=[5.6266387699540786e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:40:36,097] [INFO] [timer.py:199:stop] epoch=0/micro_step=4420/global_step=2210, RunningAvgSamplesPerSec=12.298745310472013, CurrSamplesPerSec=12.415833295329127, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:41:02,293] [INFO] [logging.py:96:log_dist] [Rank 0] step=2220, skipped=42, lr=[5.346984441956315e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:41:02,293] [INFO] [timer.py:199:stop] epoch=0/micro_step=4440/global_step=2220, RunningAvgSamplesPerSec=12.298422423831754, CurrSamplesPerSec=12.331312379976227, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:41:28,489] [INFO] [logging.py:96:log_dist] [Rank 0] step=2230, skipped=42, lr=[5.074065213204676e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:41:28,490] [INFO] [timer.py:199:stop] epoch=0/micro_step=4460/global_step=2230, RunningAvgSamplesPerSec=12.298102818031259, CurrSamplesPerSec=12.428839061774376, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:41:54,606] [INFO] [logging.py:96:log_dist] [Rank 0] step=2240, skipped=42, lr=[4.8079222486253736e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:41:54,607] [INFO] [timer.py:199:stop] epoch=0/micro_step=4480/global_step=2240, RunningAvgSamplesPerSec=12.297953704777502, CurrSamplesPerSec=12.434558392459168, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:42:02,610] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:42:05,144] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:42:20,545] [INFO] [logging.py:96:log_dist] [Rank 0] step=2250, skipped=44, lr=[4.599913797658045e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:42:20,546] [INFO] [timer.py:199:stop] epoch=0/micro_step=4500/global_step=2250, RunningAvgSamplesPerSec=12.298179937217506, CurrSamplesPerSec=12.529763874087092, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:42:46,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=2260, skipped=44, lr=[4.346068577864587e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:42:46,846] [INFO] [timer.py:199:stop] epoch=0/micro_step=4520/global_step=2260, RunningAvgSamplesPerSec=12.29764823685467, CurrSamplesPerSec=12.45570692530282, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:43:12,672] [INFO] [logging.py:96:log_dist] [Rank 0] step=2270, skipped=44, lr=[4.099109427360304e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:43:12,672] [INFO] [timer.py:199:stop] epoch=0/micro_step=4540/global_step=2270, RunningAvgSamplesPerSec=12.298109115305559, CurrSamplesPerSec=12.437736408563667, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:43:38,689] [INFO] [logging.py:96:log_dist] [Rank 0] step=2280, skipped=44, lr=[3.859073595463469e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:43:38,690] [INFO] [timer.py:199:stop] epoch=0/micro_step=4560/global_step=2280, RunningAvgSamplesPerSec=12.298170178178243, CurrSamplesPerSec=12.75117692923243, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:44:03,731] [INFO] [logging.py:96:log_dist] [Rank 0] step=2290, skipped=44, lr=[3.6259972872350666e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:44:03,732] [INFO] [timer.py:199:stop] epoch=0/micro_step=4580/global_step=2290, RunningAvgSamplesPerSec=12.300246104187035, CurrSamplesPerSec=12.824455816125798, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:44:29,392] [INFO] [logging.py:96:log_dist] [Rank 0] step=2300, skipped=44, lr=[3.399915658017838e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:44:29,392] [INFO] [timer.py:199:stop] epoch=0/micro_step=4600/global_step=2300, RunningAvgSamplesPerSec=12.301030032165718, CurrSamplesPerSec=12.871611794362979, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:44:54,682] [INFO] [logging.py:96:log_dist] [Rank 0] step=2310, skipped=44, lr=[3.1808628081338496e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:44:54,682] [INFO] [timer.py:199:stop] epoch=0/micro_step=4620/global_step=2310, RunningAvgSamplesPerSec=12.302568709380266, CurrSamplesPerSec=12.835298490770182, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:45:19,976] [INFO] [logging.py:96:log_dist] [Rank 0] step=2320, skipped=44, lr=[2.9688717777409667e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:45:19,976] [INFO] [timer.py:199:stop] epoch=0/micro_step=4640/global_step=2320, RunningAvgSamplesPerSec=12.304085416510999, CurrSamplesPerSec=12.840191724087525, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:45:45,627] [INFO] [logging.py:96:log_dist] [Rank 0] step=2330, skipped=44, lr=[2.7639745418494234e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:45:45,627] [INFO] [timer.py:199:stop] epoch=0/micro_step=4660/global_step=2330, RunningAvgSamplesPerSec=12.304862619859534, CurrSamplesPerSec=11.164547013836554, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:46:10,564] [INFO] [logging.py:96:log_dist] [Rank 0] step=2340, skipped=44, lr=[2.5662020054989298e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:46:10,564] [INFO] [timer.py:199:stop] epoch=0/micro_step=4680/global_step=2340, RunningAvgSamplesPerSec=12.307078782682758, CurrSamplesPerSec=12.811729488667604, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:46:23,358] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:46:25,838] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:46:36,126] [INFO] [logging.py:96:log_dist] [Rank 0] step=2350, skipped=46, lr=[2.413133842345444e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:46:36,126] [INFO] [timer.py:199:stop] epoch=0/micro_step=4700/global_step=2350, RunningAvgSamplesPerSec=12.308017841166846, CurrSamplesPerSec=12.765987834726452, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:47:01,158] [INFO] [logging.py:96:log_dist] [Rank 0] step=2360, skipped=46, lr=[2.2282602127670638e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:47:01,159] [INFO] [timer.py:199:stop] epoch=0/micro_step=4720/global_step=2360, RunningAvgSamplesPerSec=12.310010809005004, CurrSamplesPerSec=12.779279202937284, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:47:27,031] [INFO] [logging.py:96:log_dist] [Rank 0] step=2370, skipped=46, lr=[2.0505920855406757e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:47:27,032] [INFO] [timer.py:199:stop] epoch=0/micro_step=4740/global_step=2370, RunningAvgSamplesPerSec=12.310307228959495, CurrSamplesPerSec=12.691875719261018, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:47:52,702] [INFO] [logging.py:96:log_dist] [Rank 0] step=2380, skipped=46, lr=[1.8801562586877375e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:47:52,702] [INFO] [timer.py:199:stop] epoch=0/micro_step=4760/global_step=2380, RunningAvgSamplesPerSec=12.311004314974385, CurrSamplesPerSec=11.0915473904342, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:48:18,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=2390, skipped=46, lr=[1.7169784393681164e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:48:18,610] [INFO] [timer.py:199:stop] epoch=0/micro_step=4780/global_step=2390, RunningAvgSamplesPerSec=12.311227600222733, CurrSamplesPerSec=12.524098052787979, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:48:44,677] [INFO] [logging.py:96:log_dist] [Rank 0] step=2400, skipped=46, lr=[1.561083240002592e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:48:44,678] [INFO] [timer.py:199:stop] epoch=0/micro_step=4800/global_step=2400, RunningAvgSamplesPerSec=12.311130099979922, CurrSamplesPerSec=12.407947914210096, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:49:10,952] [INFO] [logging.py:96:log_dist] [Rank 0] step=2410, skipped=46, lr=[1.4124941745605024e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:49:10,952] [INFO] [timer.py:199:stop] epoch=0/micro_step=4820/global_step=2410, RunningAvgSamplesPerSec=12.310627790060659, CurrSamplesPerSec=12.370306632484153, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:49:37,487] [INFO] [logging.py:96:log_dist] [Rank 0] step=2420, skipped=46, lr=[1.2712336550131598e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:49:37,487] [INFO] [timer.py:199:stop] epoch=0/micro_step=4840/global_step=2420, RunningAvgSamplesPerSec=12.309618788598172, CurrSamplesPerSec=10.852950064021522, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:50:03,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=2430, skipped=46, lr=[1.1373229879533375e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:50:03,211] [INFO] [timer.py:199:stop] epoch=0/micro_step=4860/global_step=2430, RunningAvgSamplesPerSec=12.310200725412857, CurrSamplesPerSec=12.48376864028326, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:50:29,508] [INFO] [logging.py:96:log_dist] [Rank 0] step=2440, skipped=46, lr=[1.010782371381569e-05], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:50:29,508] [INFO] [timer.py:199:stop] epoch=0/micro_step=4880/global_step=2440, RunningAvgSamplesPerSec=12.309663735759242, CurrSamplesPerSec=12.454816932342066, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:50:47,517] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:50:50,067] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:50:55,269] [INFO] [logging.py:96:log_dist] [Rank 0] step=2450, skipped=48, lr=[9.14869177936145e-06], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:50:55,269] [INFO] [timer.py:199:stop] epoch=0/micro_step=4900/global_step=2450, RunningAvgSamplesPerSec=12.31016926218093, CurrSamplesPerSec=12.382171843136206, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:51:21,646] [INFO] [logging.py:96:log_dist] [Rank 0] step=2460, skipped=48, lr=[8.016420010113156e-06], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:51:21,647] [INFO] [timer.py:199:stop] epoch=0/micro_step=4920/global_step=2460, RunningAvgSamplesPerSec=12.309482308619254, CurrSamplesPerSec=12.48385456463192, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:51:47,716] [INFO] [logging.py:96:log_dist] [Rank 0] step=2470, skipped=48, lr=[6.958355059761279e-06], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:51:47,717] [INFO] [timer.py:199:stop] epoch=0/micro_step=4940/global_step=2470, RunningAvgSamplesPerSec=12.309390904770943, CurrSamplesPerSec=12.388855674772541, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:52:13,820] [INFO] [logging.py:96:log_dist] [Rank 0] step=2480, skipped=48, lr=[5.974656518254129e-06], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:52:13,820] [INFO] [timer.py:199:stop] epoch=0/micro_step=4960/global_step=2480, RunningAvgSamplesPerSec=12.309235890996996, CurrSamplesPerSec=12.403854224243707, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:52:40,021] [INFO] [logging.py:96:log_dist] [Rank 0] step=2490, skipped=48, lr=[5.06547275871333e-06], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:52:40,021] [INFO] [timer.py:199:stop] epoch=0/micro_step=4980/global_step=2490, RunningAvgSamplesPerSec=12.308896336184107, CurrSamplesPerSec=12.31674295094959, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:53:06,296] [INFO] [logging.py:96:log_dist] [Rank 0] step=2500, skipped=48, lr=[4.23094091505416e-06], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:53:06,297] [INFO] [timer.py:199:stop] epoch=0/micro_step=5000/global_step=2500, RunningAvgSamplesPerSec=12.30841818692135, CurrSamplesPerSec=12.305871454185846, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:53:32,845] [INFO] [logging.py:96:log_dist] [Rank 0] step=2510, skipped=48, lr=[3.471186861301545e-06], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:53:32,846] [INFO] [timer.py:199:stop] epoch=0/micro_step=5020/global_step=2510, RunningAvgSamplesPerSec=12.307428673506953, CurrSamplesPerSec=12.437062183120354, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:53:58,554] [INFO] [logging.py:96:log_dist] [Rank 0] step=2520, skipped=48, lr=[2.7863251926040224e-06], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:53:58,554] [INFO] [timer.py:199:stop] epoch=0/micro_step=5040/global_step=2520, RunningAvgSamplesPerSec=12.308026366186626, CurrSamplesPerSec=12.448089466013181, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:54:24,848] [INFO] [logging.py:96:log_dist] [Rank 0] step=2530, skipped=48, lr=[2.1764592079493996e-06], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:54:24,848] [INFO] [timer.py:199:stop] epoch=0/micro_step=5060/global_step=2530, RunningAvgSamplesPerSec=12.307524238014608, CurrSamplesPerSec=12.462064247476162, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:54:50,593] [INFO] [logging.py:96:log_dist] [Rank 0] step=2540, skipped=48, lr=[1.6416808945838302e-06], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:54:50,593] [INFO] [timer.py:199:stop] epoch=0/micro_step=5080/global_step=2540, RunningAvgSamplesPerSec=12.308049471202695, CurrSamplesPerSec=12.400358953137731, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
[2023-04-19 18:55:14,484] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, but hysteresis is 2. Reducing hysteresis to 1 |
|
[2023-04-19 18:55:17,038] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072, reducing to 65536 |
|
[2023-04-19 18:55:17,039] [INFO] [logging.py:96:log_dist] [Rank 0] step=2550, skipped=50, lr=[1.2679759526949552e-06], mom=[(0.9, 0.95)] |
|
[2023-04-19 18:55:17,039] [INFO] [timer.py:199:stop] epoch=0/micro_step=5100/global_step=2550, RunningAvgSamplesPerSec=12.307268386762189, CurrSamplesPerSec=12.540417228668503, MemAllocated=1.95GB, MaxMemAllocated=13.62GB |
|
***** Evaluating perplexity, Epoch 1/1 ***** |
|
Invalidate trace cache @ step 0: expected module 0, but got module 16 |
|
ppl: 1.6646381616592407 |
|
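The reported perplexity is the exponential of the mean token-level cross-entropy loss over the evaluation set, so ppl of about 1.66 corresponds to a mean eval loss of roughly 0.51. A minimal sketch of that computation, assuming a standard Hugging Face causal-LM model and an eval dataloader yielding dicts of tensors (not the exact evaluation code used for this run):

import math
import torch

def evaluate_perplexity(model, eval_dataloader, device):
    # Perplexity = exp(mean cross-entropy loss over the evaluation batches).
    model.eval()
    total_loss, num_batches = 0.0, 0
    with torch.no_grad():
        for batch in eval_dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch.get("attention_mask")
            if attention_mask is not None:
                attention_mask = attention_mask.to(device)
            # For causal LMs, labels=input_ids yields the shifted next-token loss.
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
            total_loss += out.loss.float().item()
            num_batches += 1
    mean_loss = total_loss / max(1, num_batches)
    try:
        return math.exp(mean_loss)
    except OverflowError:
        return float("inf")
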
saving the final model ... |
|
[2023-04-19 19:03:02,394] [INFO] [launch.py:460:main] Process 10813 exits successfully. |
|
[2023-04-19 19:03:02,395] [INFO] [launch.py:460:main] Process 10814 exits successfully. |
|
[2023-04-19 19:03:02,395] [INFO] [launch.py:460:main] Process 10815 exits successfully. |
|
[2023-04-19 19:03:14,408] [INFO] [launch.py:460:main] Process 10812 exits successfully. |
|
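After the final model is saved, the four launcher processes exit cleanly and the run is complete. Assuming the save produced a consolidated Hugging Face-format checkpoint (config, tokenizer files, and weights) in the run's output directory, a quick smoke test of the result might look like the sketch below; the path and the prompt format are placeholders, not taken from this run.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at the directory the final model was saved to.
checkpoint_dir = "path/to/saved_model"

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir).to(device).eval()

# Hypothetical instruction-style prompt; the actual prompt template is not shown in the log.
prompt = "Instruction: List three uses for a paperclip.\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))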