/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/transformers/utils/import_utils.py:575: FutureWarning: `is_torch_tpu_available` is deprecated and will be removed in 4.41.0. Please use the `is_torch_xla_available` instead.
  warnings.warn(
01/04/2025 20:31:15 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, fp16 training: False, bf16 training: True
01/04/2025 20:31:15 - INFO - __main__ - Additional arguments ScriptArguments(model_family='llama', model_name_or_path='/scratch3/workspace/ctpham_umass_edu-ft/models/Llama-3-8B-ProLong-512k-Base/', config_overrides=None, config_overrides_json='', config_name='/scratch3/workspace/ctpham_umass_edu-ft/models/Llama-3-8B-ProLong-512k-Base/', tokenizer_name='/scratch3/workspace/ctpham_umass_edu-ft/models/Llama-3-8B-ProLong-512k-Base/', cache_dir=None, use_fast_tokenizer=False, model_revision='main', use_auth_token=False, tokenized_mds_train=['/work/pi_miyyer_umass_edu/ctpham/BookClaim-dev/data/ft/bookclaim_wp_pack_complete'], tokenized_mds_validation=[], tokenized_mds_test=[], token_scaled_loss=True)
[INFO|tokenization_utils_base.py:2267] 2025-01-04 20:31:15,031 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2267] 2025-01-04 20:31:15,031 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2267] 2025-01-04 20:31:15,032 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2267] 2025-01-04 20:31:15,032 >> loading file tokenizer_config.json
01/04/2025 20:31:15 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, fp16 training: False, bf16 training: True
01/04/2025 20:31:15 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, fp16 training: False, bf16 training: True
01/04/2025 20:31:15 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, fp16 training: False, bf16 training: True
[INFO|tokenization_utils_base.py:2513] 2025-01-04 20:31:15,389 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:731] 2025-01-04 20:31:15,390 >> loading configuration file /scratch3/workspace/ctpham_umass_edu-ft/models/Llama-3-8B-ProLong-512k-Base/config.json
[INFO|configuration_utils.py:800] 2025-01-04 20:31:15,391 >> Model config LlamaConfig {
  "_name_or_path": "/scratch3/workspace/ctpham_umass_edu-ft/models/Llama-3-8B-ProLong-512k-Base/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 524288,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 128000000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.44.2",
  "use_cache": true,
  "vocab_size": 128256
}

01/04/2025 20:31:15 - INFO - __main__ - Loaded tokenizer - CPU Memory usage: 629.00 MB RSS
01/04/2025 20:31:15 - INFO - __main__ - GPU 0: 4/81920 MB
01/04/2025 20:31:15 - INFO - __main__ - GPU 1: 423/81920 MB
01/04/2025 20:31:15 - INFO - __main__ - GPU 2: 423/81920 MB
01/04/2025 20:31:15 - INFO - __main__ - GPU 3: 423/81920 MB
[INFO|modeling_utils.py:3675] 2025-01-04 20:31:15,437 >> loading weights file /scratch3/workspace/ctpham_umass_edu-ft/models/Llama-3-8B-ProLong-512k-Base/model.safetensors.index.json
[INFO|configuration_utils.py:1038] 2025-01-04 20:31:15,439 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "pad_token_id": 0
}

Loading checkpoint shards: 100%|██████████| 7/7 [00:01<00:00,  6.13it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:01<00:00,  6.12it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:01<00:00,  6.10it/s]
Loading checkpoint shards: 100%|██████████| 7/7 [00:01<00:00,  6.23it/s]
[INFO|modeling_utils.py:4507] 2025-01-04 20:31:16,742 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4515] 2025-01-04 20:31:16,742 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /scratch3/workspace/ctpham_umass_edu-ft/models/Llama-3-8B-ProLong-512k-Base/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|modeling_utils.py:4003] 2025-01-04 20:31:16,745 >> Generation config file not found, using a generation config created from the model config.
01/04/2025 20:31:16 - INFO - __main__ - Model: LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
          (distributed_attn_func): DistributedAttention()
          (distributed_varlen_attn_func): DistributedAttention()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
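The module printout above can be cross-checked against the "Number of trainable parameters = 2,007,565,312" line reported in the "Running training" block of this log. The sketch below counts parameters from the printed shapes only. That the reported figure is a per-rank shard, split evenly across the 4 ranks by FSDP, is an assumption that happens to match the numbers, not something the log states.

```python
# Shapes copied from the LlamaConfig and module printout in this log.
VOCAB = 128256
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
HEADS = 32
KV_HEADS = 8
WORLD_SIZE = 4  # ranks 0-3 appear in this log; even sharding is an assumption


def llama_param_count() -> int:
    """Count weights of the printed LlamaForCausalLM (all Linear layers have bias=False)."""
    head_dim = HIDDEN // HEADS                  # 128
    kv_dim = KV_HEADS * head_dim                # 1024, the k_proj/v_proj out_features
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim   # q_proj + o_proj, k_proj + v_proj
    mlp = 3 * HIDDEN * INTERMEDIATE             # gate_proj, up_proj, down_proj
    norms = 2 * HIDDEN                          # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    embeddings = 2 * VOCAB * HIDDEN             # embed_tokens + untied lm_head
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


total = llama_param_count()
per_rank = total // WORLD_SIZE
print(f"total={total:,} per_rank={per_rank:,}")
```

Under these assumptions the full model has 8,030,261,248 parameters, and one quarter of that equals the per-process count printed by the trainer.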
01/04/2025 20:31:16 - INFO - __main__ - Loaded model - CPU Memory usage: 649.12 MB RSS
01/04/2025 20:31:16 - INFO - __main__ - GPU 0: 4/81920 MB
01/04/2025 20:31:16 - INFO - __main__ - GPU 1: 423/81920 MB
01/04/2025 20:31:16 - INFO - __main__ - GPU 2: 423/81920 MB
01/04/2025 20:31:16 - INFO - __main__ - GPU 3: 423/81920 MB
01/04/2025 20:31:18 - WARNING - streaming.base.dataset - Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64. Prior to Streaming v0.7.0, `predownload` defaulted to max(batch_size, 256 * batch_size // num_canonical_nodes).
01/04/2025 20:31:18 - INFO - training.dataset - Loading datasets for training
01/04/2025 20:31:18 - INFO - __main__ - Loaded training dataset
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert at verifying claims from fictional narratives.<|eot_id|><|start_header_id|>user<|end_header_id|>

You are provided with a context and a statement. Your task is to carefully read the context and then determine whether the statement is TRUE or FALSE. 

Answer TRUE if the statement is true in its entirety based on the context provided. 
Answer FALSE if any part of the statement is false based on the context provided.

<context>
“ Come along now Emma.” Said the worried mother to her daughter. Trying to pull her away from the bard sitting on the fence surrounded by children. 
“ But mommy I want to listen to the song.” 
“ Those are not the songs you should be listening to Emma. Now let's go home before your father gets mad.” The mother pulled the young girl away and continued on her path as if nothing happened. 
“ She is right little lady, these songs are not meant for the faint hearted and only those who are trul
gical current, causing extensive trauma and bleeding. Leaving the patient in a catatonic state. 
“ What of the parents” asked the woman, her lips a compressed into a thin white line, as she laid her hand on the child's forehead. 
“ The Obliviators are with them now. They’ re making the necessary changes for them to forget Magical Law Enforcement and the time period the patient was missing. The parents will wake up tomorrow to find in its bed the cadaver. To them it will look like a brain aneury-“ 
“ Cadaver?” interrupted the woman with a start, her head turning to fix the healer in a steely glare.“ As in corpse? I don’ t follow. The child is clearly still with us” returning her hand to the child's forehead. 
The smile finally came off the face of the third figure, reverting to its usual scowl.“ Leave us now, I will handle this” he said to the Healer, without even a glance at him. 
With a look of weariness mixed with relief the healer exited quickly, closing the door behind him. 
“ I’ m
[INFO|trainer.py:648] 2025-01-04 20:31:19,354 >> Using auto half precision backend
[INFO|trainer.py:262] 2025-01-04 20:31:19,354 >> Using world as sequence parallel group
01/04/2025 20:31:19 - INFO - __main__ - Trainer created
01/04/2025 20:31:19 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.95,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
cuda_empty_cache=True,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=1,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=True,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=no,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[<FSDPOption.AUTO_WRAP: 'auto_wrap'>, <FSDPOption.OFFLOAD: 'offload'>],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=1e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/runs/Jan04_20-31-13_gpu016,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
min_lr_ratio=0.1,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1.0,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=/scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['wandb'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=100,
save_strategy=steps,
save_total_limit=None,
seed=42,
seq_parallel_size=4,
skip_memory_metrics=True,
split_batches=None,
streaming_dataset=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.05,
warmup_steps=0,
weight_decay=0.1,
)
01/04/2025 20:31:19 - INFO - __main__ - *** Train ***
[WARNING|trainer.py:796] 2025-01-04 20:31:19,481 >> Use streaming dataloader for train
[INFO|trainer.py:2134] 2025-01-04 20:33:12,656 >> ***** Running training *****
[INFO|trainer.py:2135] 2025-01-04 20:33:12,656 >>   Num examples = 3,624
[INFO|trainer.py:2136] 2025-01-04 20:33:12,656 >>   Num Epochs = 1
[INFO|trainer.py:2137] 2025-01-04 20:33:12,656 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2140] 2025-01-04 20:33:12,656 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2141] 2025-01-04 20:33:12,656 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:2142] 2025-01-04 20:33:12,656 >>   Total optimization steps = 906
[INFO|trainer.py:2143] 2025-01-04 20:33:12,657 >>   Number of trainable parameters = 2,007,565,312
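The step count above does not follow from dividing 3,624 examples by the logged total batch size of 16. One reading consistent with the numbers is that the 4 ranks form a single sequence-parallel group (seq_parallel_size=4), so they jointly process one sequence per micro-step. The sketch below works through that arithmetic. The function name and the data-parallel interpretation are assumptions made for illustration, not the trainer's own code.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch: int, grad_accum: int,
                    world_size: int, seq_parallel_size: int) -> int:
    """Optimizer steps per epoch when ranks in a sequence-parallel group share a sequence."""
    data_parallel = world_size // seq_parallel_size          # 4 // 4 = 1 replica
    sequences_per_step = per_device_batch * data_parallel * grad_accum
    return math.ceil(num_examples / sequences_per_step)


# Values taken from the "Running training" block and TrainingArguments in this log.
steps = optimizer_steps(num_examples=3624, per_device_batch=1, grad_accum=4,
                        world_size=4, seq_parallel_size=4)
logged_total_batch = 1 * 4 * 4   # per-device batch * ranks * accumulation, as logged
print(steps, logged_total_batch)  # 906 16
```

The division is exact here (3624 / 4 = 906), so the choice of rounding does not affect the result.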
[INFO|integration_utils.py:807] 2025-01-04 20:33:12,658 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
01/04/2025 20:33:12 - WARNING - streaming.base.dataset - Because `num_canonical_nodes` was not specified, and `shuffle_algo` is py1e, it will default to be equal to physical nodes. Prior to Streaming v0.7.0, `num_canonical_nodes` defaulted to 64 * physical nodes.
01/04/2025 20:33:12 - WARNING - streaming.base.dataset - Because `shuffle_block_size` was not specified, it will default to max(4_000_000 // num_canonical_nodes, 1 << 18) if num_canonical_nodes is not None, otherwise 262144. Prior to Streaming v0.7.0, `shuffle_block_size` defaulted to 262144.
wandb: Currently logged in as: chtmp223. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.19.1 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/wandb/run-20250104_203314-95wq5z4x
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_
wandb: ⭐️ View project at https://wandb.ai/chtmp223/prolong
wandb: 🚀 View run at https://wandb.ai/chtmp223/prolong/runs/95wq5z4x
01/04/2025 20:33:22 - WARNING - streaming.base.dataset - The `replication` arg has been set to 4 and training is resuming from sample 0. Make sure you are accounting for sample replication when using StreamingDataset's `state_dict` method for deterministic resumption. Otherwise, you will resume training from the wrong sample.
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-04 20:33:57,685 >> {'loss': 2.1083, 'grad_norm': 35.722469329833984, 'learning_rate': 2.173913043478261e-07, 'epoch': 0.0011037527593818985, 'num_input_tokens_seen': 65536, 'completed': '0.11% (1 / 906)', 'remaining time': '8:47:59', 'throughput': '351.04', 'gpu_mem_free': '30111MB'}
[INFO|trainer.py:175] 2025-01-04 20:34:12,077 >> {'loss': 1.3569, 'grad_norm': 25.2639102935791, 'learning_rate': 4.347826086956522e-07, 'epoch': 0.002207505518763797, 'num_input_tokens_seen': 131072, 'completed': '0.22% (2 / 906)', 'remaining time': '6:12:07', 'throughput': '1138.41', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:34:26,282 >> {'loss': 2.3622, 'grad_norm': 31.971759796142578, 'learning_rate': 6.521739130434783e-07, 'epoch': 0.0033112582781456954, 'num_input_tokens_seen': 196608, 'completed': '0.33% (3 / 906)', 'remaining time': '5:19:04', 'throughput': '1153.39', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:34:40,606 >> {'loss': 1.1356, 'grad_norm': 17.20412826538086, 'learning_rate': 8.695652173913044e-07, 'epoch': 0.004415011037527594, 'num_input_tokens_seen': 262144, 'completed': '0.44% (4 / 906)', 'remaining time': '4:52:52', 'throughput': '1143.82', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:34:54,545 >> {'loss': 1.2981, 'grad_norm': 23.274988174438477, 'learning_rate': 1.0869565217391306e-06, 'epoch': 0.005518763796909493, 'num_input_tokens_seen': 327680, 'completed': '0.55% (5 / 906)', 'remaining time': '4:35:53', 'throughput': '1175.42', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:35:08,564 >> {'loss': 0.5643, 'grad_norm': 10.878853797912598, 'learning_rate': 1.3043478260869566e-06, 'epoch': 0.006622516556291391, 'num_input_tokens_seen': 393216, 'completed': '0.66% (6 / 906)', 'remaining time': '4:24:42', 'throughput': '1168.66', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:35:22,859 >> {'loss': 0.6975, 'grad_norm': 13.176485061645508, 'learning_rate': 1.521739130434783e-06, 'epoch': 0.00772626931567329, 'num_input_tokens_seen': 458752, 'completed': '0.77% (7 / 906)', 'remaining time': '4:17:14', 'throughput': '1146.17', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:35:37,012 >> {'loss': 0.9322, 'grad_norm': 16.82210350036621, 'learning_rate': 1.7391304347826088e-06, 'epoch': 0.008830022075055188, 'num_input_tokens_seen': 524288, 'completed': '0.88% (8 / 906)', 'remaining time': '4:11:18', 'throughput': '1157.61', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:35:51,057 >> {'loss': 0.7018, 'grad_norm': 12.083537101745605, 'learning_rate': 1.956521739130435e-06, 'epoch': 0.009933774834437087, 'num_input_tokens_seen': 589824, 'completed': '0.99% (9 / 906)', 'remaining time': '4:06:28', 'throughput': '1166.49', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:36:05,471 >> {'loss': 0.5493, 'grad_norm': 11.75355052947998, 'learning_rate': 2.173913043478261e-06, 'epoch': 0.011037527593818985, 'num_input_tokens_seen': 655360, 'completed': '1.10% (10 / 906)', 'remaining time': '4:03:06', 'throughput': '1136.68', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:36:19,741 >> {'loss': 0.7736, 'grad_norm': 18.269977569580078, 'learning_rate': 2.391304347826087e-06, 'epoch': 0.012141280353200883, 'num_input_tokens_seen': 720896, 'completed': '1.21% (11 / 906)', 'remaining time': '4:00:06', 'throughput': '1148.14', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:36:33,895 >> {'loss': 0.7684, 'grad_norm': 12.028225898742676, 'learning_rate': 2.6086956521739132e-06, 'epoch': 0.013245033112582781, 'num_input_tokens_seen': 786432, 'completed': '1.32% (12 / 906)', 'remaining time': '3:57:25', 'throughput': '1157.56', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:36:48,470 >> {'loss': 0.8609, 'grad_norm': 12.417435646057129, 'learning_rate': 2.8260869565217393e-06, 'epoch': 0.01434878587196468, 'num_input_tokens_seen': 851968, 'completed': '1.43% (13 / 906)', 'remaining time': '3:55:36', 'throughput': '1124.13', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:37:03,067 >> {'loss': 0.3993, 'grad_norm': 9.14816665649414, 'learning_rate': 3.043478260869566e-06, 'epoch': 0.01545253863134658, 'num_input_tokens_seen': 917504, 'completed': '1.55% (14 / 906)', 'remaining time': '3:54:01', 'throughput': '1122.44', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:37:17,424 >> {'loss': 0.8576, 'grad_norm': 11.77620792388916, 'learning_rate': 3.2608695652173914e-06, 'epoch': 0.016556291390728478, 'num_input_tokens_seen': 983040, 'completed': '1.66% (15 / 906)', 'remaining time': '3:52:23', 'throughput': '1141.22', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:37:32,043 >> {'loss': 0.2984, 'grad_norm': 6.811140060424805, 'learning_rate': 3.4782608695652175e-06, 'epoch': 0.017660044150110375, 'num_input_tokens_seen': 1048576, 'completed': '1.77% (16 / 906)', 'remaining time': '3:51:10', 'throughput': '1120.72', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:37:46,522 >> {'loss': 0.784, 'grad_norm': 10.383139610290527, 'learning_rate': 3.6956521739130436e-06, 'epoch': 0.018763796909492272, 'num_input_tokens_seen': 1114112, 'completed': '1.88% (17 / 906)', 'remaining time': '3:49:57', 'throughput': '1131.60', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:38:01,137 >> {'loss': 0.5068, 'grad_norm': 8.979279518127441, 'learning_rate': 3.91304347826087e-06, 'epoch': 0.019867549668874173, 'num_input_tokens_seen': 1179648, 'completed': '1.99% (18 / 906)', 'remaining time': '3:48:57', 'throughput': '1121.00', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:38:15,681 >> {'loss': 1.2048, 'grad_norm': 19.544546127319336, 'learning_rate': 4.130434782608696e-06, 'epoch': 0.02097130242825607, 'num_input_tokens_seen': 1245184, 'completed': '2.10% (19 / 906)', 'remaining time': '3:47:58', 'throughput': '1126.55', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:38:30,509 >> {'loss': 0.4331, 'grad_norm': 8.711148262023926, 'learning_rate': 4.347826086956522e-06, 'epoch': 0.02207505518763797, 'num_input_tokens_seen': 1310720, 'completed': '2.21% (20 / 906)', 'remaining time': '3:47:16', 'throughput': '1104.89', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:38:45,252 >> {'loss': 0.6358, 'grad_norm': 9.380172729492188, 'learning_rate': 4.565217391304348e-06, 'epoch': 0.023178807947019868, 'num_input_tokens_seen': 1376256, 'completed': '2.32% (21 / 906)', 'remaining time': '3:46:34', 'throughput': '1111.32', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:38:59,930 >> {'loss': 0.7366, 'grad_norm': 9.575282096862793, 'learning_rate': 4.782608695652174e-06, 'epoch': 0.024282560706401765, 'num_input_tokens_seen': 1441788, 'completed': '2.43% (22 / 906)', 'remaining time': '3:45:51', 'throughput': '1116.18', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:39:14,833 >> {'loss': 0.3091, 'grad_norm': 5.655991554260254, 'learning_rate': 5e-06, 'epoch': 0.025386313465783666, 'num_input_tokens_seen': 1507324, 'completed': '2.54% (23 / 906)', 'remaining time': '3:45:19', 'throughput': '1099.38', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:39:29,672 >> {'loss': 0.7247, 'grad_norm': 8.588181495666504, 'learning_rate': 5.2173913043478265e-06, 'epoch': 0.026490066225165563, 'num_input_tokens_seen': 1572860, 'completed': '2.65% (24 / 906)', 'remaining time': '3:44:46', 'throughput': '1104.06', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:39:44,656 >> {'loss': 0.8751, 'grad_norm': 9.483624458312988, 'learning_rate': 5.4347826086956525e-06, 'epoch': 0.02759381898454746, 'num_input_tokens_seen': 1638396, 'completed': '2.76% (25 / 906)', 'remaining time': '3:44:20', 'throughput': '1093.47', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:39:59,855 >> {'loss': 0.5548, 'grad_norm': 7.005770683288574, 'learning_rate': 5.652173913043479e-06, 'epoch': 0.02869757174392936, 'num_input_tokens_seen': 1703932, 'completed': '2.87% (26 / 906)', 'remaining time': '3:44:02', 'throughput': '1077.95', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:40:14,778 >> {'loss': 0.8409, 'grad_norm': 8.58704662322998, 'learning_rate': 5.8695652173913055e-06, 'epoch': 0.029801324503311258, 'num_input_tokens_seen': 1769468, 'completed': '2.98% (27 / 906)', 'remaining time': '3:43:36', 'throughput': '1097.93', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:40:29,823 >> {'loss': 0.469, 'grad_norm': 6.881068706512451, 'learning_rate': 6.086956521739132e-06, 'epoch': 0.03090507726269316, 'num_input_tokens_seen': 1835004, 'completed': '3.09% (28 / 906)', 'remaining time': '3:43:13', 'throughput': '1088.98', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:40:44,855 >> {'loss': 0.4414, 'grad_norm': 7.618829250335693, 'learning_rate': 6.304347826086958e-06, 'epoch': 0.03200883002207505, 'num_input_tokens_seen': 1900540, 'completed': '3.20% (29 / 906)', 'remaining time': '3:42:51', 'throughput': '1089.96', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:40:59,898 >> {'loss': 0.7077, 'grad_norm': 9.56350326538086, 'learning_rate': 6.521739130434783e-06, 'epoch': 0.033112582781456956, 'num_input_tokens_seen': 1966076, 'completed': '3.31% (30 / 906)', 'remaining time': '3:42:30', 'throughput': '1089.11', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:41:14,978 >> {'loss': 0.883, 'grad_norm': 11.033548355102539, 'learning_rate': 6.739130434782609e-06, 'epoch': 0.03421633554083885, 'num_input_tokens_seen': 2031612, 'completed': '3.42% (31 / 906)', 'remaining time': '3:42:10', 'throughput': '1086.48', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:41:30,002 >> {'loss': 0.6193, 'grad_norm': 8.739025115966797, 'learning_rate': 6.956521739130435e-06, 'epoch': 0.03532008830022075, 'num_input_tokens_seen': 2097148, 'completed': '3.53% (32 / 906)', 'remaining time': '3:41:49', 'throughput': '1090.55', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:41:45,053 >> {'loss': 0.5358, 'grad_norm': 8.231757164001465, 'learning_rate': 7.173913043478261e-06, 'epoch': 0.03642384105960265, 'num_input_tokens_seen': 2162684, 'completed': '3.64% (33 / 906)', 'remaining time': '3:41:30', 'throughput': '1088.53', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:42:00,054 >> {'loss': 0.6193, 'grad_norm': 7.457165241241455, 'learning_rate': 7.391304347826087e-06, 'epoch': 0.037527593818984545, 'num_input_tokens_seen': 2228220, 'completed': '3.75% (34 / 906)', 'remaining time': '3:41:09', 'throughput': '1092.22', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:42:15,286 >> {'loss': 0.5043, 'grad_norm': 7.066882133483887, 'learning_rate': 7.608695652173914e-06, 'epoch': 0.03863134657836645, 'num_input_tokens_seen': 2293756, 'completed': '3.86% (35 / 906)', 'remaining time': '3:40:54', 'throughput': '1075.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:42:30,431 >> {'loss': 0.5171, 'grad_norm': 7.486729621887207, 'learning_rate': 7.82608695652174e-06, 'epoch': 0.039735099337748346, 'num_input_tokens_seen': 2359292, 'completed': '3.97% (36 / 906)', 'remaining time': '3:40:37', 'throughput': '1081.77', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:42:45,477 >> {'loss': 0.6266, 'grad_norm': 7.478753089904785, 'learning_rate': 8.043478260869566e-06, 'epoch': 0.04083885209713024, 'num_input_tokens_seen': 2424828, 'completed': '4.08% (37 / 906)', 'remaining time': '3:40:18', 'throughput': '1088.95', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:43:00,393 >> {'loss': 0.9509, 'grad_norm': 9.259384155273438, 'learning_rate': 8.260869565217392e-06, 'epoch': 0.04194260485651214, 'num_input_tokens_seen': 2490364, 'completed': '4.19% (38 / 906)', 'remaining time': '3:39:56', 'throughput': '1098.39', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:43:15,693 >> {'loss': 0.5386, 'grad_norm': 6.441883087158203, 'learning_rate': 8.478260869565218e-06, 'epoch': 0.04304635761589404, 'num_input_tokens_seen': 2555900, 'completed': '4.30% (39 / 906)', 'remaining time': '3:39:43', 'throughput': '1070.84', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:43:30,984 >> {'loss': 0.8399, 'grad_norm': 8.299201011657715, 'learning_rate': 8.695652173913044e-06, 'epoch': 0.04415011037527594, 'num_input_tokens_seen': 2621436, 'completed': '4.42% (40 / 906)', 'remaining time': '3:39:29', 'throughput': '1071.47', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:43:46,240 >> {'loss': 0.6116, 'grad_norm': 6.8729095458984375, 'learning_rate': 8.91304347826087e-06, 'epoch': 0.04525386313465784, 'num_input_tokens_seen': 2686972, 'completed': '4.53% (41 / 906)', 'remaining time': '3:39:15', 'throughput': '1074.00', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:44:01,461 >> {'loss': 0.5301, 'grad_norm': 6.9387311935424805, 'learning_rate': 9.130434782608697e-06, 'epoch': 0.046357615894039736, 'num_input_tokens_seen': 2752508, 'completed': '4.64% (42 / 906)', 'remaining time': '3:39:00', 'throughput': '1076.39', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:44:16,738 >> {'loss': 0.513, 'grad_norm': 6.831243991851807, 'learning_rate': 9.347826086956523e-06, 'epoch': 0.04746136865342163, 'num_input_tokens_seen': 2818044, 'completed': '4.75% (43 / 906)', 'remaining time': '3:38:46', 'throughput': '1072.44', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:44:31,788 >> {'loss': 1.0538, 'grad_norm': 9.532112121582031, 'learning_rate': 9.565217391304349e-06, 'epoch': 0.04856512141280353, 'num_input_tokens_seen': 2883580, 'completed': '4.86% (44 / 906)', 'remaining time': '3:38:28', 'throughput': '1088.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:44:47,093 >> {'loss': 0.5253, 'grad_norm': 8.001193046569824, 'learning_rate': 9.782608695652175e-06, 'epoch': 0.04966887417218543, 'num_input_tokens_seen': 2949116, 'completed': '4.97% (45 / 906)', 'remaining time': '3:38:15', 'throughput': '1070.49', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:45:02,337 >> {'loss': 0.5761, 'grad_norm': 8.312308311462402, 'learning_rate': 1e-05, 'epoch': 0.05077262693156733, 'num_input_tokens_seen': 3014652, 'completed': '5.08% (46 / 906)', 'remaining time': '3:38:00', 'throughput': '1074.78', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:45:17,758 >> {'loss': 0.5442, 'grad_norm': 6.9460859298706055, 'learning_rate': 9.999969974871272e-06, 'epoch': 0.05187637969094923, 'num_input_tokens_seen': 3080188, 'completed': '5.19% (47 / 906)', 'remaining time': '3:37:49', 'throughput': '1062.42', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:45:33,038 >> {'loss': 0.7561, 'grad_norm': 8.236810684204102, 'learning_rate': 9.999879899885757e-06, 'epoch': 0.052980132450331126, 'num_input_tokens_seen': 3145724, 'completed': '5.30% (48 / 906)', 'remaining time': '3:37:35', 'throughput': '1072.25', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:45:48,235 >> {'loss': 0.4112, 'grad_norm': 5.9857964515686035, 'learning_rate': 9.99972977624546e-06, 'epoch': 0.05408388520971302, 'num_input_tokens_seen': 3211260, 'completed': '5.41% (49 / 906)', 'remaining time': '3:37:19', 'throughput': '1078.10', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:46:03,481 >> {'loss': 0.7197, 'grad_norm': 9.374804496765137, 'learning_rate': 9.999519605953706e-06, 'epoch': 0.05518763796909492, 'num_input_tokens_seen': 3276796, 'completed': '5.52% (50 / 906)', 'remaining time': '3:37:04', 'throughput': '1074.69', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:46:18,872 >> {'loss': 1.0819, 'grad_norm': 11.54023551940918, 'learning_rate': 9.999249391815115e-06, 'epoch': 0.056291390728476824, 'num_input_tokens_seen': 3342332, 'completed': '5.63% (51 / 906)', 'remaining time': '3:36:52', 'throughput': '1064.49', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:46:34,394 >> {'loss': 0.4057, 'grad_norm': 5.593741416931152, 'learning_rate': 9.998919137435558e-06, 'epoch': 0.05739514348785872, 'num_input_tokens_seen': 3407868, 'completed': '5.74% (52 / 906)', 'remaining time': '3:36:42', 'throughput': '1055.54', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:46:48,953 >> {'loss': 0.415, 'grad_norm': 5.5571489334106445, 'learning_rate': 9.998528847222116e-06, 'epoch': 0.05849889624724062, 'num_input_tokens_seen': 3473404, 'completed': '5.85% (53 / 906)', 'remaining time': '3:36:16', 'throughput': '1125.39', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:47:03,285 >> {'loss': 0.6491, 'grad_norm': 7.452723503112793, 'learning_rate': 9.998078526383018e-06, 'epoch': 0.059602649006622516, 'num_input_tokens_seen': 3538940, 'completed': '5.96% (54 / 906)', 'remaining time': '3:35:47', 'throughput': '1143.17', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:47:17,494 >> {'loss': 0.3988, 'grad_norm': 5.862286567687988, 'learning_rate': 9.99756818092757e-06, 'epoch': 0.06070640176600441, 'num_input_tokens_seen': 3604476, 'completed': '6.07% (55 / 906)', 'remaining time': '3:35:16', 'throughput': '1153.05', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:47:31,557 >> {'loss': 0.4672, 'grad_norm': 5.775778770446777, 'learning_rate': 9.996997817666077e-06, 'epoch': 0.06181015452538632, 'num_input_tokens_seen': 3670012, 'completed': '6.18% (56 / 906)', 'remaining time': '3:34:44', 'throughput': '1165.01', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:47:45,771 >> {'loss': 0.3624, 'grad_norm': 5.253780841827393, 'learning_rate': 9.996367444209756e-06, 'epoch': 0.06291390728476821, 'num_input_tokens_seen': 3735548, 'completed': '6.29% (57 / 906)', 'remaining time': '3:34:15', 'throughput': '1152.69', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:47:59,945 >> {'loss': 0.7668, 'grad_norm': 7.225327968597412, 'learning_rate': 9.995677068970624e-06, 'epoch': 0.0640176600441501, 'num_input_tokens_seen': 3801084, 'completed': '6.40% (58 / 906)', 'remaining time': '3:33:46', 'throughput': '1155.97', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:48:14,028 >> {'loss': 0.5316, 'grad_norm': 6.573258876800537, 'learning_rate': 9.994926701161394e-06, 'epoch': 0.06512141280353201, 'num_input_tokens_seen': 3866620, 'completed': '6.51% (59 / 906)', 'remaining time': '3:33:16', 'throughput': '1163.36', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:48:28,174 >> {'loss': 0.8709, 'grad_norm': 7.839727878570557, 'learning_rate': 9.99411635079535e-06, 'epoch': 0.06622516556291391, 'num_input_tokens_seen': 3932156, 'completed': '6.62% (60 / 906)', 'remaining time': '3:32:47', 'throughput': '1158.17', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:48:42,382 >> {'loss': 0.6911, 'grad_norm': 7.096211910247803, 'learning_rate': 9.993246028686216e-06, 'epoch': 0.0673289183222958, 'num_input_tokens_seen': 3997692, 'completed': '6.73% (61 / 906)', 'remaining time': '3:32:20', 'throughput': '1153.22', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:48:56,438 >> {'loss': 0.8105, 'grad_norm': 8.092432022094727, 'learning_rate': 9.992315746448009e-06, 'epoch': 0.0684326710816777, 'num_input_tokens_seen': 4063228, 'completed': '6.84% (62 / 906)', 'remaining time': '3:31:51', 'throughput': '1165.61', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:49:10,732 >> {'loss': 0.5278, 'grad_norm': 6.5860490798950195, 'learning_rate': 9.991325516494876e-06, 'epoch': 0.0695364238410596, 'num_input_tokens_seen': 4128764, 'completed': '6.95% (63 / 906)', 'remaining time': '3:31:25', 'throughput': '1146.17', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:49:24,969 >> {'loss': 0.6808, 'grad_norm': 7.127208709716797, 'learning_rate': 9.990275352040943e-06, 'epoch': 0.0706401766004415, 'num_input_tokens_seen': 4194300, 'completed': '7.06% (64 / 906)', 'remaining time': '3:31:00', 'throughput': '1150.80', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:49:39,109 >> {'loss': 0.7354, 'grad_norm': 7.488225936889648, 'learning_rate': 9.989165267100137e-06, 'epoch': 0.0717439293598234, 'num_input_tokens_seen': 4259836, 'completed': '7.17% (65 / 906)', 'remaining time': '3:30:33', 'throughput': '1158.75', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:49:53,428 >> {'loss': 0.6493, 'grad_norm': 6.8002800941467285, 'learning_rate': 9.987995276485984e-06, 'epoch': 0.0728476821192053, 'num_input_tokens_seen': 4325372, 'completed': '7.28% (66 / 906)', 'remaining time': '3:30:09', 'throughput': '1144.21', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:50:07,719 >> {'loss': 0.4735, 'grad_norm': 6.0314459800720215, 'learning_rate': 9.986765395811425e-06, 'epoch': 0.0739514348785872, 'num_input_tokens_seen': 4390908, 'completed': '7.40% (67 / 906)', 'remaining time': '3:29:45', 'throughput': '1146.46', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:50:22,492 >> {'loss': 0.8006, 'grad_norm': 8.056852340698242, 'learning_rate': 9.985475641488608e-06, 'epoch': 0.07505518763796909, 'num_input_tokens_seen': 4456444, 'completed': '7.51% (68 / 906)', 'remaining time': '3:29:27', 'throughput': '1109.03', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:50:37,902 >> {'loss': 0.7747, 'grad_norm': 7.1111741065979, 'learning_rate': 9.984126030728659e-06, 'epoch': 0.076158940397351, 'num_input_tokens_seen': 4521980, 'completed': '7.62% (69 / 906)', 'remaining time': '3:29:17', 'throughput': '1063.23', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:50:53,509 >> {'loss': 0.4753, 'grad_norm': 6.619105339050293, 'learning_rate': 9.982716581541462e-06, 'epoch': 0.0772626931567329, 'num_input_tokens_seen': 4587516, 'completed': '7.73% (70 / 906)', 'remaining time': '3:29:09', 'throughput': '1049.80', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:51:09,042 >> {'loss': 0.567, 'grad_norm': 5.534825801849365, 'learning_rate': 9.981247312735412e-06, 'epoch': 0.07836644591611479, 'num_input_tokens_seen': 4653052, 'completed': '7.84% (71 / 906)', 'remaining time': '3:29:01', 'throughput': '1054.73', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:51:24,533 >> {'loss': 0.496, 'grad_norm': 5.347917556762695, 'learning_rate': 9.979718243917172e-06, 'epoch': 0.07947019867549669, 'num_input_tokens_seen': 4718588, 'completed': '7.95% (72 / 906)', 'remaining time': '3:28:51', 'throughput': '1057.66', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:51:40,018 >> {'loss': 0.4872, 'grad_norm': 6.386154651641846, 'learning_rate': 9.978129395491402e-06, 'epoch': 0.08057395143487858, 'num_input_tokens_seen': 4784124, 'completed': '8.06% (73 / 906)', 'remaining time': '3:28:41', 'throughput': '1058.04', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:51:55,437 >> {'loss': 0.5869, 'grad_norm': 6.805239200592041, 'learning_rate': 9.976480788660494e-06, 'epoch': 0.08167770419426049, 'num_input_tokens_seen': 4849660, 'completed': '8.17% (74 / 906)', 'remaining time': '3:28:31', 'throughput': '1062.60', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:52:11,181 >> {'loss': 0.4738, 'grad_norm': 6.585273265838623, 'learning_rate': 9.974772445424283e-06, 'epoch': 0.08278145695364239, 'num_input_tokens_seen': 4915196, 'completed': '8.28% (75 / 906)', 'remaining time': '3:28:23', 'throughput': '1040.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:52:26,537 >> {'loss': 0.4763, 'grad_norm': 6.134422779083252, 'learning_rate': 9.973004388579758e-06, 'epoch': 0.08388520971302428, 'num_input_tokens_seen': 4980732, 'completed': '8.39% (76 / 906)', 'remaining time': '3:28:12', 'throughput': '1066.93', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:52:41,975 >> {'loss': 0.6983, 'grad_norm': 7.285155296325684, 'learning_rate': 9.971176641720756e-06, 'epoch': 0.08498896247240618, 'num_input_tokens_seen': 5046268, 'completed': '8.50% (77 / 906)', 'remaining time': '3:28:01', 'throughput': '1061.29', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:52:57,337 >> {'loss': 0.7429, 'grad_norm': 7.901010513305664, 'learning_rate': 9.96928922923765e-06, 'epoch': 0.08609271523178808, 'num_input_tokens_seen': 5111804, 'completed': '8.61% (78 / 906)', 'remaining time': '3:27:49', 'throughput': '1066.54', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:53:12,781 >> {'loss': 0.4834, 'grad_norm': 6.154994964599609, 'learning_rate': 9.967342176317018e-06, 'epoch': 0.08719646799116998, 'num_input_tokens_seen': 5177340, 'completed': '8.72% (79 / 906)', 'remaining time': '3:27:38', 'throughput': '1060.85', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:53:28,082 >> {'loss': 0.655, 'grad_norm': 7.007679462432861, 'learning_rate': 9.96533550894131e-06, 'epoch': 0.08830022075055188, 'num_input_tokens_seen': 5242876, 'completed': '8.83% (80 / 906)', 'remaining time': '3:27:25', 'throughput': '1070.79', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:53:43,287 >> {'loss': 0.7023, 'grad_norm': 6.794103622436523, 'learning_rate': 9.963269253888504e-06, 'epoch': 0.08940397350993377, 'num_input_tokens_seen': 5308412, 'completed': '8.94% (81 / 906)', 'remaining time': '3:27:12', 'throughput': '1077.54', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:53:58,435 >> {'loss': 0.8764, 'grad_norm': 7.166098117828369, 'learning_rate': 9.961143438731741e-06, 'epoch': 0.09050772626931568, 'num_input_tokens_seen': 5373948, 'completed': '9.05% (82 / 906)', 'remaining time': '3:26:57', 'throughput': '1081.58', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:54:14,084 >> {'loss': 0.5542, 'grad_norm': 6.217060565948486, 'learning_rate': 9.958958091838969e-06, 'epoch': 0.09161147902869757, 'num_input_tokens_seen': 5439484, 'completed': '9.16% (83 / 906)', 'remaining time': '3:26:48', 'throughput': '1046.95', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:54:29,743 >> {'loss': 0.6756, 'grad_norm': 6.699713706970215, 'learning_rate': 9.95671324237255e-06, 'epoch': 0.09271523178807947, 'num_input_tokens_seen': 5505020, 'completed': '9.27% (84 / 906)', 'remaining time': '3:26:39', 'throughput': '1046.36', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:54:45,516 >> {'loss': 0.4274, 'grad_norm': 5.587467670440674, 'learning_rate': 9.954408920288884e-06, 'epoch': 0.09381898454746136, 'num_input_tokens_seen': 5570556, 'completed': '9.38% (85 / 906)', 'remaining time': '3:26:30', 'throughput': '1038.73', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:55:01,305 >> {'loss': 0.3499, 'grad_norm': 5.275640964508057, 'learning_rate': 9.952045156337998e-06, 'epoch': 0.09492273730684327, 'num_input_tokens_seen': 5636092, 'completed': '9.49% (86 / 906)', 'remaining time': '3:26:22', 'throughput': '1037.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:55:17,037 >> {'loss': 0.657, 'grad_norm': 7.246316432952881, 'learning_rate': 9.949621982063145e-06, 'epoch': 0.09602649006622517, 'num_input_tokens_seen': 5701628, 'completed': '9.60% (87 / 906)', 'remaining time': '3:26:13', 'throughput': '1041.46', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:55:32,954 >> {'loss': 0.5475, 'grad_norm': 6.412764549255371, 'learning_rate': 9.947139429800377e-06, 'epoch': 0.09713024282560706, 'num_input_tokens_seen': 5767164, 'completed': '9.71% (88 / 906)', 'remaining time': '3:26:05', 'throughput': '1029.34', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:55:48,873 >> {'loss': 0.7146, 'grad_norm': 7.114440441131592, 'learning_rate': 9.94459753267812e-06, 'epoch': 0.09823399558498896, 'num_input_tokens_seen': 5832700, 'completed': '9.82% (89 / 906)', 'remaining time': '3:25:57', 'throughput': '1029.20', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:56:04,750 >> {'loss': 0.6916, 'grad_norm': 7.031120300292969, 'learning_rate': 9.941996324616723e-06, 'epoch': 0.09933774834437085, 'num_input_tokens_seen': 5898236, 'completed': '9.93% (90 / 906)', 'remaining time': '3:25:49', 'throughput': '1031.96', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:56:20,278 >> {'loss': 0.4196, 'grad_norm': 4.999560356140137, 'learning_rate': 9.939335840328011e-06, 'epoch': 0.10044150110375276, 'num_input_tokens_seen': 5963772, 'completed': '10.04% (91 / 906)', 'remaining time': '3:25:37', 'throughput': '1055.13', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:56:35,865 >> {'loss': 0.5311, 'grad_norm': 6.941257953643799, 'learning_rate': 9.93661611531482e-06, 'epoch': 0.10154525386313466, 'num_input_tokens_seen': 6029308, 'completed': '10.15% (92 / 906)', 'remaining time': '3:25:26', 'throughput': '1051.10', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:56:51,372 >> {'loss': 0.6167, 'grad_norm': 6.328064441680908, 'learning_rate': 9.933837185870526e-06, 'epoch': 0.10264900662251655, 'num_input_tokens_seen': 6094844, 'completed': '10.26% (93 / 906)', 'remaining time': '3:25:14', 'throughput': '1056.60', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:57:06,945 >> {'loss': 0.4391, 'grad_norm': 5.660717010498047, 'learning_rate': 9.930999089078556e-06, 'epoch': 0.10375275938189846, 'num_input_tokens_seen': 6160380, 'completed': '10.38% (94 / 906)', 'remaining time': '3:25:03', 'throughput': '1052.09', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 20:57:22,553 >> {'loss': 0.5525, 'grad_norm': 5.730586051940918, 'learning_rate': 9.928101862811899e-06, 'epoch': 0.10485651214128035, 'num_input_tokens_seen': 6225916, 'completed': '10.49% (95 / 906)', 'remaining time': '3:24:51', 'throughput': '1049.73', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:57:38,173 >> {'loss': 0.5216, 'grad_norm': 5.740535259246826, 'learning_rate': 9.925145545732598e-06, 'epoch': 0.10596026490066225, 'num_input_tokens_seen': 6291452, 'completed': '10.60% (96 / 906)', 'remaining time': '3:24:40', 'throughput': '1048.90', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:57:53,941 >> {'loss': 0.4671, 'grad_norm': 5.135792255401611, 'learning_rate': 9.922130177291228e-06, 'epoch': 0.10706401766004416, 'num_input_tokens_seen': 6356988, 'completed': '10.71% (97 / 906)', 'remaining time': '3:24:30', 'throughput': '1039.08', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:58:09,863 >> {'loss': 1.0276, 'grad_norm': 7.077984809875488, 'learning_rate': 9.919055797726377e-06, 'epoch': 0.10816777041942605, 'num_input_tokens_seen': 6422524, 'completed': '10.82% (98 / 906)', 'remaining time': '3:24:21', 'throughput': '1028.99', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:58:25,628 >> {'loss': 0.8116, 'grad_norm': 7.4840192794799805, 'learning_rate': 9.915922448064111e-06, 'epoch': 0.10927152317880795, 'num_input_tokens_seen': 6488060, 'completed': '10.93% (99 / 906)', 'remaining time': '3:24:11', 'throughput': '1039.26', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 20:58:41,094 >> {'loss': 0.391, 'grad_norm': 4.238962173461914, 'learning_rate': 9.912730170117419e-06, 'epoch': 0.11037527593818984, 'num_input_tokens_seen': 6553596, 'completed': '11.04% (100 / 906)', 'remaining time': '3:23:58', 'throughput': '1059.39', 'gpu_mem_free': '30139MB'}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[INFO|trainer.py:3503] 2025-01-04 20:59:09,418 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-100
[INFO|configuration_utils.py:472] 2025-01-04 20:59:09,422 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-100/config.json
[INFO|configuration_utils.py:807] 2025-01-04 20:59:09,423 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-100/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-04 21:00:16,342 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-100/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-04 21:00:16,346 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-100/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-04 21:00:16,347 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-100/special_tokens_map.json
01/04/2025 21:04:14 - WARNING - streaming.base.dataset - Because `num_canonical_nodes` was not specified, and `shuffle_algo` is py1e, it will default to be equal to physical nodes. Prior to Streaming v0.7.0, `num_canonical_nodes` defaulted to 64 * physical nodes.
01/04/2025 21:04:14 - WARNING - streaming.base.dataset - Because `shuffle_block_size` was not specified, it will default to max(4_000_000 // num_canonical_nodes, 1 << 18) if num_canonical_nodes is not None, otherwise 262144. Prior to Streaming v0.7.0, `shuffle_block_size` defaulted to 262144.
[WARNING|trainer.py:869] 2025-01-04 21:04:14,374 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 400, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-04 21:04:30,629 >> {'loss': 0.4316, 'grad_norm': 5.252936840057373, 'learning_rate': 9.909479006485658e-06, 'epoch': 0.11147902869757174, 'num_input_tokens_seen': 6619132, 'completed': '11.15% (101 / 906)', 'remaining time': '4:08:08', 'throughput': '46.87', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:04:44,198 >> {'loss': 0.4543, 'grad_norm': 5.10698127746582, 'learning_rate': 9.906169000553989e-06, 'epoch': 0.11258278145695365, 'num_input_tokens_seen': 6684668, 'completed': '11.26% (102 / 906)', 'remaining time': '4:07:10', 'throughput': '1207.46', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:04:57,929 >> {'loss': 0.4114, 'grad_norm': 4.487974643707275, 'learning_rate': 9.902800196492788e-06, 'epoch': 0.11368653421633554, 'num_input_tokens_seen': 6750204, 'completed': '11.37% (103 / 906)', 'remaining time': '4:06:15', 'throughput': '1193.23', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:05:11,555 >> {'loss': 0.4567, 'grad_norm': 5.283864974975586, 'learning_rate': 9.89937263925707e-06, 'epoch': 0.11479028697571744, 'num_input_tokens_seen': 6815740, 'completed': '11.48% (104 / 906)', 'remaining time': '4:05:20', 'throughput': '1202.35', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:05:25,159 >> {'loss': 0.8429, 'grad_norm': 7.552896976470947, 'learning_rate': 9.895886374585877e-06, 'epoch': 0.11589403973509933, 'num_input_tokens_seen': 6881276, 'completed': '11.59% (105 / 906)', 'remaining time': '4:04:25', 'throughput': '1204.42', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:05:38,859 >> {'loss': 0.3508, 'grad_norm': 4.633214950561523, 'learning_rate': 9.892341449001673e-06, 'epoch': 0.11699779249448124, 'num_input_tokens_seen': 6946812, 'completed': '11.70% (106 / 906)', 'remaining time': '4:03:32', 'throughput': '1195.91', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:05:52,578 >> {'loss': 0.7319, 'grad_norm': 7.490325450897217, 'learning_rate': 9.888737909809725e-06, 'epoch': 0.11810154525386314, 'num_input_tokens_seen': 7012348, 'completed': '11.81% (107 / 906)', 'remaining time': '4:02:40', 'throughput': '1194.26', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:06:06,430 >> {'loss': 0.5916, 'grad_norm': 6.352238655090332, 'learning_rate': 9.885075805097464e-06, 'epoch': 0.11920529801324503, 'num_input_tokens_seen': 7077884, 'completed': '11.92% (108 / 906)', 'remaining time': '4:01:49', 'throughput': '1182.77', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:06:20,043 >> {'loss': 0.6457, 'grad_norm': 6.156204700469971, 'learning_rate': 9.881355183733857e-06, 'epoch': 0.12030905077262694, 'num_input_tokens_seen': 7143420, 'completed': '12.03% (109 / 906)', 'remaining time': '4:00:58', 'throughput': '1203.57', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:06:33,627 >> {'loss': 0.6972, 'grad_norm': 6.884950637817383, 'learning_rate': 9.877576095368738e-06, 'epoch': 0.12141280353200883, 'num_input_tokens_seen': 7208956, 'completed': '12.14% (110 / 906)', 'remaining time': '4:00:07', 'throughput': '1206.07', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:06:47,154 >> {'loss': 0.6563, 'grad_norm': 6.656806468963623, 'learning_rate': 9.873738590432162e-06, 'epoch': 0.12251655629139073, 'num_input_tokens_seen': 7274492, 'completed': '12.25% (111 / 906)', 'remaining time': '3:59:16', 'throughput': '1211.27', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:07:00,885 >> {'loss': 0.5774, 'grad_norm': 6.270849704742432, 'learning_rate': 9.869842720133715e-06, 'epoch': 0.12362030905077263, 'num_input_tokens_seen': 7340028, 'completed': '12.36% (112 / 906)', 'remaining time': '3:58:27', 'throughput': '1193.17', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:07:14,570 >> {'loss': 0.5853, 'grad_norm': 6.998918533325195, 'learning_rate': 9.865888536461851e-06, 'epoch': 0.12472406181015452, 'num_input_tokens_seen': 7405564, 'completed': '12.47% (113 / 906)', 'remaining time': '3:57:39', 'throughput': '1197.26', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:07:28,156 >> {'loss': 0.634, 'grad_norm': 6.9652299880981445, 'learning_rate': 9.861876092183174e-06, 'epoch': 0.12582781456953643, 'num_input_tokens_seen': 7471100, 'completed': '12.58% (114 / 906)', 'remaining time': '3:56:50', 'throughput': '1205.89', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:07:41,740 >> {'loss': 0.9336, 'grad_norm': 7.60606575012207, 'learning_rate': 9.857805440841758e-06, 'epoch': 0.12693156732891833, 'num_input_tokens_seen': 7536636, 'completed': '12.69% (115 / 906)', 'remaining time': '3:56:02', 'throughput': '1206.14', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:07:55,496 >> {'loss': 0.4438, 'grad_norm': 5.321383953094482, 'learning_rate': 9.853676636758415e-06, 'epoch': 0.1280353200883002, 'num_input_tokens_seen': 7602172, 'completed': '12.80% (116 / 906)', 'remaining time': '3:55:16', 'throughput': '1191.06', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:08:09,177 >> {'loss': 0.7418, 'grad_norm': 6.2487592697143555, 'learning_rate': 9.849489735029975e-06, 'epoch': 0.1291390728476821, 'num_input_tokens_seen': 7667708, 'completed': '12.91% (117 / 906)', 'remaining time': '3:54:30', 'throughput': '1197.59', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:08:22,688 >> {'loss': 0.704, 'grad_norm': 5.844862461090088, 'learning_rate': 9.845244791528563e-06, 'epoch': 0.13024282560706402, 'num_input_tokens_seen': 7733244, 'completed': '13.02% (118 / 906)', 'remaining time': '3:53:43', 'throughput': '1212.61', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:08:36,312 >> {'loss': 0.8102, 'grad_norm': 6.298166275024414, 'learning_rate': 9.840941862900825e-06, 'epoch': 0.13134657836644592, 'num_input_tokens_seen': 7798780, 'completed': '13.13% (119 / 906)', 'remaining time': '3:52:58', 'throughput': '1202.60', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:08:50,053 >> {'loss': 0.4503, 'grad_norm': 5.498508453369141, 'learning_rate': 9.836581006567207e-06, 'epoch': 0.13245033112582782, 'num_input_tokens_seen': 7864316, 'completed': '13.25% (120 / 906)', 'remaining time': '3:52:14', 'throughput': '1192.31', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:09:03,570 >> {'loss': 0.8501, 'grad_norm': 6.927279472351074, 'learning_rate': 9.832162280721157e-06, 'epoch': 0.1335540838852097, 'num_input_tokens_seen': 7929852, 'completed': '13.36% (121 / 906)', 'remaining time': '3:51:29', 'throughput': '1212.13', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:09:17,328 >> {'loss': 0.3771, 'grad_norm': 4.555235385894775, 'learning_rate': 9.827685744328374e-06, 'epoch': 0.1346578366445916, 'num_input_tokens_seen': 7995388, 'completed': '13.47% (122 / 906)', 'remaining time': '3:50:46', 'throughput': '1190.91', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:09:30,965 >> {'loss': 0.8107, 'grad_norm': 6.517794609069824, 'learning_rate': 9.823151457126006e-06, 'epoch': 0.1357615894039735, 'num_input_tokens_seen': 8060924, 'completed': '13.58% (123 / 906)', 'remaining time': '3:50:02', 'throughput': '1201.39', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:09:44,706 >> {'loss': 0.3973, 'grad_norm': 5.046365737915039, 'learning_rate': 9.818559479621851e-06, 'epoch': 0.1368653421633554, 'num_input_tokens_seen': 8126460, 'completed': '13.69% (124 / 906)', 'remaining time': '3:49:20', 'throughput': '1192.36', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:09:58,412 >> {'loss': 0.4441, 'grad_norm': 5.294122695922852, 'learning_rate': 9.813909873093565e-06, 'epoch': 0.13796909492273732, 'num_input_tokens_seen': 8191996, 'completed': '13.80% (125 / 906)', 'remaining time': '3:48:38', 'throughput': '1195.36', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:10:12,050 >> {'loss': 0.4989, 'grad_norm': 5.88740873336792, 'learning_rate': 9.809202699587828e-06, 'epoch': 0.1390728476821192, 'num_input_tokens_seen': 8257532, 'completed': '13.91% (126 / 906)', 'remaining time': '3:47:57', 'throughput': '1201.32', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:10:25,769 >> {'loss': 0.4466, 'grad_norm': 5.602629661560059, 'learning_rate': 9.804438021919525e-06, 'epoch': 0.1401766004415011, 'num_input_tokens_seen': 8323068, 'completed': '14.02% (127 / 906)', 'remaining time': '3:47:16', 'throughput': '1194.34', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:10:39,415 >> {'loss': 0.5564, 'grad_norm': 5.889889240264893, 'learning_rate': 9.799615903670904e-06, 'epoch': 0.141280353200883, 'num_input_tokens_seen': 8388604, 'completed': '14.13% (128 / 906)', 'remaining time': '3:46:35', 'throughput': '1200.59', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:10:53,181 >> {'loss': 0.3879, 'grad_norm': 5.074821949005127, 'learning_rate': 9.794736409190732e-06, 'epoch': 0.1423841059602649, 'num_input_tokens_seen': 8454140, 'completed': '14.24% (129 / 906)', 'remaining time': '3:45:55', 'throughput': '1190.21', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:11:06,848 >> {'loss': 0.4141, 'grad_norm': 5.404687881469727, 'learning_rate': 9.789799603593433e-06, 'epoch': 0.1434878587196468, 'num_input_tokens_seen': 8519676, 'completed': '14.35% (130 / 906)', 'remaining time': '3:45:15', 'throughput': '1198.78', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:11:20,528 >> {'loss': 0.4376, 'grad_norm': 5.067564010620117, 'learning_rate': 9.784805552758213e-06, 'epoch': 0.1445916114790287, 'num_input_tokens_seen': 8585212, 'completed': '14.46% (131 / 906)', 'remaining time': '3:44:35', 'throughput': '1197.65', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:11:34,220 >> {'loss': 0.3985, 'grad_norm': 4.754221439361572, 'learning_rate': 9.779754323328192e-06, 'epoch': 0.1456953642384106, 'num_input_tokens_seen': 8650748, 'completed': '14.57% (132 / 906)', 'remaining time': '3:43:56', 'throughput': '1196.61', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:11:47,765 >> {'loss': 0.6592, 'grad_norm': 5.799423694610596, 'learning_rate': 9.77464598270951e-06, 'epoch': 0.1467991169977925, 'num_input_tokens_seen': 8716284, 'completed': '14.68% (133 / 906)', 'remaining time': '3:43:17', 'throughput': '1209.62', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:12:01,339 >> {'loss': 0.5591, 'grad_norm': 5.599109172821045, 'learning_rate': 9.76948059907043e-06, 'epoch': 0.1479028697571744, 'num_input_tokens_seen': 8781820, 'completed': '14.79% (134 / 906)', 'remaining time': '3:42:38', 'throughput': '1206.98', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:12:14,890 >> {'loss': 0.5448, 'grad_norm': 5.459147930145264, 'learning_rate': 9.764258241340421e-06, 'epoch': 0.1490066225165563, 'num_input_tokens_seen': 8847356, 'completed': '14.90% (135 / 906)', 'remaining time': '3:41:59', 'throughput': '1209.07', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:12:28,533 >> {'loss': 0.4279, 'grad_norm': 4.705541133880615, 'learning_rate': 9.758978979209243e-06, 'epoch': 0.15011037527593818, 'num_input_tokens_seen': 8912892, 'completed': '15.01% (136 / 906)', 'remaining time': '3:41:21', 'throughput': '1200.97', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:12:42,122 >> {'loss': 0.559, 'grad_norm': 5.681971073150635, 'learning_rate': 9.753642883126018e-06, 'epoch': 0.15121412803532008, 'num_input_tokens_seen': 8978428, 'completed': '15.12% (137 / 906)', 'remaining time': '3:40:43', 'throughput': '1205.61', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:12:55,696 >> {'loss': 0.4626, 'grad_norm': 4.888025760650635, 'learning_rate': 9.748250024298291e-06, 'epoch': 0.152317880794702, 'num_input_tokens_seen': 9043964, 'completed': '15.23% (138 / 906)', 'remaining time': '3:40:06', 'throughput': '1207.06', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:13:09,425 >> {'loss': 0.4097, 'grad_norm': 5.069499969482422, 'learning_rate': 9.742800474691075e-06, 'epoch': 0.1534216335540839, 'num_input_tokens_seen': 9109500, 'completed': '15.34% (139 / 906)', 'remaining time': '3:39:30', 'throughput': '1193.39', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:13:23,153 >> {'loss': 0.6859, 'grad_norm': 6.779152870178223, 'learning_rate': 9.73729430702589e-06, 'epoch': 0.1545253863134658, 'num_input_tokens_seen': 9175036, 'completed': '15.45% (140 / 906)', 'remaining time': '3:38:54', 'throughput': '1193.43', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:13:36,824 >> {'loss': 0.6244, 'grad_norm': 5.773662567138672, 'learning_rate': 9.731731594779807e-06, 'epoch': 0.15562913907284767, 'num_input_tokens_seen': 9240568, 'completed': '15.56% (141 / 906)', 'remaining time': '3:38:18', 'throughput': '1198.43', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:13:50,455 >> {'loss': 0.3684, 'grad_norm': 4.693535804748535, 'learning_rate': 9.726112412184441e-06, 'epoch': 0.15673289183222958, 'num_input_tokens_seen': 9306104, 'completed': '15.67% (142 / 906)', 'remaining time': '3:37:42', 'throughput': '1201.93', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:14:04,162 >> {'loss': 0.4194, 'grad_norm': 4.742260932922363, 'learning_rate': 9.72043683422499e-06, 'epoch': 0.15783664459161148, 'num_input_tokens_seen': 9371640, 'completed': '15.78% (143 / 906)', 'remaining time': '3:37:06', 'throughput': '1195.30', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:14:17,757 >> {'loss': 0.4763, 'grad_norm': 5.286125659942627, 'learning_rate': 9.71470493663921e-06, 'epoch': 0.15894039735099338, 'num_input_tokens_seen': 9437176, 'completed': '15.89% (144 / 906)', 'remaining time': '3:36:31', 'throughput': '1205.16', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:14:31,382 >> {'loss': 0.4952, 'grad_norm': 5.083088397979736, 'learning_rate': 9.708916795916418e-06, 'epoch': 0.1600441501103753, 'num_input_tokens_seen': 9502712, 'completed': '16.00% (145 / 906)', 'remaining time': '3:35:56', 'throughput': '1202.48', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:14:45,082 >> {'loss': 0.4201, 'grad_norm': 4.462346076965332, 'learning_rate': 9.703072489296467e-06, 'epoch': 0.16114790286975716, 'num_input_tokens_seen': 9568248, 'completed': '16.11% (146 / 906)', 'remaining time': '3:35:22', 'throughput': '1195.92', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:14:58,732 >> {'loss': 0.5632, 'grad_norm': 5.618736267089844, 'learning_rate': 9.697172094768717e-06, 'epoch': 0.16225165562913907, 'num_input_tokens_seen': 9633784, 'completed': '16.23% (147 / 906)', 'remaining time': '3:34:47', 'throughput': '1200.32', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:15:12,431 >> {'loss': 0.5519, 'grad_norm': 5.792170524597168, 'learning_rate': 9.691215691070994e-06, 'epoch': 0.16335540838852097, 'num_input_tokens_seen': 9699320, 'completed': '16.34% (148 / 906)', 'remaining time': '3:34:13', 'throughput': '1195.99', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:15:26,067 >> {'loss': 0.5686, 'grad_norm': 6.814916133880615, 'learning_rate': 9.685203357688536e-06, 'epoch': 0.16445916114790288, 'num_input_tokens_seen': 9764856, 'completed': '16.45% (149 / 906)', 'remaining time': '3:33:40', 'throughput': '1201.57', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:15:39,697 >> {'loss': 0.5319, 'grad_norm': 5.402688503265381, 'learning_rate': 9.679135174852934e-06, 'epoch': 0.16556291390728478, 'num_input_tokens_seen': 9830392, 'completed': '16.56% (150 / 906)', 'remaining time': '3:33:06', 'throughput': '1202.00', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:15:53,252 >> {'loss': 0.6354, 'grad_norm': 6.655113220214844, 'learning_rate': 9.673011223541067e-06, 'epoch': 0.16666666666666666, 'num_input_tokens_seen': 9895928, 'completed': '16.67% (151 / 906)', 'remaining time': '3:32:32', 'throughput': '1208.76', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:16:06,731 >> {'loss': 0.8729, 'grad_norm': 6.647359371185303, 'learning_rate': 9.666831585474012e-06, 'epoch': 0.16777041942604856, 'num_input_tokens_seen': 9961464, 'completed': '16.78% (152 / 906)', 'remaining time': '3:31:59', 'throughput': '1215.52', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:16:20,153 >> {'loss': 1.1005, 'grad_norm': 8.241610527038574, 'learning_rate': 9.660596343115958e-06, 'epoch': 0.16887417218543047, 'num_input_tokens_seen': 10027000, 'completed': '16.89% (153 / 906)', 'remaining time': '3:31:25', 'throughput': '1220.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:16:33,774 >> {'loss': 0.5956, 'grad_norm': 5.4942626953125, 'learning_rate': 9.65430557967311e-06, 'epoch': 0.16997792494481237, 'num_input_tokens_seen': 10092536, 'completed': '17.00% (154 / 906)', 'remaining time': '3:30:52', 'throughput': '1202.82', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:16:47,452 >> {'loss': 0.4986, 'grad_norm': 5.064987659454346, 'learning_rate': 9.647959379092568e-06, 'epoch': 0.17108167770419427, 'num_input_tokens_seen': 10158072, 'completed': '17.11% (155 / 906)', 'remaining time': '3:30:20', 'throughput': '1197.87', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:17:01,158 >> {'loss': 0.3439, 'grad_norm': 3.928574323654175, 'learning_rate': 9.641557826061218e-06, 'epoch': 0.17218543046357615, 'num_input_tokens_seen': 10223608, 'completed': '17.22% (156 / 906)', 'remaining time': '3:29:48', 'throughput': '1195.41', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:17:14,814 >> {'loss': 0.4185, 'grad_norm': 4.649381160736084, 'learning_rate': 9.635101006004596e-06, 'epoch': 0.17328918322295805, 'num_input_tokens_seen': 10289144, 'completed': '17.33% (157 / 906)', 'remaining time': '3:29:17', 'throughput': '1199.71', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:17:28,541 >> {'loss': 0.6933, 'grad_norm': 6.195332050323486, 'learning_rate': 9.628589005085745e-06, 'epoch': 0.17439293598233996, 'num_input_tokens_seen': 10354680, 'completed': '17.44% (158 / 906)', 'remaining time': '3:28:45', 'throughput': '1193.61', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:17:42,098 >> {'loss': 0.525, 'grad_norm': 5.165769577026367, 'learning_rate': 9.622021910204074e-06, 'epoch': 0.17549668874172186, 'num_input_tokens_seen': 10420216, 'completed': '17.55% (159 / 906)', 'remaining time': '3:28:14', 'throughput': '1208.51', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:17:55,701 >> {'loss': 0.4611, 'grad_norm': 5.248317241668701, 'learning_rate': 9.615399808994192e-06, 'epoch': 0.17660044150110377, 'num_input_tokens_seen': 10485752, 'completed': '17.66% (160 / 906)', 'remaining time': '3:27:42', 'throughput': '1204.42', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:18:09,434 >> {'loss': 0.4007, 'grad_norm': 4.948761940002441, 'learning_rate': 9.608722789824739e-06, 'epoch': 0.17770419426048564, 'num_input_tokens_seen': 10551288, 'completed': '17.77% (161 / 906)', 'remaining time': '3:27:12', 'throughput': '1193.02', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:18:23,089 >> {'loss': 0.6054, 'grad_norm': 6.128936290740967, 'learning_rate': 9.601990941797208e-06, 'epoch': 0.17880794701986755, 'num_input_tokens_seen': 10616824, 'completed': '17.88% (162 / 906)', 'remaining time': '3:26:41', 'throughput': '1199.86', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:18:36,867 >> {'loss': 0.5762, 'grad_norm': 6.17040491104126, 'learning_rate': 9.595204354744756e-06, 'epoch': 0.17991169977924945, 'num_input_tokens_seen': 10682360, 'completed': '17.99% (163 / 906)', 'remaining time': '3:26:12', 'throughput': '1189.15', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:18:50,690 >> {'loss': 0.5555, 'grad_norm': 5.761812210083008, 'learning_rate': 9.588363119231004e-06, 'epoch': 0.18101545253863136, 'num_input_tokens_seen': 10747896, 'completed': '18.10% (164 / 906)', 'remaining time': '3:25:42', 'throughput': '1185.33', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:19:04,974 >> {'loss': 0.3842, 'grad_norm': 5.079399108886719, 'learning_rate': 9.581467326548834e-06, 'epoch': 0.18211920529801323, 'num_input_tokens_seen': 10813432, 'completed': '18.21% (165 / 906)', 'remaining time': '3:25:15', 'throughput': '1146.97', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:19:18,825 >> {'loss': 0.6035, 'grad_norm': 5.7329607009887695, 'learning_rate': 9.57451706871916e-06, 'epoch': 0.18322295805739514, 'num_input_tokens_seen': 10878968, 'completed': '18.32% (166 / 906)', 'remaining time': '3:24:46', 'throughput': '1182.92', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:19:32,428 >> {'loss': 0.4072, 'grad_norm': 5.196740627288818, 'learning_rate': 9.567512438489711e-06, 'epoch': 0.18432671081677704, 'num_input_tokens_seen': 10944504, 'completed': '18.43% (167 / 906)', 'remaining time': '3:24:16', 'throughput': '1204.38', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:19:45,762 >> {'loss': 0.4642, 'grad_norm': 5.155704498291016, 'learning_rate': 9.560453529333787e-06, 'epoch': 0.18543046357615894, 'num_input_tokens_seen': 11010040, 'completed': '18.54% (168 / 906)', 'remaining time': '3:23:45', 'throughput': '1228.77', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:19:59,060 >> {'loss': 0.8744, 'grad_norm': 6.3926568031311035, 'learning_rate': 9.55334043544901e-06, 'epoch': 0.18653421633554085, 'num_input_tokens_seen': 11075576, 'completed': '18.65% (169 / 906)', 'remaining time': '3:23:14', 'throughput': '1232.06', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:20:12,392 >> {'loss': 0.2314, 'grad_norm': 3.620082378387451, 'learning_rate': 9.546173251756076e-06, 'epoch': 0.18763796909492272, 'num_input_tokens_seen': 11141112, 'completed': '18.76% (170 / 906)', 'remaining time': '3:22:44', 'throughput': '1228.89', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:20:25,708 >> {'loss': 0.3696, 'grad_norm': 5.067852973937988, 'learning_rate': 9.538952073897477e-06, 'epoch': 0.18874172185430463, 'num_input_tokens_seen': 11206648, 'completed': '18.87% (171 / 906)', 'remaining time': '3:22:14', 'throughput': '1230.45', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:20:38,978 >> {'loss': 0.4657, 'grad_norm': 4.713092803955078, 'learning_rate': 9.531676998236236e-06, 'epoch': 0.18984547461368653, 'num_input_tokens_seen': 11272184, 'completed': '18.98% (172 / 906)', 'remaining time': '3:21:43', 'throughput': '1234.64', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:20:52,227 >> {'loss': 0.5065, 'grad_norm': 5.534994125366211, 'learning_rate': 9.52434812185461e-06, 'epoch': 0.19094922737306844, 'num_input_tokens_seen': 11337720, 'completed': '19.09% (173 / 906)', 'remaining time': '3:21:13', 'throughput': '1236.60', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:21:05,504 >> {'loss': 0.4614, 'grad_norm': 5.428729057312012, 'learning_rate': 9.516965542552804e-06, 'epoch': 0.19205298013245034, 'num_input_tokens_seen': 11403256, 'completed': '19.21% (174 / 906)', 'remaining time': '3:20:43', 'throughput': '1234.05', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:21:18,751 >> {'loss': 0.383, 'grad_norm': 4.462728977203369, 'learning_rate': 9.509529358847655e-06, 'epoch': 0.19315673289183222, 'num_input_tokens_seen': 11468792, 'completed': '19.32% (175 / 906)', 'remaining time': '3:20:13', 'throughput': '1236.80', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:21:32,095 >> {'loss': 0.3162, 'grad_norm': 4.043033123016357, 'learning_rate': 9.502039669971336e-06, 'epoch': 0.19426048565121412, 'num_input_tokens_seen': 11534328, 'completed': '19.43% (176 / 906)', 'remaining time': '3:19:44', 'throughput': '1227.81', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:21:45,385 >> {'loss': 0.4098, 'grad_norm': 4.644472122192383, 'learning_rate': 9.494496575870007e-06, 'epoch': 0.19536423841059603, 'num_input_tokens_seen': 11599864, 'completed': '19.54% (177 / 906)', 'remaining time': '3:19:15', 'throughput': '1232.76', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:21:58,640 >> {'loss': 0.4331, 'grad_norm': 4.948442459106445, 'learning_rate': 9.486900177202503e-06, 'epoch': 0.19646799116997793, 'num_input_tokens_seen': 11665400, 'completed': '19.65% (178 / 906)', 'remaining time': '3:18:45', 'throughput': '1236.14', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:22:12,061 >> {'loss': 0.5996, 'grad_norm': 6.079590320587158, 'learning_rate': 9.479250575338977e-06, 'epoch': 0.19757174392935983, 'num_input_tokens_seen': 11730936, 'completed': '19.76% (179 / 906)', 'remaining time': '3:18:17', 'throughput': '1220.70', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:22:25,457 >> {'loss': 0.7914, 'grad_norm': 6.818512439727783, 'learning_rate': 9.471547872359552e-06, 'epoch': 0.1986754966887417, 'num_input_tokens_seen': 11796472, 'completed': '19.87% (180 / 906)', 'remaining time': '3:17:49', 'throughput': '1223.08', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:22:38,946 >> {'loss': 0.7554, 'grad_norm': 6.684516429901123, 'learning_rate': 9.463792171052965e-06, 'epoch': 0.1997792494481236, 'num_input_tokens_seen': 11862008, 'completed': '19.98% (181 / 906)', 'remaining time': '3:17:21', 'throughput': '1214.60', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:22:52,312 >> {'loss': 0.3983, 'grad_norm': 4.4812517166137695, 'learning_rate': 9.45598357491518e-06, 'epoch': 0.20088300220750552, 'num_input_tokens_seen': 11927544, 'completed': '20.09% (182 / 906)', 'remaining time': '3:16:53', 'throughput': '1225.79', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:23:05,728 >> {'loss': 0.5978, 'grad_norm': 5.318398475646973, 'learning_rate': 9.448122188148026e-06, 'epoch': 0.20198675496688742, 'num_input_tokens_seen': 11993080, 'completed': '20.20% (183 / 906)', 'remaining time': '3:16:25', 'throughput': '1221.28', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:23:19,116 >> {'loss': 0.4899, 'grad_norm': 5.358767032623291, 'learning_rate': 9.440208115657789e-06, 'epoch': 0.20309050772626933, 'num_input_tokens_seen': 12058616, 'completed': '20.31% (184 / 906)', 'remaining time': '3:15:57', 'throughput': '1223.73', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:23:32,580 >> {'loss': 0.3754, 'grad_norm': 4.549075126647949, 'learning_rate': 9.432241463053823e-06, 'epoch': 0.2041942604856512, 'num_input_tokens_seen': 12124152, 'completed': '20.42% (185 / 906)', 'remaining time': '3:15:30', 'throughput': '1216.94', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:23:45,745 >> {'loss': 0.8162, 'grad_norm': 6.748514175415039, 'learning_rate': 9.424222336647135e-06, 'epoch': 0.2052980132450331, 'num_input_tokens_seen': 12189688, 'completed': '20.53% (186 / 906)', 'remaining time': '3:15:02', 'throughput': '1244.49', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:23:59,089 >> {'loss': 0.506, 'grad_norm': 5.219625473022461, 'learning_rate': 9.416150843448974e-06, 'epoch': 0.206401766004415, 'num_input_tokens_seen': 12255224, 'completed': '20.64% (187 / 906)', 'remaining time': '3:14:34', 'throughput': '1227.82', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:24:12,392 >> {'loss': 0.437, 'grad_norm': 5.062671661376953, 'learning_rate': 9.408027091169391e-06, 'epoch': 0.20750551876379691, 'num_input_tokens_seen': 12320760, 'completed': '20.75% (188 / 906)', 'remaining time': '3:14:07', 'throughput': '1231.59', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:24:25,584 >> {'loss': 0.3918, 'grad_norm': 4.459550857543945, 'learning_rate': 9.399851188215815e-06, 'epoch': 0.20860927152317882, 'num_input_tokens_seen': 12386296, 'completed': '20.86% (189 / 906)', 'remaining time': '3:13:39', 'throughput': '1241.97', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:24:38,982 >> {'loss': 0.4505, 'grad_norm': 5.297883987426758, 'learning_rate': 9.391623243691595e-06, 'epoch': 0.2097130242825607, 'num_input_tokens_seen': 12451832, 'completed': '20.97% (190 / 906)', 'remaining time': '3:13:12', 'throughput': '1222.89', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:24:52,290 >> {'loss': 0.4077, 'grad_norm': 4.688574314117432, 'learning_rate': 9.38334336739455e-06, 'epoch': 0.2108167770419426, 'num_input_tokens_seen': 12517368, 'completed': '21.08% (191 / 906)', 'remaining time': '3:12:45', 'throughput': '1231.12', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:25:05,769 >> {'loss': 0.6966, 'grad_norm': 6.45738410949707, 'learning_rate': 9.375011669815504e-06, 'epoch': 0.2119205298013245, 'num_input_tokens_seen': 12582904, 'completed': '21.19% (192 / 906)', 'remaining time': '3:12:19', 'throughput': '1215.54', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:25:19,155 >> {'loss': 0.499, 'grad_norm': 5.424302577972412, 'learning_rate': 9.366628262136808e-06, 'epoch': 0.2130242825607064, 'num_input_tokens_seen': 12648440, 'completed': '21.30% (193 / 906)', 'remaining time': '3:11:53', 'throughput': '1223.93', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:25:32,200 >> {'loss': 0.6072, 'grad_norm': 5.937360763549805, 'learning_rate': 9.35819325623086e-06, 'epoch': 0.2141280353200883, 'num_input_tokens_seen': 12713976, 'completed': '21.41% (194 / 906)', 'remaining time': '3:11:25', 'throughput': '1255.94', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:25:45,429 >> {'loss': 0.4112, 'grad_norm': 5.085829734802246, 'learning_rate': 9.34970676465861e-06, 'epoch': 0.2152317880794702, 'num_input_tokens_seen': 12779512, 'completed': '21.52% (195 / 906)', 'remaining time': '3:10:58', 'throughput': '1238.51', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:25:58,678 >> {'loss': 0.3946, 'grad_norm': 4.746096611022949, 'learning_rate': 9.34116890066806e-06, 'epoch': 0.2163355408388521, 'num_input_tokens_seen': 12845048, 'completed': '21.63% (196 / 906)', 'remaining time': '3:10:32', 'throughput': '1236.60', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:26:11,895 >> {'loss': 0.425, 'grad_norm': 4.703492641448975, 'learning_rate': 9.332579778192749e-06, 'epoch': 0.217439293598234, 'num_input_tokens_seen': 12910584, 'completed': '21.74% (197 / 906)', 'remaining time': '3:10:05', 'throughput': '1239.67', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:26:24,987 >> {'loss': 0.6691, 'grad_norm': 5.84251594543457, 'learning_rate': 9.323939511850237e-06, 'epoch': 0.2185430463576159, 'num_input_tokens_seen': 12976120, 'completed': '21.85% (198 / 906)', 'remaining time': '3:09:39', 'throughput': '1251.38', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:26:38,229 >> {'loss': 0.3155, 'grad_norm': 3.8289952278137207, 'learning_rate': 9.31524821694057e-06, 'epoch': 0.2196467991169978, 'num_input_tokens_seen': 13041656, 'completed': '21.96% (199 / 906)', 'remaining time': '3:09:13', 'throughput': '1237.30', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:26:51,476 >> {'loss': 0.2806, 'grad_norm': 3.90834379196167, 'learning_rate': 9.30650600944475e-06, 'epoch': 0.22075055187637968, 'num_input_tokens_seen': 13107192, 'completed': '22.08% (200 / 906)', 'remaining time': '3:08:47', 'throughput': '1236.81', 'gpu_mem_free': '30131MB'}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[INFO|trainer.py:3503] 2025-01-04 21:27:17,351 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-200
[INFO|configuration_utils.py:472] 2025-01-04 21:27:17,354 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-200/config.json
[INFO|configuration_utils.py:807] 2025-01-04 21:27:17,355 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-200/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-04 21:28:14,185 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-200/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-04 21:28:14,188 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-200/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-04 21:28:14,189 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-200/special_tokens_map.json
[WARNING|trainer.py:869] 2025-01-04 21:32:05,648 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 800, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-04 21:32:20,034 >> {'loss': 0.3854, 'grad_norm': 4.902582168579102, 'learning_rate': 9.297713006023183e-06, 'epoch': 0.22185430463576158, 'num_input_tokens_seen': 13172728, 'completed': '22.19% (201 / 906)', 'remaining time': '3:26:47', 'throughput': '49.87', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:32:33,811 >> {'loss': 0.558, 'grad_norm': 6.226442813873291, 'learning_rate': 9.28886932401411e-06, 'epoch': 0.2229580573951435, 'num_input_tokens_seen': 13238264, 'completed': '22.30% (202 / 906)', 'remaining time': '3:26:16', 'throughput': '1189.23', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:32:47,623 >> {'loss': 0.4484, 'grad_norm': 5.094548225402832, 'learning_rate': 9.279975081432063e-06, 'epoch': 0.2240618101545254, 'num_input_tokens_seen': 13303800, 'completed': '22.41% (203 / 906)', 'remaining time': '3:25:45', 'throughput': '1186.23', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:33:01,324 >> {'loss': 0.4513, 'grad_norm': 4.96065616607666, 'learning_rate': 9.27103039696628e-06, 'epoch': 0.2251655629139073, 'num_input_tokens_seen': 13369336, 'completed': '22.52% (204 / 906)', 'remaining time': '3:25:14', 'throughput': '1195.82', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:33:15,041 >> {'loss': 0.2415, 'grad_norm': 3.6584391593933105, 'learning_rate': 9.262035389979113e-06, 'epoch': 0.22626931567328917, 'num_input_tokens_seen': 13434872, 'completed': '22.63% (205 / 906)', 'remaining time': '3:24:44', 'throughput': '1194.39', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:33:28,638 >> {'loss': 0.5035, 'grad_norm': 5.138461112976074, 'learning_rate': 9.252990180504451e-06, 'epoch': 0.22737306843267108, 'num_input_tokens_seen': 13500408, 'completed': '22.74% (206 / 906)', 'remaining time': '3:24:13', 'throughput': '1204.98', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:33:42,307 >> {'loss': 0.4478, 'grad_norm': 4.905426025390625, 'learning_rate': 9.243894889246106e-06, 'epoch': 0.22847682119205298, 'num_input_tokens_seen': 13565944, 'completed': '22.85% (207 / 906)', 'remaining time': '3:23:42', 'throughput': '1198.68', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:33:56,130 >> {'loss': 0.4719, 'grad_norm': 4.798696517944336, 'learning_rate': 9.234749637576206e-06, 'epoch': 0.22958057395143489, 'num_input_tokens_seen': 13631480, 'completed': '22.96% (208 / 906)', 'remaining time': '3:23:13', 'throughput': '1185.22', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:34:09,856 >> {'loss': 0.6204, 'grad_norm': 5.787559986114502, 'learning_rate': 9.22555454753358e-06, 'epoch': 0.2306843267108168, 'num_input_tokens_seen': 13697016, 'completed': '23.07% (209 / 906)', 'remaining time': '3:22:43', 'throughput': '1193.64', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:34:23,587 >> {'loss': 0.641, 'grad_norm': 6.1379265785217285, 'learning_rate': 9.216309741822119e-06, 'epoch': 0.23178807947019867, 'num_input_tokens_seen': 13762552, 'completed': '23.18% (210 / 906)', 'remaining time': '3:22:13', 'throughput': '1193.27', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:34:37,397 >> {'loss': 0.6088, 'grad_norm': 6.323113441467285, 'learning_rate': 9.20701534380915e-06, 'epoch': 0.23289183222958057, 'num_input_tokens_seen': 13828088, 'completed': '23.29% (211 / 906)', 'remaining time': '3:21:43', 'throughput': '1186.36', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:34:51,120 >> {'loss': 0.6663, 'grad_norm': 8.215484619140625, 'learning_rate': 9.197671477523785e-06, 'epoch': 0.23399558498896247, 'num_input_tokens_seen': 13893624, 'completed': '23.40% (212 / 906)', 'remaining time': '3:21:14', 'throughput': '1193.92', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:35:04,940 >> {'loss': 0.6261, 'grad_norm': 5.999568462371826, 'learning_rate': 9.188278267655255e-06, 'epoch': 0.23509933774834438, 'num_input_tokens_seen': 13959160, 'completed': '23.51% (213 / 906)', 'remaining time': '3:20:45', 'throughput': '1185.52', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:35:18,634 >> {'loss': 0.7274, 'grad_norm': 6.217502593994141, 'learning_rate': 9.178835839551273e-06, 'epoch': 0.23620309050772628, 'num_input_tokens_seen': 14024696, 'completed': '23.62% (214 / 906)', 'remaining time': '3:20:16', 'throughput': '1196.44', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:35:32,392 >> {'loss': 0.4803, 'grad_norm': 5.5007405281066895, 'learning_rate': 9.169344319216334e-06, 'epoch': 0.23730684326710816, 'num_input_tokens_seen': 14090232, 'completed': '23.73% (215 / 906)', 'remaining time': '3:19:47', 'throughput': '1190.86', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:35:46,164 >> {'loss': 0.432, 'grad_norm': 4.78103494644165, 'learning_rate': 9.159803833310046e-06, 'epoch': 0.23841059602649006, 'num_input_tokens_seen': 14155768, 'completed': '23.84% (216 / 906)', 'remaining time': '3:19:18', 'throughput': '1189.63', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:35:59,926 >> {'loss': 0.391, 'grad_norm': 4.399326801300049, 'learning_rate': 9.150214509145439e-06, 'epoch': 0.23951434878587197, 'num_input_tokens_seen': 14221304, 'completed': '23.95% (217 / 906)', 'remaining time': '3:18:49', 'throughput': '1190.55', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:36:13,618 >> {'loss': 0.5047, 'grad_norm': 5.642846584320068, 'learning_rate': 9.140576474687263e-06, 'epoch': 0.24061810154525387, 'num_input_tokens_seen': 14286840, 'completed': '24.06% (218 / 906)', 'remaining time': '3:18:20', 'throughput': '1196.59', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:36:27,273 >> {'loss': 0.6133, 'grad_norm': 5.760812282562256, 'learning_rate': 9.13088985855029e-06, 'epoch': 0.24172185430463577, 'num_input_tokens_seen': 14352376, 'completed': '24.17% (219 / 906)', 'remaining time': '3:17:52', 'throughput': '1199.91', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:36:40,899 >> {'loss': 0.4889, 'grad_norm': 5.308858394622803, 'learning_rate': 9.121154789997583e-06, 'epoch': 0.24282560706401765, 'num_input_tokens_seen': 14417912, 'completed': '24.28% (220 / 906)', 'remaining time': '3:17:23', 'throughput': '1202.36', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:36:54,624 >> {'loss': 0.4994, 'grad_norm': 5.612556457519531, 'learning_rate': 9.11137139893878e-06, 'epoch': 0.24392935982339956, 'num_input_tokens_seen': 14483448, 'completed': '24.39% (221 / 906)', 'remaining time': '3:16:55', 'throughput': '1193.77', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:37:08,189 >> {'loss': 0.4871, 'grad_norm': 4.570278167724609, 'learning_rate': 9.101539815928358e-06, 'epoch': 0.24503311258278146, 'num_input_tokens_seen': 14548984, 'completed': '24.50% (222 / 906)', 'remaining time': '3:16:26', 'throughput': '1207.80', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:37:21,765 >> {'loss': 0.8181, 'grad_norm': 6.837000846862793, 'learning_rate': 9.091660172163894e-06, 'epoch': 0.24613686534216336, 'num_input_tokens_seen': 14614520, 'completed': '24.61% (223 / 906)', 'remaining time': '3:15:58', 'throughput': '1206.86', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:37:35,343 >> {'loss': 0.8306, 'grad_norm': 7.090153217315674, 'learning_rate': 9.08173259948431e-06, 'epoch': 0.24724061810154527, 'num_input_tokens_seen': 14680056, 'completed': '24.72% (224 / 906)', 'remaining time': '3:15:29', 'throughput': '1206.58', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:37:48,963 >> {'loss': 0.4458, 'grad_norm': 4.480409622192383, 'learning_rate': 9.071757230368117e-06, 'epoch': 0.24834437086092714, 'num_input_tokens_seen': 14745592, 'completed': '24.83% (225 / 906)', 'remaining time': '3:15:01', 'throughput': '1202.97', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:38:02,735 >> {'loss': 0.5, 'grad_norm': 5.686323642730713, 'learning_rate': 9.061734197931645e-06, 'epoch': 0.24944812362030905, 'num_input_tokens_seen': 14811128, 'completed': '24.94% (226 / 906)', 'remaining time': '3:14:34', 'throughput': '1189.63', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:38:16,377 >> {'loss': 0.5392, 'grad_norm': 5.53110933303833, 'learning_rate': 9.051663635927265e-06, 'epoch': 0.25055187637969095, 'num_input_tokens_seen': 14876664, 'completed': '25.06% (227 / 906)', 'remaining time': '3:14:06', 'throughput': '1201.02', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:38:30,008 >> {'loss': 0.3774, 'grad_norm': 4.834395408630371, 'learning_rate': 9.04154567874161e-06, 'epoch': 0.25165562913907286, 'num_input_tokens_seen': 14942200, 'completed': '25.17% (228 / 906)', 'remaining time': '3:13:39', 'throughput': '1201.96', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:38:43,632 >> {'loss': 0.5263, 'grad_norm': 5.39569091796875, 'learning_rate': 9.031380461393774e-06, 'epoch': 0.25275938189845476, 'num_input_tokens_seen': 15007736, 'completed': '25.28% (229 / 906)', 'remaining time': '3:13:11', 'throughput': '1202.59', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:38:57,343 >> {'loss': 0.3951, 'grad_norm': 4.755312919616699, 'learning_rate': 9.021168119533522e-06, 'epoch': 0.25386313465783666, 'num_input_tokens_seen': 15073272, 'completed': '25.39% (230 / 906)', 'remaining time': '3:12:44', 'throughput': '1194.96', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:39:10,997 >> {'loss': 0.6618, 'grad_norm': 5.514160633087158, 'learning_rate': 9.010908789439463e-06, 'epoch': 0.25496688741721857, 'num_input_tokens_seen': 15138808, 'completed': '25.50% (231 / 906)', 'remaining time': '3:12:17', 'throughput': '1199.98', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:39:24,623 >> {'loss': 0.4315, 'grad_norm': 4.2921552658081055, 'learning_rate': 9.000602608017243e-06, 'epoch': 0.2560706401766004, 'num_input_tokens_seen': 15204344, 'completed': '25.61% (232 / 906)', 'remaining time': '3:11:50', 'throughput': '1202.35', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:39:38,235 >> {'loss': 0.484, 'grad_norm': 5.0842108726501465, 'learning_rate': 8.99024971279772e-06, 'epoch': 0.2571743929359823, 'num_input_tokens_seen': 15269880, 'completed': '25.72% (233 / 906)', 'remaining time': '3:11:23', 'throughput': '1203.65', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:39:51,801 >> {'loss': 0.7677, 'grad_norm': 7.238823890686035, 'learning_rate': 8.979850241935122e-06, 'epoch': 0.2582781456953642, 'num_input_tokens_seen': 15335416, 'completed': '25.83% (234 / 906)', 'remaining time': '3:10:55', 'throughput': '1207.70', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:40:05,569 >> {'loss': 0.3926, 'grad_norm': 4.493621349334717, 'learning_rate': 8.969404334205203e-06, 'epoch': 0.25938189845474613, 'num_input_tokens_seen': 15400952, 'completed': '25.94% (235 / 906)', 'remaining time': '3:10:29', 'throughput': '1190.02', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:40:19,317 >> {'loss': 0.5586, 'grad_norm': 5.966931343078613, 'learning_rate': 8.958912129003395e-06, 'epoch': 0.26048565121412803, 'num_input_tokens_seen': 15466488, 'completed': '26.05% (236 / 906)', 'remaining time': '3:10:03', 'throughput': '1191.78', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:40:33,014 >> {'loss': 0.7247, 'grad_norm': 6.150123596191406, 'learning_rate': 8.948373766342952e-06, 'epoch': 0.26158940397350994, 'num_input_tokens_seen': 15532024, 'completed': '26.16% (237 / 906)', 'remaining time': '3:09:36', 'throughput': '1196.14', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:40:46,954 >> {'loss': 0.452, 'grad_norm': 5.65477180480957, 'learning_rate': 8.937789386853067e-06, 'epoch': 0.26269315673289184, 'num_input_tokens_seen': 15597560, 'completed': '26.27% (238 / 906)', 'remaining time': '3:09:11', 'throughput': '1175.32', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:41:00,685 >> {'loss': 0.6872, 'grad_norm': 6.099196910858154, 'learning_rate': 8.927159131777013e-06, 'epoch': 0.26379690949227375, 'num_input_tokens_seen': 15663096, 'completed': '26.38% (239 / 906)', 'remaining time': '3:08:45', 'throughput': '1193.23', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:41:14,336 >> {'loss': 0.6076, 'grad_norm': 5.515918254852295, 'learning_rate': 8.916483142970244e-06, 'epoch': 0.26490066225165565, 'num_input_tokens_seen': 15728632, 'completed': '26.49% (240 / 906)', 'remaining time': '3:08:18', 'throughput': '1200.25', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:41:28,017 >> {'loss': 0.5676, 'grad_norm': 5.490879058837891, 'learning_rate': 8.905761562898514e-06, 'epoch': 0.26600441501103755, 'num_input_tokens_seen': 15794168, 'completed': '26.60% (241 / 906)', 'remaining time': '3:07:52', 'throughput': '1197.52', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:41:41,658 >> {'loss': 0.3909, 'grad_norm': 4.504538536071777, 'learning_rate': 8.894994534635962e-06, 'epoch': 0.2671081677704194, 'num_input_tokens_seen': 15859704, 'completed': '26.71% (242 / 906)', 'remaining time': '3:07:26', 'throughput': '1201.13', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:41:55,139 >> {'loss': 0.6022, 'grad_norm': 5.765637397766113, 'learning_rate': 8.884182201863218e-06, 'epoch': 0.2682119205298013, 'num_input_tokens_seen': 15925240, 'completed': '26.82% (243 / 906)', 'remaining time': '3:07:00', 'throughput': '1215.35', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:42:08,652 >> {'loss': 0.3683, 'grad_norm': 7.247265338897705, 'learning_rate': 8.873324708865473e-06, 'epoch': 0.2693156732891832, 'num_input_tokens_seen': 15990776, 'completed': '26.93% (244 / 906)', 'remaining time': '3:06:34', 'throughput': '1212.38', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:42:22,139 >> {'loss': 0.6992, 'grad_norm': 6.174973964691162, 'learning_rate': 8.862422200530561e-06, 'epoch': 0.2704194260485651, 'num_input_tokens_seen': 16056312, 'completed': '27.04% (245 / 906)', 'remaining time': '3:06:08', 'throughput': '1214.86', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:42:35,570 >> {'loss': 0.745, 'grad_norm': 5.999027729034424, 'learning_rate': 8.85147482234702e-06, 'epoch': 0.271523178807947, 'num_input_tokens_seen': 16121848, 'completed': '27.15% (246 / 906)', 'remaining time': '3:05:41', 'throughput': '1219.83', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:42:49,043 >> {'loss': 0.4242, 'grad_norm': 4.824187278747559, 'learning_rate': 8.840482720402159e-06, 'epoch': 0.2726269315673289, 'num_input_tokens_seen': 16187384, 'completed': '27.26% (247 / 906)', 'remaining time': '3:05:15', 'throughput': '1216.11', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:43:02,600 >> {'loss': 0.5956, 'grad_norm': 6.063555717468262, 'learning_rate': 8.829446041380099e-06, 'epoch': 0.2737306843267108, 'num_input_tokens_seen': 16252920, 'completed': '27.37% (248 / 906)', 'remaining time': '3:04:50', 'throughput': '1208.52', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:43:16,067 >> {'loss': 0.4649, 'grad_norm': 5.309985160827637, 'learning_rate': 8.818364932559822e-06, 'epoch': 0.27483443708609273, 'num_input_tokens_seen': 16318456, 'completed': '27.48% (249 / 906)', 'remaining time': '3:04:24', 'throughput': '1216.56', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:43:29,576 >> {'loss': 0.4554, 'grad_norm': 4.762182712554932, 'learning_rate': 8.807239541813204e-06, 'epoch': 0.27593818984547464, 'num_input_tokens_seen': 16383992, 'completed': '27.59% (250 / 906)', 'remaining time': '3:03:58', 'throughput': '1212.82', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:43:43,027 >> {'loss': 0.4796, 'grad_norm': 5.908174514770508, 'learning_rate': 8.796070017603037e-06, 'epoch': 0.27704194260485654, 'num_input_tokens_seen': 16449528, 'completed': '27.70% (251 / 906)', 'remaining time': '3:03:33', 'throughput': '1218.04', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:43:56,662 >> {'loss': 0.4437, 'grad_norm': 5.626619338989258, 'learning_rate': 8.784856508981062e-06, 'epoch': 0.2781456953642384, 'num_input_tokens_seen': 16515064, 'completed': '27.81% (252 / 906)', 'remaining time': '3:03:08', 'throughput': '1201.69', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:44:10,152 >> {'loss': 0.3765, 'grad_norm': 5.213916301727295, 'learning_rate': 8.773599165585957e-06, 'epoch': 0.2792494481236203, 'num_input_tokens_seen': 16580600, 'completed': '27.92% (253 / 906)', 'remaining time': '3:02:42', 'throughput': '1214.50', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:44:23,739 >> {'loss': 0.2522, 'grad_norm': 3.7691640853881836, 'learning_rate': 8.762298137641363e-06, 'epoch': 0.2803532008830022, 'num_input_tokens_seen': 16646136, 'completed': '28.04% (254 / 906)', 'remaining time': '3:02:17', 'throughput': '1205.83', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:44:37,210 >> {'loss': 0.6886, 'grad_norm': 7.3836236000061035, 'learning_rate': 8.750953575953862e-06, 'epoch': 0.2814569536423841, 'num_input_tokens_seen': 16711672, 'completed': '28.15% (255 / 906)', 'remaining time': '3:01:52', 'throughput': '1216.27', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:44:50,660 >> {'loss': 0.729, 'grad_norm': 6.279829978942871, 'learning_rate': 8.739565631910983e-06, 'epoch': 0.282560706401766, 'num_input_tokens_seen': 16777208, 'completed': '28.26% (256 / 906)', 'remaining time': '3:01:27', 'throughput': '1218.12', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:45:04,105 >> {'loss': 0.5763, 'grad_norm': 6.2540507316589355, 'learning_rate': 8.728134457479158e-06, 'epoch': 0.2836644591611479, 'num_input_tokens_seen': 16842744, 'completed': '28.37% (257 / 906)', 'remaining time': '3:01:02', 'throughput': '1218.64', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:45:17,635 >> {'loss': 0.3749, 'grad_norm': 5.088672161102295, 'learning_rate': 8.716660205201715e-06, 'epoch': 0.2847682119205298, 'num_input_tokens_seen': 16908280, 'completed': '28.48% (258 / 906)', 'remaining time': '3:00:37', 'throughput': '1210.88', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:45:31,024 >> {'loss': 0.5395, 'grad_norm': 5.244248390197754, 'learning_rate': 8.705143028196834e-06, 'epoch': 0.2858719646799117, 'num_input_tokens_seen': 16973816, 'completed': '28.59% (259 / 906)', 'remaining time': '3:00:12', 'throughput': '1223.76', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:45:44,581 >> {'loss': 0.4133, 'grad_norm': 5.609091758728027, 'learning_rate': 8.693583080155501e-06, 'epoch': 0.2869757174392936, 'num_input_tokens_seen': 17039352, 'completed': '28.70% (260 / 906)', 'remaining time': '2:59:47', 'throughput': '1208.50', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:45:58,149 >> {'loss': 0.3799, 'grad_norm': 4.83781099319458, 'learning_rate': 8.681980515339464e-06, 'epoch': 0.28807947019867547, 'num_input_tokens_seen': 17104888, 'completed': '28.81% (261 / 906)', 'remaining time': '2:59:23', 'throughput': '1207.49', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:46:11,758 >> {'loss': 0.4834, 'grad_norm': 5.5107340812683105, 'learning_rate': 8.670335488579166e-06, 'epoch': 0.2891832229580574, 'num_input_tokens_seen': 17170424, 'completed': '28.92% (262 / 906)', 'remaining time': '2:58:59', 'throughput': '1203.95', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:46:25,188 >> {'loss': 0.4059, 'grad_norm': 5.0106201171875, 'learning_rate': 8.658648155271688e-06, 'epoch': 0.2902869757174393, 'num_input_tokens_seen': 17235960, 'completed': '29.03% (263 / 906)', 'remaining time': '2:58:34', 'throughput': '1219.91', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:46:38,619 >> {'loss': 0.4187, 'grad_norm': 4.87216329574585, 'learning_rate': 8.646918671378666e-06, 'epoch': 0.2913907284768212, 'num_input_tokens_seen': 17301496, 'completed': '29.14% (264 / 906)', 'remaining time': '2:58:10', 'throughput': '1219.86', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:46:51,891 >> {'loss': 0.4873, 'grad_norm': 5.391574859619141, 'learning_rate': 8.635147193424219e-06, 'epoch': 0.2924944812362031, 'num_input_tokens_seen': 17367032, 'completed': '29.25% (265 / 906)', 'remaining time': '2:57:45', 'throughput': '1234.53', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:47:05,358 >> {'loss': 0.4802, 'grad_norm': 5.076900959014893, 'learning_rate': 8.623333878492853e-06, 'epoch': 0.293598233995585, 'num_input_tokens_seen': 17432568, 'completed': '29.36% (266 / 906)', 'remaining time': '2:57:21', 'throughput': '1216.61', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:47:18,739 >> {'loss': 0.5354, 'grad_norm': 5.86137580871582, 'learning_rate': 8.61147888422737e-06, 'epoch': 0.2947019867549669, 'num_input_tokens_seen': 17498104, 'completed': '29.47% (267 / 906)', 'remaining time': '2:56:56', 'throughput': '1224.42', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:47:32,227 >> {'loss': 0.4128, 'grad_norm': 4.704822540283203, 'learning_rate': 8.59958236882676e-06, 'epoch': 0.2958057395143488, 'num_input_tokens_seen': 17563640, 'completed': '29.58% (268 / 906)', 'remaining time': '2:56:32', 'throughput': '1214.70', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:47:45,728 >> {'loss': 0.3959, 'grad_norm': 5.304955005645752, 'learning_rate': 8.587644491044094e-06, 'epoch': 0.2969094922737307, 'num_input_tokens_seen': 17629176, 'completed': '29.69% (269 / 906)', 'remaining time': '2:56:08', 'throughput': '1213.54', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:47:59,330 >> {'loss': 0.4585, 'grad_norm': 4.6488518714904785, 'learning_rate': 8.575665410184398e-06, 'epoch': 0.2980132450331126, 'num_input_tokens_seen': 17694712, 'completed': '29.80% (270 / 906)', 'remaining time': '2:55:44', 'throughput': '1204.48', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:48:12,787 >> {'loss': 0.6202, 'grad_norm': 5.93010139465332, 'learning_rate': 8.563645286102539e-06, 'epoch': 0.29911699779249445, 'num_input_tokens_seen': 17760248, 'completed': '29.91% (271 / 906)', 'remaining time': '2:55:21', 'throughput': '1217.57', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:48:26,255 >> {'loss': 0.499, 'grad_norm': 5.770275115966797, 'learning_rate': 8.551584279201085e-06, 'epoch': 0.30022075055187636, 'num_input_tokens_seen': 17825784, 'completed': '30.02% (272 / 906)', 'remaining time': '2:54:57', 'throughput': '1216.50', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:48:39,744 >> {'loss': 0.7024, 'grad_norm': 6.36545991897583, 'learning_rate': 8.539482550428158e-06, 'epoch': 0.30132450331125826, 'num_input_tokens_seen': 17891320, 'completed': '30.13% (273 / 906)', 'remaining time': '2:54:33', 'throughput': '1214.61', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:48:53,188 >> {'loss': 0.6154, 'grad_norm': 5.636227130889893, 'learning_rate': 8.527340261275302e-06, 'epoch': 0.30242825607064017, 'num_input_tokens_seen': 17956856, 'completed': '30.24% (274 / 906)', 'remaining time': '2:54:09', 'throughput': '1218.70', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:49:06,675 >> {'loss': 0.3441, 'grad_norm': 4.823254585266113, 'learning_rate': 8.515157573775309e-06, 'epoch': 0.30353200883002207, 'num_input_tokens_seen': 18022392, 'completed': '30.35% (275 / 906)', 'remaining time': '2:53:46', 'throughput': '1214.78', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:49:20,056 >> {'loss': 0.5015, 'grad_norm': 5.449868679046631, 'learning_rate': 8.50293465050008e-06, 'epoch': 0.304635761589404, 'num_input_tokens_seen': 18087928, 'completed': '30.46% (276 / 906)', 'remaining time': '2:53:22', 'throughput': '1224.46', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:49:33,570 >> {'loss': 0.3685, 'grad_norm': 4.312854766845703, 'learning_rate': 8.490671654558427e-06, 'epoch': 0.3057395143487859, 'num_input_tokens_seen': 18153464, 'completed': '30.57% (277 / 906)', 'remaining time': '2:52:59', 'throughput': '1212.37', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:49:47,036 >> {'loss': 0.6666, 'grad_norm': 6.949029445648193, 'learning_rate': 8.478368749593925e-06, 'epoch': 0.3068432671081678, 'num_input_tokens_seen': 18219000, 'completed': '30.68% (278 / 906)', 'remaining time': '2:52:36', 'throughput': '1216.61', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:50:00,418 >> {'loss': 0.8144, 'grad_norm': 6.440571308135986, 'learning_rate': 8.466026099782708e-06, 'epoch': 0.3079470198675497, 'num_input_tokens_seen': 18284536, 'completed': '30.79% (279 / 906)', 'remaining time': '2:52:12', 'throughput': '1224.41', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:50:13,956 >> {'loss': 0.4319, 'grad_norm': 4.756875038146973, 'learning_rate': 8.453643869831289e-06, 'epoch': 0.3090507726269316, 'num_input_tokens_seen': 18350072, 'completed': '30.91% (280 / 906)', 'remaining time': '2:51:49', 'throughput': '1210.19', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:50:27,464 >> {'loss': 0.7307, 'grad_norm': 7.117915153503418, 'learning_rate': 8.441222224974353e-06, 'epoch': 0.31015452538631344, 'num_input_tokens_seen': 18415608, 'completed': '31.02% (281 / 906)', 'remaining time': '2:51:26', 'throughput': '1212.98', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:50:41,194 >> {'loss': 0.6961, 'grad_norm': 6.1996684074401855, 'learning_rate': 8.428761330972562e-06, 'epoch': 0.31125827814569534, 'num_input_tokens_seen': 18481144, 'completed': '31.13% (282 / 906)', 'remaining time': '2:51:03', 'throughput': '1193.25', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:50:54,603 >> {'loss': 0.4869, 'grad_norm': 5.32656192779541, 'learning_rate': 8.416261354110334e-06, 'epoch': 0.31236203090507725, 'num_input_tokens_seen': 18546680, 'completed': '31.24% (283 / 906)', 'remaining time': '2:50:40', 'throughput': '1221.89', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:51:07,982 >> {'loss': 0.5949, 'grad_norm': 5.852849960327148, 'learning_rate': 8.403722461193635e-06, 'epoch': 0.31346578366445915, 'num_input_tokens_seen': 18612216, 'completed': '31.35% (284 / 906)', 'remaining time': '2:50:17', 'throughput': '1224.60', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:51:21,371 >> {'loss': 0.4291, 'grad_norm': 4.754979133605957, 'learning_rate': 8.391144819547742e-06, 'epoch': 0.31456953642384106, 'num_input_tokens_seen': 18677752, 'completed': '31.46% (285 / 906)', 'remaining time': '2:49:54', 'throughput': '1223.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:51:34,820 >> {'loss': 0.542, 'grad_norm': 6.007636547088623, 'learning_rate': 8.378528597015011e-06, 'epoch': 0.31567328918322296, 'num_input_tokens_seen': 18743288, 'completed': '31.57% (286 / 906)', 'remaining time': '2:49:31', 'throughput': '1218.26', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:51:48,177 >> {'loss': 0.7025, 'grad_norm': 6.076221942901611, 'learning_rate': 8.365873961952648e-06, 'epoch': 0.31677704194260486, 'num_input_tokens_seen': 18808824, 'completed': '31.68% (287 / 906)', 'remaining time': '2:49:08', 'throughput': '1226.60', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:52:01,608 >> {'loss': 0.4121, 'grad_norm': 4.541327476501465, 'learning_rate': 8.35318108323045e-06, 'epoch': 0.31788079470198677, 'num_input_tokens_seen': 18874360, 'completed': '31.79% (288 / 906)', 'remaining time': '2:48:46', 'throughput': '1219.89', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:52:15,085 >> {'loss': 0.3526, 'grad_norm': 4.576253414154053, 'learning_rate': 8.340450130228558e-06, 'epoch': 0.3189845474613687, 'num_input_tokens_seen': 18939896, 'completed': '31.90% (289 / 906)', 'remaining time': '2:48:23', 'throughput': '1215.67', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:52:28,569 >> {'loss': 0.5399, 'grad_norm': 5.6783976554870605, 'learning_rate': 8.327681272835197e-06, 'epoch': 0.3200883002207506, 'num_input_tokens_seen': 19005432, 'completed': '32.01% (290 / 906)', 'remaining time': '2:48:00', 'throughput': '1215.09', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:52:42,060 >> {'loss': 0.4894, 'grad_norm': 5.336645603179932, 'learning_rate': 8.314874681444404e-06, 'epoch': 0.3211920529801324, 'num_input_tokens_seen': 19070968, 'completed': '32.12% (291 / 906)', 'remaining time': '2:47:38', 'throughput': '1214.47', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:52:55,681 >> {'loss': 0.4226, 'grad_norm': 5.151770114898682, 'learning_rate': 8.30203052695376e-06, 'epoch': 0.32229580573951433, 'num_input_tokens_seen': 19136504, 'completed': '32.23% (292 / 906)', 'remaining time': '2:47:16', 'throughput': '1202.85', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 21:53:09,268 >> {'loss': 0.5135, 'grad_norm': 5.253474712371826, 'learning_rate': 8.289148980762105e-06, 'epoch': 0.32339955849889623, 'num_input_tokens_seen': 19202040, 'completed': '32.34% (293 / 906)', 'remaining time': '2:46:54', 'throughput': '1205.81', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:53:22,711 >> {'loss': 0.6277, 'grad_norm': 5.810973644256592, 'learning_rate': 8.276230214767254e-06, 'epoch': 0.32450331125827814, 'num_input_tokens_seen': 19267576, 'completed': '32.45% (294 / 906)', 'remaining time': '2:46:31', 'throughput': '1218.81', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:53:36,094 >> {'loss': 0.6719, 'grad_norm': 5.979196548461914, 'learning_rate': 8.263274401363704e-06, 'epoch': 0.32560706401766004, 'num_input_tokens_seen': 19333112, 'completed': '32.56% (295 / 906)', 'remaining time': '2:46:09', 'throughput': '1224.17', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:53:49,487 >> {'loss': 0.4234, 'grad_norm': 4.947514533996582, 'learning_rate': 8.250281713440323e-06, 'epoch': 0.32671081677704195, 'num_input_tokens_seen': 19398648, 'completed': '32.67% (296 / 906)', 'remaining time': '2:45:47', 'throughput': '1223.36', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:54:02,979 >> {'loss': 0.335, 'grad_norm': 4.194842338562012, 'learning_rate': 8.237252324378059e-06, 'epoch': 0.32781456953642385, 'num_input_tokens_seen': 19464184, 'completed': '32.78% (297 / 906)', 'remaining time': '2:45:25', 'throughput': '1214.37', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:54:16,448 >> {'loss': 0.3597, 'grad_norm': 4.941613674163818, 'learning_rate': 8.224186408047616e-06, 'epoch': 0.32891832229580575, 'num_input_tokens_seen': 19529720, 'completed': '32.89% (298 / 906)', 'remaining time': '2:45:02', 'throughput': '1216.43', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:54:29,756 >> {'loss': 0.6489, 'grad_norm': 5.901616096496582, 'learning_rate': 8.211084138807138e-06, 'epoch': 0.33002207505518766, 'num_input_tokens_seen': 19595256, 'completed': '33.00% (299 / 906)', 'remaining time': '2:44:40', 'throughput': '1231.11', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 21:54:43,240 >> {'loss': 0.4378, 'grad_norm': 4.655982971191406, 'learning_rate': 8.197945691499876e-06, 'epoch': 0.33112582781456956, 'num_input_tokens_seen': 19660792, 'completed': '33.11% (300 / 906)', 'remaining time': '2:44:18', 'throughput': '1215.10', 'gpu_mem_free': '30139MB'}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[INFO|trainer.py:3503] 2025-01-04 21:55:08,828 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-300
[INFO|configuration_utils.py:472] 2025-01-04 21:55:08,830 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-300/config.json
[INFO|configuration_utils.py:807] 2025-01-04 21:55:08,831 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-300/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-04 21:56:05,517 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-300/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-04 21:56:05,520 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-300/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-04 21:56:05,521 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-300/special_tokens_map.json
[WARNING|trainer.py:869] 2025-01-04 21:59:49,558 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 1200, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-04 22:00:03,712 >> {'loss': 0.4636, 'grad_norm': 4.98732852935791, 'learning_rate': 8.184771241451862e-06, 'epoch': 0.3322295805739514, 'num_input_tokens_seen': 19726328, 'completed': '33.22% (301 / 906)', 'remaining time': '2:54:13', 'throughput': '51.12', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:00:17,453 >> {'loss': 0.6995, 'grad_norm': 5.869397163391113, 'learning_rate': 8.17156096446957e-06, 'epoch': 0.3333333333333333, 'num_input_tokens_seen': 19791864, 'completed': '33.33% (302 / 906)', 'remaining time': '2:53:49', 'throughput': '1192.37', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:00:31,105 >> {'loss': 0.4852, 'grad_norm': 4.8156418800354, 'learning_rate': 8.158315036837557e-06, 'epoch': 0.3344370860927152, 'num_input_tokens_seen': 19857400, 'completed': '33.44% (303 / 906)', 'remaining time': '2:53:25', 'throughput': '1200.11', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:00:44,629 >> {'loss': 0.9684, 'grad_norm': 7.32762336730957, 'learning_rate': 8.14503363531613e-06, 'epoch': 0.3355408388520971, 'num_input_tokens_seen': 19922936, 'completed': '33.55% (304 / 906)', 'remaining time': '2:53:00', 'throughput': '1211.50', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:00:58,300 >> {'loss': 0.6892, 'grad_norm': 5.988272190093994, 'learning_rate': 8.131716937138973e-06, 'epoch': 0.336644591611479, 'num_input_tokens_seen': 19988472, 'completed': '33.66% (305 / 906)', 'remaining time': '2:52:36', 'throughput': '1198.43', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:01:12,080 >> {'loss': 0.4995, 'grad_norm': 5.299745559692383, 'learning_rate': 8.11836512001079e-06, 'epoch': 0.33774834437086093, 'num_input_tokens_seen': 20054008, 'completed': '33.77% (306 / 906)', 'remaining time': '2:52:12', 'throughput': '1188.99', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:01:25,836 >> {'loss': 0.5387, 'grad_norm': 5.244733810424805, 'learning_rate': 8.10497836210492e-06, 'epoch': 0.33885209713024284, 'num_input_tokens_seen': 20119544, 'completed': '33.89% (307 / 906)', 'remaining time': '2:51:48', 'throughput': '1191.00', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:01:39,518 >> {'loss': 0.3757, 'grad_norm': 4.309633255004883, 'learning_rate': 8.091556842060981e-06, 'epoch': 0.33995584988962474, 'num_input_tokens_seen': 20185080, 'completed': '34.00% (308 / 906)', 'remaining time': '2:51:24', 'throughput': '1197.54', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:01:53,441 >> {'loss': 0.3871, 'grad_norm': 4.323147296905518, 'learning_rate': 8.07810073898247e-06, 'epoch': 0.34105960264900664, 'num_input_tokens_seen': 20250616, 'completed': '34.11% (309 / 906)', 'remaining time': '2:51:00', 'throughput': '1176.68', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:02:07,181 >> {'loss': 0.687, 'grad_norm': 5.932765007019043, 'learning_rate': 8.064610232434375e-06, 'epoch': 0.34216335540838855, 'num_input_tokens_seen': 20316152, 'completed': '34.22% (310 / 906)', 'remaining time': '2:50:36', 'throughput': '1192.51', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:02:20,801 >> {'loss': 0.6566, 'grad_norm': 5.740137100219727, 'learning_rate': 8.051085502440782e-06, 'epoch': 0.3432671081677704, 'num_input_tokens_seen': 20381688, 'completed': '34.33% (311 / 906)', 'remaining time': '2:50:12', 'throughput': '1202.86', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:02:34,508 >> {'loss': 0.6846, 'grad_norm': 5.602391719818115, 'learning_rate': 8.037526729482474e-06, 'epoch': 0.3443708609271523, 'num_input_tokens_seen': 20447224, 'completed': '34.44% (312 / 906)', 'remaining time': '2:49:49', 'throughput': '1195.37', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:02:48,260 >> {'loss': 0.5781, 'grad_norm': 6.104027271270752, 'learning_rate': 8.02393409449452e-06, 'epoch': 0.3454746136865342, 'num_input_tokens_seen': 20512760, 'completed': '34.55% (313 / 906)', 'remaining time': '2:49:25', 'throughput': '1191.31', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:03:01,996 >> {'loss': 0.2769, 'grad_norm': 3.9661595821380615, 'learning_rate': 8.010307778863859e-06, 'epoch': 0.3465783664459161, 'num_input_tokens_seen': 20578296, 'completed': '34.66% (314 / 906)', 'remaining time': '2:49:01', 'throughput': '1192.83', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:03:15,708 >> {'loss': 0.311, 'grad_norm': 4.286974906921387, 'learning_rate': 7.996647964426883e-06, 'epoch': 0.347682119205298, 'num_input_tokens_seen': 20643832, 'completed': '34.77% (315 / 906)', 'remaining time': '2:48:38', 'throughput': '1194.90', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:03:29,340 >> {'loss': 0.6516, 'grad_norm': 5.434468746185303, 'learning_rate': 7.982954833467007e-06, 'epoch': 0.3487858719646799, 'num_input_tokens_seen': 20709368, 'completed': '34.88% (316 / 906)', 'remaining time': '2:48:14', 'throughput': '1201.83', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:03:43,120 >> {'loss': 0.353, 'grad_norm': 4.0702223777771, 'learning_rate': 7.969228568712242e-06, 'epoch': 0.3498896247240618, 'num_input_tokens_seen': 20774904, 'completed': '34.99% (317 / 906)', 'remaining time': '2:47:51', 'throughput': '1188.96', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:03:56,774 >> {'loss': 0.891, 'grad_norm': 6.999876499176025, 'learning_rate': 7.95546935333275e-06, 'epoch': 0.3509933774834437, 'num_input_tokens_seen': 20840440, 'completed': '35.10% (318 / 906)', 'remaining time': '2:47:27', 'throughput': '1199.92', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:04:10,545 >> {'loss': 0.2951, 'grad_norm': 3.8806591033935547, 'learning_rate': 7.941677370938404e-06, 'epoch': 0.35209713024282563, 'num_input_tokens_seen': 20905976, 'completed': '35.21% (319 / 906)', 'remaining time': '2:47:04', 'throughput': '1189.78', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:04:24,269 >> {'loss': 0.7898, 'grad_norm': 6.465142726898193, 'learning_rate': 7.927852805576334e-06, 'epoch': 0.35320088300220753, 'num_input_tokens_seen': 20971512, 'completed': '35.32% (320 / 906)', 'remaining time': '2:46:41', 'throughput': '1193.83', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:04:38,058 >> {'loss': 0.4284, 'grad_norm': 5.034674167633057, 'learning_rate': 7.913995841728477e-06, 'epoch': 0.3543046357615894, 'num_input_tokens_seen': 21037048, 'completed': '35.43% (321 / 906)', 'remaining time': '2:46:18', 'throughput': '1188.19', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:04:51,571 >> {'loss': 0.7808, 'grad_norm': 6.744060039520264, 'learning_rate': 7.90010666430911e-06, 'epoch': 0.3554083885209713, 'num_input_tokens_seen': 21102584, 'completed': '35.54% (322 / 906)', 'remaining time': '2:45:55', 'throughput': '1212.46', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:05:05,357 >> {'loss': 0.2769, 'grad_norm': 4.225239276885986, 'learning_rate': 7.886185458662383e-06, 'epoch': 0.3565121412803532, 'num_input_tokens_seen': 21168120, 'completed': '35.65% (323 / 906)', 'remaining time': '2:45:32', 'throughput': '1188.46', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:05:19,214 >> {'loss': 0.3404, 'grad_norm': 4.372021675109863, 'learning_rate': 7.872232410559848e-06, 'epoch': 0.3576158940397351, 'num_input_tokens_seen': 21233656, 'completed': '35.76% (324 / 906)', 'remaining time': '2:45:09', 'throughput': '1182.39', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:05:33,071 >> {'loss': 0.4023, 'grad_norm': 5.2710185050964355, 'learning_rate': 7.85824770619798e-06, 'epoch': 0.358719646799117, 'num_input_tokens_seen': 21299192, 'completed': '35.87% (325 / 906)', 'remaining time': '2:44:46', 'throughput': '1182.35', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:05:46,942 >> {'loss': 0.2934, 'grad_norm': 4.083426475524902, 'learning_rate': 7.844231532195686e-06, 'epoch': 0.3598233995584989, 'num_input_tokens_seen': 21364728, 'completed': '35.98% (326 / 906)', 'remaining time': '2:44:24', 'throughput': '1181.15', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:06:00,647 >> {'loss': 0.4998, 'grad_norm': 5.436210632324219, 'learning_rate': 7.830184075591829e-06, 'epoch': 0.3609271523178808, 'num_input_tokens_seen': 21430264, 'completed': '36.09% (327 / 906)', 'remaining time': '2:44:01', 'throughput': '1195.44', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:06:14,277 >> {'loss': 0.5121, 'grad_norm': 5.057102203369141, 'learning_rate': 7.816105523842712e-06, 'epoch': 0.3620309050772627, 'num_input_tokens_seen': 21495800, 'completed': '36.20% (328 / 906)', 'remaining time': '2:43:38', 'throughput': '1202.12', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:06:27,924 >> {'loss': 0.5667, 'grad_norm': 6.161040306091309, 'learning_rate': 7.801996064819594e-06, 'epoch': 0.3631346578366446, 'num_input_tokens_seen': 21561336, 'completed': '36.31% (329 / 906)', 'remaining time': '2:43:15', 'throughput': '1200.53', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:06:41,611 >> {'loss': 0.414, 'grad_norm': 4.909401893615723, 'learning_rate': 7.787855886806174e-06, 'epoch': 0.36423841059602646, 'num_input_tokens_seen': 21626872, 'completed': '36.42% (330 / 906)', 'remaining time': '2:42:52', 'throughput': '1197.05', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:06:55,378 >> {'loss': 0.6352, 'grad_norm': 6.199487209320068, 'learning_rate': 7.773685178496084e-06, 'epoch': 0.36534216335540837, 'num_input_tokens_seen': 21692408, 'completed': '36.53% (331 / 906)', 'remaining time': '2:42:30', 'throughput': '1190.04', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:07:09,076 >> {'loss': 0.5171, 'grad_norm': 5.417155742645264, 'learning_rate': 7.759484128990359e-06, 'epoch': 0.36644591611479027, 'num_input_tokens_seen': 21757944, 'completed': '36.64% (332 / 906)', 'remaining time': '2:42:07', 'throughput': '1196.11', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:07:22,701 >> {'loss': 0.5954, 'grad_norm': 5.248231410980225, 'learning_rate': 7.745252927794929e-06, 'epoch': 0.3675496688741722, 'num_input_tokens_seen': 21823480, 'completed': '36.75% (333 / 906)', 'remaining time': '2:41:44', 'throughput': '1202.49', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:07:36,210 >> {'loss': 0.4274, 'grad_norm': 4.734684944152832, 'learning_rate': 7.730991764818083e-06, 'epoch': 0.3686534216335541, 'num_input_tokens_seen': 21889016, 'completed': '36.87% (334 / 906)', 'remaining time': '2:41:22', 'throughput': '1212.82', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:07:49,718 >> {'loss': 0.6043, 'grad_norm': 5.899268627166748, 'learning_rate': 7.716700830367937e-06, 'epoch': 0.369757174392936, 'num_input_tokens_seen': 21954552, 'completed': '36.98% (335 / 906)', 'remaining time': '2:40:59', 'throughput': '1212.94', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:08:03,312 >> {'loss': 0.5565, 'grad_norm': 5.965540885925293, 'learning_rate': 7.702380315149885e-06, 'epoch': 0.3708609271523179, 'num_input_tokens_seen': 22020088, 'completed': '37.09% (336 / 906)', 'remaining time': '2:40:36', 'throughput': '1205.20', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:08:16,881 >> {'loss': 0.2761, 'grad_norm': 3.7075252532958984, 'learning_rate': 7.68803041026407e-06, 'epoch': 0.3719646799116998, 'num_input_tokens_seen': 22085624, 'completed': '37.20% (337 / 906)', 'remaining time': '2:40:14', 'throughput': '1207.49', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:08:30,533 >> {'loss': 0.4702, 'grad_norm': 5.284467697143555, 'learning_rate': 7.673651307202816e-06, 'epoch': 0.3730684326710817, 'num_input_tokens_seen': 22151160, 'completed': '37.31% (338 / 906)', 'remaining time': '2:39:51', 'throughput': '1200.10', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:08:44,168 >> {'loss': 0.3558, 'grad_norm': 4.557376861572266, 'learning_rate': 7.659243197848091e-06, 'epoch': 0.3741721854304636, 'num_input_tokens_seen': 22216696, 'completed': '37.42% (339 / 906)', 'remaining time': '2:39:29', 'throughput': '1201.64', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:08:57,764 >> {'loss': 0.5721, 'grad_norm': 5.523560523986816, 'learning_rate': 7.644806274468936e-06, 'epoch': 0.37527593818984545, 'num_input_tokens_seen': 22282232, 'completed': '37.53% (340 / 906)', 'remaining time': '2:39:07', 'throughput': '1205.41', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:09:11,422 >> {'loss': 0.6207, 'grad_norm': 5.48103141784668, 'learning_rate': 7.630340729718896e-06, 'epoch': 0.37637969094922735, 'num_input_tokens_seen': 22347768, 'completed': '37.64% (341 / 906)', 'remaining time': '2:38:45', 'throughput': '1199.20', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:09:24,955 >> {'loss': 0.3645, 'grad_norm': 4.440031051635742, 'learning_rate': 7.6158467566334584e-06, 'epoch': 0.37748344370860926, 'num_input_tokens_seen': 22413304, 'completed': '37.75% (342 / 906)', 'remaining time': '2:38:22', 'throughput': '1210.72', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:09:38,453 >> {'loss': 0.4351, 'grad_norm': 4.76084566116333, 'learning_rate': 7.6013245486274685e-06, 'epoch': 0.37858719646799116, 'num_input_tokens_seen': 22478840, 'completed': '37.86% (343 / 906)', 'remaining time': '2:38:00', 'throughput': '1213.81', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:09:51,733 >> {'loss': 0.6918, 'grad_norm': 6.258664131164551, 'learning_rate': 7.58677429949256e-06, 'epoch': 0.37969094922737306, 'num_input_tokens_seen': 22544376, 'completed': '37.97% (344 / 906)', 'remaining time': '2:37:37', 'throughput': '1233.70', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:10:05,197 >> {'loss': 0.4623, 'grad_norm': 4.8109660148620605, 'learning_rate': 7.572196203394553e-06, 'epoch': 0.38079470198675497, 'num_input_tokens_seen': 22609912, 'completed': '38.08% (345 / 906)', 'remaining time': '2:37:15', 'throughput': '1216.87', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:10:18,624 >> {'loss': 0.3383, 'grad_norm': 4.017666339874268, 'learning_rate': 7.557590454870874e-06, 'epoch': 0.3818984547461369, 'num_input_tokens_seen': 22675448, 'completed': '38.19% (346 / 906)', 'remaining time': '2:36:53', 'throughput': '1220.25', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:10:32,096 >> {'loss': 0.3711, 'grad_norm': 4.424713134765625, 'learning_rate': 7.5429572488279615e-06, 'epoch': 0.3830022075055188, 'num_input_tokens_seen': 22740984, 'completed': '38.30% (347 / 906)', 'remaining time': '2:36:30', 'throughput': '1216.16', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:10:45,355 >> {'loss': 0.7094, 'grad_norm': 6.114891529083252, 'learning_rate': 7.5282967805386555e-06, 'epoch': 0.3841059602649007, 'num_input_tokens_seen': 22806520, 'completed': '38.41% (348 / 906)', 'remaining time': '2:36:08', 'throughput': '1235.64', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:10:58,874 >> {'loss': 0.3381, 'grad_norm': 4.288976669311523, 'learning_rate': 7.5136092456396e-06, 'epoch': 0.3852097130242826, 'num_input_tokens_seen': 22872056, 'completed': '38.52% (349 / 906)', 'remaining time': '2:35:46', 'throughput': '1212.00', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:11:12,418 >> {'loss': 0.2501, 'grad_norm': 3.8139936923980713, 'learning_rate': 7.498894840128632e-06, 'epoch': 0.38631346578366443, 'num_input_tokens_seen': 22937592, 'completed': '38.63% (350 / 906)', 'remaining time': '2:35:24', 'throughput': '1209.66', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:11:25,801 >> {'loss': 0.4111, 'grad_norm': 4.9862060546875, 'learning_rate': 7.484153760362155e-06, 'epoch': 0.38741721854304634, 'num_input_tokens_seen': 23003128, 'completed': '38.74% (351 / 906)', 'remaining time': '2:35:02', 'throughput': '1224.20', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:11:39,148 >> {'loss': 0.8444, 'grad_norm': 6.541782379150391, 'learning_rate': 7.4693862030525356e-06, 'epoch': 0.38852097130242824, 'num_input_tokens_seen': 23068664, 'completed': '38.85% (352 / 906)', 'remaining time': '2:34:40', 'throughput': '1227.59', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:11:52,604 >> {'loss': 0.3405, 'grad_norm': 4.731116771697998, 'learning_rate': 7.454592365265464e-06, 'epoch': 0.38962472406181015, 'num_input_tokens_seen': 23134200, 'completed': '38.96% (353 / 906)', 'remaining time': '2:34:18', 'throughput': '1217.59', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:12:06,053 >> {'loss': 0.6451, 'grad_norm': 5.749923229217529, 'learning_rate': 7.439772444417337e-06, 'epoch': 0.39072847682119205, 'num_input_tokens_seen': 23199736, 'completed': '39.07% (354 / 906)', 'remaining time': '2:33:56', 'throughput': '1218.20', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:12:19,442 >> {'loss': 0.5034, 'grad_norm': 5.7721099853515625, 'learning_rate': 7.424926638272609e-06, 'epoch': 0.39183222958057395, 'num_input_tokens_seen': 23265272, 'completed': '39.18% (355 / 906)', 'remaining time': '2:33:34', 'throughput': '1223.68', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:12:32,786 >> {'loss': 0.3973, 'grad_norm': 4.80233907699585, 'learning_rate': 7.410055144941168e-06, 'epoch': 0.39293598233995586, 'num_input_tokens_seen': 23330808, 'completed': '39.29% (356 / 906)', 'remaining time': '2:33:12', 'throughput': '1227.80', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:12:45,998 >> {'loss': 0.5132, 'grad_norm': 5.965678691864014, 'learning_rate': 7.395158162875681e-06, 'epoch': 0.39403973509933776, 'num_input_tokens_seen': 23396344, 'completed': '39.40% (357 / 906)', 'remaining time': '2:32:50', 'throughput': '1240.17', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:12:59,287 >> {'loss': 0.6651, 'grad_norm': 5.89133882522583, 'learning_rate': 7.380235890868946e-06, 'epoch': 0.39514348785871967, 'num_input_tokens_seen': 23461880, 'completed': '39.51% (358 / 906)', 'remaining time': '2:32:28', 'throughput': '1232.85', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:13:12,529 >> {'loss': 0.4079, 'grad_norm': 4.942079067230225, 'learning_rate': 7.365288528051251e-06, 'epoch': 0.39624724061810157, 'num_input_tokens_seen': 23527412, 'completed': '39.62% (359 / 906)', 'remaining time': '2:32:06', 'throughput': '1237.24', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:13:25,970 >> {'loss': 0.3752, 'grad_norm': 4.461698055267334, 'learning_rate': 7.350316273887702e-06, 'epoch': 0.3973509933774834, 'num_input_tokens_seen': 23592948, 'completed': '39.74% (360 / 906)', 'remaining time': '2:31:44', 'throughput': '1218.90', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:13:39,394 >> {'loss': 0.6089, 'grad_norm': 5.262115001678467, 'learning_rate': 7.335319328175571e-06, 'epoch': 0.3984547461368653, 'num_input_tokens_seen': 23658484, 'completed': '39.85% (361 / 906)', 'remaining time': '2:31:23', 'throughput': '1220.52', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:13:52,636 >> {'loss': 0.9327, 'grad_norm': 6.512784957885742, 'learning_rate': 7.3202978910416225e-06, 'epoch': 0.3995584988962472, 'num_input_tokens_seen': 23724020, 'completed': '39.96% (362 / 906)', 'remaining time': '2:31:01', 'throughput': '1237.31', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:14:06,023 >> {'loss': 0.2887, 'grad_norm': 3.6190109252929688, 'learning_rate': 7.305252162939451e-06, 'epoch': 0.40066225165562913, 'num_input_tokens_seen': 23789556, 'completed': '40.07% (363 / 906)', 'remaining time': '2:30:40', 'throughput': '1223.88', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:14:19,463 >> {'loss': 0.3774, 'grad_norm': 4.493287086486816, 'learning_rate': 7.290182344646799e-06, 'epoch': 0.40176600441501104, 'num_input_tokens_seen': 23855092, 'completed': '40.18% (364 / 906)', 'remaining time': '2:30:18', 'throughput': '1219.03', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:14:32,726 >> {'loss': 1.0283, 'grad_norm': 6.982156276702881, 'learning_rate': 7.275088637262881e-06, 'epoch': 0.40286975717439294, 'num_input_tokens_seen': 23920628, 'completed': '40.29% (365 / 906)', 'remaining time': '2:29:56', 'throughput': '1235.28', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:14:46,160 >> {'loss': 0.6455, 'grad_norm': 5.538138389587402, 'learning_rate': 7.259971242205702e-06, 'epoch': 0.40397350993377484, 'num_input_tokens_seen': 23986164, 'completed': '40.40% (366 / 906)', 'remaining time': '2:29:35', 'throughput': '1219.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:14:59,537 >> {'loss': 0.2485, 'grad_norm': 3.606851100921631, 'learning_rate': 7.244830361209366e-06, 'epoch': 0.40507726269315675, 'num_input_tokens_seen': 24051700, 'completed': '40.51% (367 / 906)', 'remaining time': '2:29:14', 'throughput': '1224.78', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:15:12,846 >> {'loss': 0.6441, 'grad_norm': 5.6115851402282715, 'learning_rate': 7.229666196321383e-06, 'epoch': 0.40618101545253865, 'num_input_tokens_seen': 24117236, 'completed': '40.62% (368 / 906)', 'remaining time': '2:28:52', 'throughput': '1231.05', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:15:26,340 >> {'loss': 0.3399, 'grad_norm': 4.510439395904541, 'learning_rate': 7.214478949899976e-06, 'epoch': 0.40728476821192056, 'num_input_tokens_seen': 24182772, 'completed': '40.73% (369 / 906)', 'remaining time': '2:28:31', 'throughput': '1214.11', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:15:39,817 >> {'loss': 0.4898, 'grad_norm': 5.339845657348633, 'learning_rate': 7.199268824611382e-06, 'epoch': 0.4083885209713024, 'num_input_tokens_seen': 24248308, 'completed': '40.84% (370 / 906)', 'remaining time': '2:28:10', 'throughput': '1215.77', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:15:53,235 >> {'loss': 0.2596, 'grad_norm': 3.8770904541015625, 'learning_rate': 7.18403602342714e-06, 'epoch': 0.4094922737306843, 'num_input_tokens_seen': 24313844, 'completed': '40.95% (371 / 906)', 'remaining time': '2:27:49', 'throughput': '1221.02', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:16:06,512 >> {'loss': 0.2945, 'grad_norm': 3.9396862983703613, 'learning_rate': 7.168780749621394e-06, 'epoch': 0.4105960264900662, 'num_input_tokens_seen': 24379380, 'completed': '41.06% (372 / 906)', 'remaining time': '2:27:28', 'throughput': '1234.02', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:16:19,737 >> {'loss': 0.2875, 'grad_norm': 4.604616165161133, 'learning_rate': 7.1535032067681684e-06, 'epoch': 0.4116997792494481, 'num_input_tokens_seen': 24444916, 'completed': '41.17% (373 / 906)', 'remaining time': '2:27:06', 'throughput': '1238.83', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:16:32,914 >> {'loss': 0.2887, 'grad_norm': 4.745989799499512, 'learning_rate': 7.138203598738659e-06, 'epoch': 0.41280353200883, 'num_input_tokens_seen': 24510448, 'completed': '41.28% (374 / 906)', 'remaining time': '2:26:45', 'throughput': '1243.29', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:16:45,998 >> {'loss': 0.4992, 'grad_norm': 5.695927619934082, 'learning_rate': 7.122882129698514e-06, 'epoch': 0.4139072847682119, 'num_input_tokens_seen': 24575984, 'completed': '41.39% (375 / 906)', 'remaining time': '2:26:23', 'throughput': '1252.24', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:16:59,298 >> {'loss': 0.322, 'grad_norm': 4.267956256866455, 'learning_rate': 7.107539004105097e-06, 'epoch': 0.41501103752759383, 'num_input_tokens_seen': 24641520, 'completed': '41.50% (376 / 906)', 'remaining time': '2:26:02', 'throughput': '1231.87', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:17:12,618 >> {'loss': 0.5286, 'grad_norm': 6.005370140075684, 'learning_rate': 7.092174426704779e-06, 'epoch': 0.41611479028697573, 'num_input_tokens_seen': 24707056, 'completed': '41.61% (377 / 906)', 'remaining time': '2:25:41', 'throughput': '1230.09', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:17:25,839 >> {'loss': 0.4565, 'grad_norm': 4.701905727386475, 'learning_rate': 7.076788602530182e-06, 'epoch': 0.41721854304635764, 'num_input_tokens_seen': 24772592, 'completed': '41.72% (378 / 906)', 'remaining time': '2:25:20', 'throughput': '1239.19', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:17:39,114 >> {'loss': 0.3361, 'grad_norm': 5.479290962219238, 'learning_rate': 7.061381736897468e-06, 'epoch': 0.41832229580573954, 'num_input_tokens_seen': 24838128, 'completed': '41.83% (379 / 906)', 'remaining time': '2:24:59', 'throughput': '1234.25', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:17:52,277 >> {'loss': 0.3394, 'grad_norm': 4.56320333480835, 'learning_rate': 7.0459540354035775e-06, 'epoch': 0.4194260485651214, 'num_input_tokens_seen': 24903664, 'completed': '41.94% (380 / 906)', 'remaining time': '2:24:38', 'throughput': '1244.69', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:18:05,483 >> {'loss': 0.5699, 'grad_norm': 5.685725212097168, 'learning_rate': 7.0305057039235e-06, 'epoch': 0.4205298013245033, 'num_input_tokens_seen': 24969200, 'completed': '42.05% (381 / 906)', 'remaining time': '2:24:17', 'throughput': '1240.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:18:18,485 >> {'loss': 0.8107, 'grad_norm': 6.919594764709473, 'learning_rate': 7.015036948607519e-06, 'epoch': 0.4216335540838852, 'num_input_tokens_seen': 25034736, 'completed': '42.16% (382 / 906)', 'remaining time': '2:23:56', 'throughput': '1260.08', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:18:31,606 >> {'loss': 0.5774, 'grad_norm': 5.660834312438965, 'learning_rate': 6.999547975878467e-06, 'epoch': 0.4227373068432671, 'num_input_tokens_seen': 25100272, 'completed': '42.27% (383 / 906)', 'remaining time': '2:23:35', 'throughput': '1248.68', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:18:44,900 >> {'loss': 0.3895, 'grad_norm': 4.428837776184082, 'learning_rate': 6.984038992428967e-06, 'epoch': 0.423841059602649, 'num_input_tokens_seen': 25165808, 'completed': '42.38% (384 / 906)', 'remaining time': '2:23:14', 'throughput': '1232.43', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:18:58,095 >> {'loss': 0.8248, 'grad_norm': 6.41189432144165, 'learning_rate': 6.968510205218671e-06, 'epoch': 0.4249448123620309, 'num_input_tokens_seen': 25231344, 'completed': '42.49% (385 / 906)', 'remaining time': '2:22:53', 'throughput': '1241.72', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:19:11,416 >> {'loss': 0.4435, 'grad_norm': 4.905003547668457, 'learning_rate': 6.952961821471509e-06, 'epoch': 0.4260485651214128, 'num_input_tokens_seen': 25296880, 'completed': '42.60% (386 / 906)', 'remaining time': '2:22:32', 'throughput': '1229.95', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:19:24,949 >> {'loss': 0.3755, 'grad_norm': 4.218749046325684, 'learning_rate': 6.937394048672912e-06, 'epoch': 0.4271523178807947, 'num_input_tokens_seen': 25362416, 'completed': '42.72% (387 / 906)', 'remaining time': '2:22:12', 'throughput': '1210.64', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:19:38,429 >> {'loss': 0.6033, 'grad_norm': 5.433783054351807, 'learning_rate': 6.921807094567051e-06, 'epoch': 0.4282560706401766, 'num_input_tokens_seen': 25427952, 'completed': '42.83% (388 / 906)', 'remaining time': '2:21:51', 'throughput': '1215.47', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:19:51,727 >> {'loss': 0.4313, 'grad_norm': 5.018522262573242, 'learning_rate': 6.906201167154061e-06, 'epoch': 0.42935982339955847, 'num_input_tokens_seen': 25493488, 'completed': '42.94% (389 / 906)', 'remaining time': '2:21:31', 'throughput': '1232.00', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:20:05,134 >> {'loss': 0.5557, 'grad_norm': 5.442930698394775, 'learning_rate': 6.890576474687264e-06, 'epoch': 0.4304635761589404, 'num_input_tokens_seen': 25559024, 'completed': '43.05% (390 / 906)', 'remaining time': '2:21:10', 'throughput': '1222.06', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:20:18,454 >> {'loss': 0.7915, 'grad_norm': 6.9419145584106445, 'learning_rate': 6.8749332256703975e-06, 'epoch': 0.4315673289183223, 'num_input_tokens_seen': 25624560, 'completed': '43.16% (391 / 906)', 'remaining time': '2:20:50', 'throughput': '1230.06', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:20:31,834 >> {'loss': 0.4263, 'grad_norm': 5.017683982849121, 'learning_rate': 6.85927162885482e-06, 'epoch': 0.4326710816777042, 'num_input_tokens_seen': 25690096, 'completed': '43.27% (392 / 906)', 'remaining time': '2:20:30', 'throughput': '1224.51', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:20:45,257 >> {'loss': 0.5095, 'grad_norm': 5.351146221160889, 'learning_rate': 6.843591893236742e-06, 'epoch': 0.4337748344370861, 'num_input_tokens_seen': 25755632, 'completed': '43.38% (393 / 906)', 'remaining time': '2:20:09', 'throughput': '1220.58', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:20:58,575 >> {'loss': 0.4626, 'grad_norm': 4.434185028076172, 'learning_rate': 6.827894228054416e-06, 'epoch': 0.434878587196468, 'num_input_tokens_seen': 25821168, 'completed': '43.49% (394 / 906)', 'remaining time': '2:19:49', 'throughput': '1230.26', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:21:12,011 >> {'loss': 0.3621, 'grad_norm': 4.385143280029297, 'learning_rate': 6.812178842785364e-06, 'epoch': 0.4359823399558499, 'num_input_tokens_seen': 25886704, 'completed': '43.60% (395 / 906)', 'remaining time': '2:19:29', 'throughput': '1219.39', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:21:25,409 >> {'loss': 0.3593, 'grad_norm': 4.1273112297058105, 'learning_rate': 6.796445947143571e-06, 'epoch': 0.4370860927152318, 'num_input_tokens_seen': 25952240, 'completed': '43.71% (396 / 906)', 'remaining time': '2:19:08', 'throughput': '1222.89', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:21:38,719 >> {'loss': 0.9328, 'grad_norm': 6.839285373687744, 'learning_rate': 6.780695751076685e-06, 'epoch': 0.4381898454746137, 'num_input_tokens_seen': 26017776, 'completed': '43.82% (397 / 906)', 'remaining time': '2:18:48', 'throughput': '1230.90', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:21:52,102 >> {'loss': 0.283, 'grad_norm': 3.5457375049591064, 'learning_rate': 6.7649284647632285e-06, 'epoch': 0.4392935982339956, 'num_input_tokens_seen': 26083312, 'completed': '43.93% (398 / 906)', 'remaining time': '2:18:28', 'throughput': '1224.23', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:22:05,523 >> {'loss': 0.2656, 'grad_norm': 3.7174270153045654, 'learning_rate': 6.749144298609776e-06, 'epoch': 0.44039735099337746, 'num_input_tokens_seen': 26148848, 'completed': '44.04% (399 / 906)', 'remaining time': '2:18:08', 'throughput': '1220.84', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:22:18,904 >> {'loss': 0.4559, 'grad_norm': 5.119399547576904, 'learning_rate': 6.733343463248163e-06, 'epoch': 0.44150110375275936, 'num_input_tokens_seen': 26214384, 'completed': '44.15% (400 / 906)', 'remaining time': '2:17:48', 'throughput': '1224.42', 'gpu_mem_free': '30139MB'}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[INFO|trainer.py:3503] 2025-01-04 22:22:44,536 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-400
[INFO|configuration_utils.py:472] 2025-01-04 22:22:44,539 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-400/config.json
[INFO|configuration_utils.py:807] 2025-01-04 22:22:44,541 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-400/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-04 22:23:42,050 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-400/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-04 22:23:42,054 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-400/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-04 22:23:42,054 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-400/special_tokens_map.json
[WARNING|trainer.py:869] 2025-01-04 22:27:42,312 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 1600, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-04 22:27:55,950 >> {'loss': 0.4606, 'grad_norm': 4.70241641998291, 'learning_rate': 6.717526169532658e-06, 'epoch': 0.44260485651214126, 'num_input_tokens_seen': 26279920, 'completed': '44.26% (401 / 906)', 'remaining time': '2:24:15', 'throughput': '48.61', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:28:09,491 >> {'loss': 0.3758, 'grad_norm': 4.453240871429443, 'learning_rate': 6.701692628537169e-06, 'epoch': 0.44370860927152317, 'num_input_tokens_seen': 26345456, 'completed': '44.37% (402 / 906)', 'remaining time': '2:23:54', 'throughput': '1209.98', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:28:23,047 >> {'loss': 0.2946, 'grad_norm': 4.099554061889648, 'learning_rate': 6.685843051552405e-06, 'epoch': 0.4448123620309051, 'num_input_tokens_seen': 26410992, 'completed': '44.48% (403 / 906)', 'remaining time': '2:23:32', 'throughput': '1208.67', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:28:36,384 >> {'loss': 0.574, 'grad_norm': 5.848024368286133, 'learning_rate': 6.669977650083075e-06, 'epoch': 0.445916114790287, 'num_input_tokens_seen': 26476528, 'completed': '44.59% (404 / 906)', 'remaining time': '2:23:10', 'throughput': '1228.39', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:28:49,751 >> {'loss': 0.6731, 'grad_norm': 5.945405960083008, 'learning_rate': 6.654096635845054e-06, 'epoch': 0.4470198675496689, 'num_input_tokens_seen': 26542064, 'completed': '44.70% (405 / 906)', 'remaining time': '2:22:49', 'throughput': '1225.70', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:29:03,223 >> {'loss': 0.2969, 'grad_norm': 4.190430164337158, 'learning_rate': 6.638200220762563e-06, 'epoch': 0.4481236203090508, 'num_input_tokens_seen': 26607600, 'completed': '44.81% (406 / 906)', 'remaining time': '2:22:27', 'throughput': '1216.15', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:29:16,704 >> {'loss': 0.3044, 'grad_norm': 4.02309513092041, 'learning_rate': 6.622288616965343e-06, 'epoch': 0.4492273730684327, 'num_input_tokens_seen': 26673136, 'completed': '44.92% (407 / 906)', 'remaining time': '2:22:05', 'throughput': '1215.39', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:29:30,239 >> {'loss': 0.4959, 'grad_norm': 5.067698955535889, 'learning_rate': 6.60636203678581e-06, 'epoch': 0.4503311258278146, 'num_input_tokens_seen': 26738672, 'completed': '45.03% (408 / 906)', 'remaining time': '2:21:44', 'throughput': '1210.49', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:29:43,646 >> {'loss': 0.3315, 'grad_norm': 4.1457695960998535, 'learning_rate': 6.590420692756247e-06, 'epoch': 0.45143487858719644, 'num_input_tokens_seen': 26804208, 'completed': '45.14% (409 / 906)', 'remaining time': '2:21:22', 'throughput': '1222.03', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:29:57,173 >> {'loss': 0.6599, 'grad_norm': 6.310359001159668, 'learning_rate': 6.574464797605938e-06, 'epoch': 0.45253863134657835, 'num_input_tokens_seen': 26869744, 'completed': '45.25% (410 / 906)', 'remaining time': '2:21:01', 'throughput': '1211.23', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:30:10,497 >> {'loss': 0.4739, 'grad_norm': 5.186681270599365, 'learning_rate': 6.558494564258362e-06, 'epoch': 0.45364238410596025, 'num_input_tokens_seen': 26935280, 'completed': '45.36% (411 / 906)', 'remaining time': '2:20:40', 'throughput': '1229.65', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:30:23,889 >> {'loss': 0.4659, 'grad_norm': 4.929074764251709, 'learning_rate': 6.542510205828316e-06, 'epoch': 0.45474613686534215, 'num_input_tokens_seen': 27000816, 'completed': '45.47% (412 / 906)', 'remaining time': '2:20:18', 'throughput': '1223.43', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:30:37,319 >> {'loss': 0.3948, 'grad_norm': 4.5425944328308105, 'learning_rate': 6.5265119356191005e-06, 'epoch': 0.45584988962472406, 'num_input_tokens_seen': 27066352, 'completed': '45.58% (413 / 906)', 'remaining time': '2:19:57', 'throughput': '1219.94', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:30:50,741 >> {'loss': 0.4228, 'grad_norm': 5.008086681365967, 'learning_rate': 6.51049996711966e-06, 'epoch': 0.45695364238410596, 'num_input_tokens_seen': 27131888, 'completed': '45.70% (414 / 906)', 'remaining time': '2:19:35', 'throughput': '1220.69', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:31:04,189 >> {'loss': 0.3959, 'grad_norm': 5.23047399520874, 'learning_rate': 6.494474514001734e-06, 'epoch': 0.45805739514348787, 'num_input_tokens_seen': 27197424, 'completed': '45.81% (415 / 906)', 'remaining time': '2:19:14', 'throughput': '1218.29', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:31:17,612 >> {'loss': 0.5117, 'grad_norm': 5.282271385192871, 'learning_rate': 6.478435790117007e-06, 'epoch': 0.45916114790286977, 'num_input_tokens_seen': 27262960, 'completed': '45.92% (416 / 906)', 'remaining time': '2:18:53', 'throughput': '1220.59', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:31:31,169 >> {'loss': 0.2912, 'grad_norm': 4.828380584716797, 'learning_rate': 6.462384009494257e-06, 'epoch': 0.4602649006622517, 'num_input_tokens_seen': 27328496, 'completed': '46.03% (417 / 906)', 'remaining time': '2:18:32', 'throughput': '1208.50', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:31:44,669 >> {'loss': 0.3067, 'grad_norm': 4.036655902862549, 'learning_rate': 6.446319386336499e-06, 'epoch': 0.4613686534216336, 'num_input_tokens_seen': 27394032, 'completed': '46.14% (418 / 906)', 'remaining time': '2:18:11', 'throughput': '1213.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:31:58,090 >> {'loss': 0.3897, 'grad_norm': 4.4774274826049805, 'learning_rate': 6.430242135018121e-06, 'epoch': 0.4624724061810154, 'num_input_tokens_seen': 27459568, 'completed': '46.25% (419 / 906)', 'remaining time': '2:17:50', 'throughput': '1220.80', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:32:11,630 >> {'loss': 0.5586, 'grad_norm': 5.575551509857178, 'learning_rate': 6.414152470082031e-06, 'epoch': 0.46357615894039733, 'num_input_tokens_seen': 27525104, 'completed': '46.36% (420 / 906)', 'remaining time': '2:17:29', 'throughput': '1210.03', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:32:25,150 >> {'loss': 0.3839, 'grad_norm': 4.964656829833984, 'learning_rate': 6.3980506062367884e-06, 'epoch': 0.46467991169977924, 'num_input_tokens_seen': 27590640, 'completed': '46.47% (421 / 906)', 'remaining time': '2:17:08', 'throughput': '1211.87', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:32:38,671 >> {'loss': 0.274, 'grad_norm': 6.3939948081970215, 'learning_rate': 6.3819367583537425e-06, 'epoch': 0.46578366445916114, 'num_input_tokens_seen': 27656176, 'completed': '46.58% (422 / 906)', 'remaining time': '2:16:47', 'throughput': '1211.77', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:32:52,089 >> {'loss': 0.5634, 'grad_norm': 6.026232719421387, 'learning_rate': 6.365811141464162e-06, 'epoch': 0.46688741721854304, 'num_input_tokens_seen': 27721712, 'completed': '46.69% (423 / 906)', 'remaining time': '2:16:26', 'throughput': '1220.99', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:33:05,713 >> {'loss': 0.4866, 'grad_norm': 5.401932716369629, 'learning_rate': 6.349673970756371e-06, 'epoch': 0.46799116997792495, 'num_input_tokens_seen': 27787248, 'completed': '46.80% (424 / 906)', 'remaining time': '2:16:05', 'throughput': '1202.64', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:33:19,146 >> {'loss': 0.4014, 'grad_norm': 5.220919609069824, 'learning_rate': 6.33352546157287e-06, 'epoch': 0.46909492273730685, 'num_input_tokens_seen': 27852784, 'completed': '46.91% (425 / 906)', 'remaining time': '2:15:44', 'throughput': '1219.61', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:33:32,541 >> {'loss': 0.489, 'grad_norm': 5.735633373260498, 'learning_rate': 6.317365829407465e-06, 'epoch': 0.47019867549668876, 'num_input_tokens_seen': 27918320, 'completed': '47.02% (426 / 906)', 'remaining time': '2:15:23', 'throughput': '1223.17', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:33:46,008 >> {'loss': 0.3543, 'grad_norm': 4.342199325561523, 'learning_rate': 6.301195289902395e-06, 'epoch': 0.47130242825607066, 'num_input_tokens_seen': 27983856, 'completed': '47.13% (427 / 906)', 'remaining time': '2:15:02', 'throughput': '1216.60', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:33:59,243 >> {'loss': 0.4572, 'grad_norm': 5.337205410003662, 'learning_rate': 6.2850140588454515e-06, 'epoch': 0.47240618101545256, 'num_input_tokens_seen': 28049392, 'completed': '47.24% (428 / 906)', 'remaining time': '2:14:41', 'throughput': '1237.90', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:34:12,435 >> {'loss': 0.3907, 'grad_norm': 4.443479061126709, 'learning_rate': 6.268822352167097e-06, 'epoch': 0.4735099337748344, 'num_input_tokens_seen': 28114928, 'completed': '47.35% (429 / 906)', 'remaining time': '2:14:20', 'throughput': '1242.03', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:34:25,712 >> {'loss': 0.4246, 'grad_norm': 4.553615570068359, 'learning_rate': 6.252620385937591e-06, 'epoch': 0.4746136865342163, 'num_input_tokens_seen': 28180464, 'completed': '47.46% (430 / 906)', 'remaining time': '2:14:00', 'throughput': '1233.98', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:34:38,800 >> {'loss': 0.6506, 'grad_norm': 5.831062316894531, 'learning_rate': 6.236408376364097e-06, 'epoch': 0.4757174392935982, 'num_input_tokens_seen': 28246000, 'completed': '47.57% (431 / 906)', 'remaining time': '2:13:38', 'throughput': '1251.85', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:34:52,034 >> {'loss': 0.6119, 'grad_norm': 5.902902603149414, 'learning_rate': 6.220186539787806e-06, 'epoch': 0.4768211920529801, 'num_input_tokens_seen': 28311536, 'completed': '47.68% (432 / 906)', 'remaining time': '2:13:18', 'throughput': '1238.07', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:35:05,334 >> {'loss': 0.3519, 'grad_norm': 4.526601314544678, 'learning_rate': 6.20395509268104e-06, 'epoch': 0.47792494481236203, 'num_input_tokens_seen': 28377072, 'completed': '47.79% (433 / 906)', 'remaining time': '2:12:57', 'throughput': '1231.80', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:35:18,618 >> {'loss': 0.5892, 'grad_norm': 5.268987655639648, 'learning_rate': 6.187714251644375e-06, 'epoch': 0.47902869757174393, 'num_input_tokens_seen': 28442608, 'completed': '47.90% (434 / 906)', 'remaining time': '2:12:36', 'throughput': '1233.38', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:35:32,022 >> {'loss': 0.4768, 'grad_norm': 5.396267414093018, 'learning_rate': 6.171464233403734e-06, 'epoch': 0.48013245033112584, 'num_input_tokens_seen': 28508144, 'completed': '48.01% (435 / 906)', 'remaining time': '2:12:15', 'throughput': '1222.36', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:35:45,239 >> {'loss': 0.7685, 'grad_norm': 6.279263019561768, 'learning_rate': 6.155205254807524e-06, 'epoch': 0.48123620309050774, 'num_input_tokens_seen': 28573680, 'completed': '48.12% (436 / 906)', 'remaining time': '2:11:55', 'throughput': '1239.64', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:35:58,460 >> {'loss': 0.3726, 'grad_norm': 4.332390785217285, 'learning_rate': 6.138937532823701e-06, 'epoch': 0.48233995584988965, 'num_input_tokens_seen': 28639216, 'completed': '48.23% (437 / 906)', 'remaining time': '2:11:34', 'throughput': '1239.23', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:36:11,757 >> {'loss': 0.4462, 'grad_norm': 5.152309417724609, 'learning_rate': 6.1226612845369134e-06, 'epoch': 0.48344370860927155, 'num_input_tokens_seen': 28704752, 'completed': '48.34% (438 / 906)', 'remaining time': '2:11:13', 'throughput': '1232.13', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:36:25,052 >> {'loss': 0.6085, 'grad_norm': 5.474704742431641, 'learning_rate': 6.1063767271455834e-06, 'epoch': 0.4845474613686534, 'num_input_tokens_seen': 28770288, 'completed': '48.45% (439 / 906)', 'remaining time': '2:10:53', 'throughput': '1232.37', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:36:38,205 >> {'loss': 0.5532, 'grad_norm': 5.3773345947265625, 'learning_rate': 6.090084077959013e-06, 'epoch': 0.4856512141280353, 'num_input_tokens_seen': 28835824, 'completed': '48.57% (440 / 906)', 'remaining time': '2:10:32', 'throughput': '1245.64', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:36:51,442 >> {'loss': 0.2709, 'grad_norm': 3.462773323059082, 'learning_rate': 6.073783554394486e-06, 'epoch': 0.4867549668874172, 'num_input_tokens_seen': 28901360, 'completed': '48.68% (441 / 906)', 'remaining time': '2:10:11', 'throughput': '1237.68', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:37:04,721 >> {'loss': 0.6188, 'grad_norm': 5.682385444641113, 'learning_rate': 6.057475373974366e-06, 'epoch': 0.4878587196467991, 'num_input_tokens_seen': 28966896, 'completed': '48.79% (442 / 906)', 'remaining time': '2:09:51', 'throughput': '1233.86', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:37:17,945 >> {'loss': 0.6366, 'grad_norm': 5.999266624450684, 'learning_rate': 6.041159754323196e-06, 'epoch': 0.488962472406181, 'num_input_tokens_seen': 29032432, 'completed': '48.90% (443 / 906)', 'remaining time': '2:09:30', 'throughput': '1238.96', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:37:31,208 >> {'loss': 0.5834, 'grad_norm': 6.69478702545166, 'learning_rate': 6.024836913164787e-06, 'epoch': 0.4900662251655629, 'num_input_tokens_seen': 29097968, 'completed': '49.01% (444 / 906)', 'remaining time': '2:09:10', 'throughput': '1235.30', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:37:44,403 >> {'loss': 0.7872, 'grad_norm': 6.7099103927612305, 'learning_rate': 6.008507068319318e-06, 'epoch': 0.4911699779249448, 'num_input_tokens_seen': 29163504, 'completed': '49.12% (445 / 906)', 'remaining time': '2:08:50', 'throughput': '1241.73', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:37:57,746 >> {'loss': 0.3586, 'grad_norm': 4.720966815948486, 'learning_rate': 5.992170437700436e-06, 'epoch': 0.4922737306843267, 'num_input_tokens_seen': 29229040, 'completed': '49.23% (446 / 906)', 'remaining time': '2:08:29', 'throughput': '1227.93', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:38:11,080 >> {'loss': 0.3405, 'grad_norm': 4.880152225494385, 'learning_rate': 5.9758272393123305e-06, 'epoch': 0.49337748344370863, 'num_input_tokens_seen': 29294576, 'completed': '49.34% (447 / 906)', 'remaining time': '2:08:09', 'throughput': '1228.67', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:38:24,372 >> {'loss': 0.5912, 'grad_norm': 5.520662784576416, 'learning_rate': 5.959477691246842e-06, 'epoch': 0.49448123620309054, 'num_input_tokens_seen': 29360112, 'completed': '49.45% (448 / 906)', 'remaining time': '2:07:49', 'throughput': '1232.69', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:38:37,593 >> {'loss': 0.4973, 'grad_norm': 4.44854211807251, 'learning_rate': 5.943122011680542e-06, 'epoch': 0.4955849889624724, 'num_input_tokens_seen': 29425648, 'completed': '49.56% (449 / 906)', 'remaining time': '2:07:28', 'throughput': '1239.24', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:38:50,718 >> {'loss': 0.2792, 'grad_norm': 3.7942519187927246, 'learning_rate': 5.926760418871823e-06, 'epoch': 0.4966887417218543, 'num_input_tokens_seen': 29491184, 'completed': '49.67% (450 / 906)', 'remaining time': '2:07:08', 'throughput': '1248.28', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:39:03,774 >> {'loss': 0.3856, 'grad_norm': 4.290163993835449, 'learning_rate': 5.910393131157987e-06, 'epoch': 0.4977924944812362, 'num_input_tokens_seen': 29556720, 'completed': '49.78% (451 / 906)', 'remaining time': '2:06:47', 'throughput': '1254.91', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:39:16,888 >> {'loss': 0.6704, 'grad_norm': 6.079360485076904, 'learning_rate': 5.894020366952331e-06, 'epoch': 0.4988962472406181, 'num_input_tokens_seen': 29622256, 'completed': '49.89% (452 / 906)', 'remaining time': '2:06:27', 'throughput': '1249.30', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:39:30,126 >> {'loss': 0.4156, 'grad_norm': 4.259561061859131, 'learning_rate': 5.8776423447412366e-06, 'epoch': 0.5, 'num_input_tokens_seen': 29687792, 'completed': '50.00% (453 / 906)', 'remaining time': '2:06:07', 'throughput': '1237.73', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:39:43,393 >> {'loss': 0.5608, 'grad_norm': 5.737578392028809, 'learning_rate': 5.861259283081246e-06, 'epoch': 0.5011037527593819, 'num_input_tokens_seen': 29753328, 'completed': '50.11% (454 / 906)', 'remaining time': '2:05:47', 'throughput': '1234.91', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:39:56,676 >> {'loss': 0.4287, 'grad_norm': 4.720670700073242, 'learning_rate': 5.844871400596154e-06, 'epoch': 0.5022075055187638, 'num_input_tokens_seen': 29818864, 'completed': '50.22% (455 / 906)', 'remaining time': '2:05:27', 'throughput': '1233.45', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:40:10,155 >> {'loss': 0.2397, 'grad_norm': 3.384713649749756, 'learning_rate': 5.828478915974084e-06, 'epoch': 0.5033112582781457, 'num_input_tokens_seen': 29884400, 'completed': '50.33% (456 / 906)', 'remaining time': '2:05:07', 'throughput': '1215.54', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:40:23,304 >> {'loss': 0.5351, 'grad_norm': 5.180420398712158, 'learning_rate': 5.812082047964578e-06, 'epoch': 0.5044150110375276, 'num_input_tokens_seen': 29949936, 'completed': '50.44% (457 / 906)', 'remaining time': '2:04:47', 'throughput': '1246.04', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:40:36,602 >> {'loss': 0.5918, 'grad_norm': 5.582147121429443, 'learning_rate': 5.795681015375664e-06, 'epoch': 0.5055187637969095, 'num_input_tokens_seen': 30015472, 'completed': '50.55% (458 / 906)', 'remaining time': '2:04:27', 'throughput': '1232.07', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:40:49,707 >> {'loss': 0.504, 'grad_norm': 5.289458274841309, 'learning_rate': 5.779276037070951e-06, 'epoch': 0.5066225165562914, 'num_input_tokens_seen': 30081008, 'completed': '50.66% (459 / 906)', 'remaining time': '2:04:07', 'throughput': '1250.20', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:41:02,918 >> {'loss': 0.3156, 'grad_norm': 4.269794464111328, 'learning_rate': 5.762867331966698e-06, 'epoch': 0.5077262693156733, 'num_input_tokens_seen': 30146544, 'completed': '50.77% (460 / 906)', 'remaining time': '2:03:47', 'throughput': '1240.11', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:41:16,155 >> {'loss': 0.3911, 'grad_norm': 4.807360649108887, 'learning_rate': 5.746455119028896e-06, 'epoch': 0.5088300220750552, 'num_input_tokens_seen': 30212080, 'completed': '50.88% (461 / 906)', 'remaining time': '2:03:27', 'throughput': '1237.79', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:41:29,346 >> {'loss': 0.4696, 'grad_norm': 5.4606122970581055, 'learning_rate': 5.730039617270353e-06, 'epoch': 0.5099337748344371, 'num_input_tokens_seen': 30277616, 'completed': '50.99% (462 / 906)', 'remaining time': '2:03:07', 'throughput': '1242.08', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:41:42,555 >> {'loss': 0.2897, 'grad_norm': 4.116583824157715, 'learning_rate': 5.7136210457477546e-06, 'epoch': 0.5110375275938189, 'num_input_tokens_seen': 30343152, 'completed': '51.10% (463 / 906)', 'remaining time': '2:02:47', 'throughput': '1240.29', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:41:55,735 >> {'loss': 0.3166, 'grad_norm': 3.7440638542175293, 'learning_rate': 5.697199623558758e-06, 'epoch': 0.5121412803532008, 'num_input_tokens_seen': 30408688, 'completed': '51.21% (464 / 906)', 'remaining time': '2:02:27', 'throughput': '1243.10', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:42:08,928 >> {'loss': 0.695, 'grad_norm': 7.0095014572143555, 'learning_rate': 5.680775569839058e-06, 'epoch': 0.5132450331125827, 'num_input_tokens_seen': 30474224, 'completed': '51.32% (465 / 906)', 'remaining time': '2:02:07', 'throughput': '1241.92', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:42:22,090 >> {'loss': 0.2507, 'grad_norm': 3.876681089401245, 'learning_rate': 5.664349103759467e-06, 'epoch': 0.5143487858719646, 'num_input_tokens_seen': 30539760, 'completed': '51.43% (466 / 906)', 'remaining time': '2:01:47', 'throughput': '1244.83', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:42:35,318 >> {'loss': 0.4078, 'grad_norm': 4.9294753074646, 'learning_rate': 5.647920444522986e-06, 'epoch': 0.5154525386313465, 'num_input_tokens_seen': 30605296, 'completed': '51.55% (467 / 906)', 'remaining time': '2:01:27', 'throughput': '1238.51', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:42:48,395 >> {'loss': 0.6855, 'grad_norm': 5.85141658782959, 'learning_rate': 5.631489811361891e-06, 'epoch': 0.5165562913907285, 'num_input_tokens_seen': 30670832, 'completed': '51.66% (468 / 906)', 'remaining time': '2:01:07', 'throughput': '1252.94', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:43:01,469 >> {'loss': 0.3744, 'grad_norm': 4.421255588531494, 'learning_rate': 5.615057423534788e-06, 'epoch': 0.5176600441501104, 'num_input_tokens_seen': 30736368, 'completed': '51.77% (469 / 906)', 'remaining time': '2:00:48', 'throughput': '1253.18', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:43:14,646 >> {'loss': 0.3238, 'grad_norm': 4.160279750823975, 'learning_rate': 5.5986235003237065e-06, 'epoch': 0.5187637969094923, 'num_input_tokens_seen': 30801904, 'completed': '51.88% (470 / 906)', 'remaining time': '2:00:28', 'throughput': '1243.39', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:43:27,807 >> {'loss': 0.5394, 'grad_norm': 5.7004194259643555, 'learning_rate': 5.5821882610311625e-06, 'epoch': 0.5198675496688742, 'num_input_tokens_seen': 30867440, 'completed': '51.99% (471 / 906)', 'remaining time': '2:00:08', 'throughput': '1244.86', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:43:40,963 >> {'loss': 0.4037, 'grad_norm': 6.271990776062012, 'learning_rate': 5.565751924977232e-06, 'epoch': 0.5209713024282561, 'num_input_tokens_seen': 30932976, 'completed': '52.10% (472 / 906)', 'remaining time': '1:59:48', 'throughput': '1245.34', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:43:54,134 >> {'loss': 0.4116, 'grad_norm': 4.508204936981201, 'learning_rate': 5.549314711496631e-06, 'epoch': 0.522075055187638, 'num_input_tokens_seen': 30998512, 'completed': '52.21% (473 / 906)', 'remaining time': '1:59:29', 'throughput': '1243.93', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:44:07,428 >> {'loss': 0.2638, 'grad_norm': 4.369766712188721, 'learning_rate': 5.532876839935779e-06, 'epoch': 0.5231788079470199, 'num_input_tokens_seen': 31064048, 'completed': '52.32% (474 / 906)', 'remaining time': '1:59:09', 'throughput': '1232.46', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:44:20,628 >> {'loss': 0.4689, 'grad_norm': 5.024233341217041, 'learning_rate': 5.516438529649883e-06, 'epoch': 0.5242825607064018, 'num_input_tokens_seen': 31129584, 'completed': '52.43% (475 / 906)', 'remaining time': '1:58:50', 'throughput': '1241.26', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:44:33,806 >> {'loss': 0.3456, 'grad_norm': 4.348725318908691, 'learning_rate': 5.500000000000001e-06, 'epoch': 0.5253863134657837, 'num_input_tokens_seen': 31195120, 'completed': '52.54% (476 / 906)', 'remaining time': '1:58:30', 'throughput': '1243.23', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:44:47,104 >> {'loss': 0.2993, 'grad_norm': 4.22805643081665, 'learning_rate': 5.483561470350118e-06, 'epoch': 0.5264900662251656, 'num_input_tokens_seen': 31260656, 'completed': '52.65% (477 / 906)', 'remaining time': '1:58:11', 'throughput': '1232.12', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:45:00,272 >> {'loss': 0.4853, 'grad_norm': 5.037574291229248, 'learning_rate': 5.467123160064222e-06, 'epoch': 0.5275938189845475, 'num_input_tokens_seen': 31326192, 'completed': '52.76% (478 / 906)', 'remaining time': '1:57:51', 'throughput': '1244.19', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:45:13,554 >> {'loss': 0.1977, 'grad_norm': 4.518187522888184, 'learning_rate': 5.4506852885033715e-06, 'epoch': 0.5286975717439294, 'num_input_tokens_seen': 31391728, 'completed': '52.87% (479 / 906)', 'remaining time': '1:57:32', 'throughput': '1233.54', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:45:26,767 >> {'loss': 0.6275, 'grad_norm': 5.805148601531982, 'learning_rate': 5.434248075022769e-06, 'epoch': 0.5298013245033113, 'num_input_tokens_seen': 31457264, 'completed': '52.98% (480 / 906)', 'remaining time': '1:57:12', 'throughput': '1239.96', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:45:39,944 >> {'loss': 0.3938, 'grad_norm': 4.804733753204346, 'learning_rate': 5.417811738968839e-06, 'epoch': 0.5309050772626932, 'num_input_tokens_seen': 31522800, 'completed': '53.09% (481 / 906)', 'remaining time': '1:56:53', 'throughput': '1243.42', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:45:53,246 >> {'loss': 0.3928, 'grad_norm': 4.533995151519775, 'learning_rate': 5.401376499676294e-06, 'epoch': 0.5320088300220751, 'num_input_tokens_seen': 31588336, 'completed': '53.20% (482 / 906)', 'remaining time': '1:56:33', 'throughput': '1231.67', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:46:06,358 >> {'loss': 0.3949, 'grad_norm': 4.422305583953857, 'learning_rate': 5.384942576465215e-06, 'epoch': 0.5331125827814569, 'num_input_tokens_seen': 31653872, 'completed': '53.31% (483 / 906)', 'remaining time': '1:56:14', 'throughput': '1249.60', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:46:19,485 >> {'loss': 0.4996, 'grad_norm': 5.579964637756348, 'learning_rate': 5.368510188638113e-06, 'epoch': 0.5342163355408388, 'num_input_tokens_seen': 31719408, 'completed': '53.42% (484 / 906)', 'remaining time': '1:55:54', 'throughput': '1248.10', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:46:32,746 >> {'loss': 0.3976, 'grad_norm': 4.969320774078369, 'learning_rate': 5.3520795554770155e-06, 'epoch': 0.5353200883002207, 'num_input_tokens_seen': 31784944, 'completed': '53.53% (485 / 906)', 'remaining time': '1:55:35', 'throughput': '1235.47', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:46:45,959 >> {'loss': 0.309, 'grad_norm': 4.212143421173096, 'learning_rate': 5.3356508962405355e-06, 'epoch': 0.5364238410596026, 'num_input_tokens_seen': 31850480, 'completed': '53.64% (486 / 906)', 'remaining time': '1:55:16', 'throughput': '1239.98', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:46:59,336 >> {'loss': 0.2525, 'grad_norm': 3.8276007175445557, 'learning_rate': 5.319224430160943e-06, 'epoch': 0.5375275938189845, 'num_input_tokens_seen': 31916016, 'completed': '53.75% (487 / 906)', 'remaining time': '1:54:57', 'throughput': '1224.81', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:47:12,569 >> {'loss': 0.4736, 'grad_norm': 5.205665111541748, 'learning_rate': 5.302800376441244e-06, 'epoch': 0.5386313465783664, 'num_input_tokens_seen': 31981552, 'completed': '53.86% (488 / 906)', 'remaining time': '1:54:38', 'throughput': '1238.12', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:47:25,743 >> {'loss': 0.3377, 'grad_norm': 4.5672807693481445, 'learning_rate': 5.286378954252247e-06, 'epoch': 0.5397350993377483, 'num_input_tokens_seen': 32047088, 'completed': '53.97% (489 / 906)', 'remaining time': '1:54:18', 'throughput': '1243.70', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:47:38,954 >> {'loss': 0.5481, 'grad_norm': 5.600037097930908, 'learning_rate': 5.269960382729649e-06, 'epoch': 0.5408388520971302, 'num_input_tokens_seen': 32112624, 'completed': '54.08% (490 / 906)', 'remaining time': '1:53:59', 'throughput': '1240.16', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:47:52,333 >> {'loss': 0.4308, 'grad_norm': 5.772684097290039, 'learning_rate': 5.2535448809711046e-06, 'epoch': 0.5419426048565121, 'num_input_tokens_seen': 32178160, 'completed': '54.19% (491 / 906)', 'remaining time': '1:53:40', 'throughput': '1224.56', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:48:05,681 >> {'loss': 0.497, 'grad_norm': 5.687682628631592, 'learning_rate': 5.237132668033303e-06, 'epoch': 0.543046357615894, 'num_input_tokens_seen': 32243696, 'completed': '54.30% (492 / 906)', 'remaining time': '1:53:21', 'throughput': '1227.48', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:48:18,837 >> {'loss': 0.4745, 'grad_norm': 4.946759223937988, 'learning_rate': 5.220723962929052e-06, 'epoch': 0.5441501103752759, 'num_input_tokens_seen': 32309232, 'completed': '54.42% (493 / 906)', 'remaining time': '1:53:02', 'throughput': '1245.33', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:48:31,924 >> {'loss': 0.505, 'grad_norm': 5.427900791168213, 'learning_rate': 5.204318984624338e-06, 'epoch': 0.5452538631346578, 'num_input_tokens_seen': 32374768, 'completed': '54.53% (494 / 906)', 'remaining time': '1:52:43', 'throughput': '1251.96', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:48:45,056 >> {'loss': 0.4926, 'grad_norm': 5.026839256286621, 'learning_rate': 5.187917952035424e-06, 'epoch': 0.5463576158940397, 'num_input_tokens_seen': 32440304, 'completed': '54.64% (495 / 906)', 'remaining time': '1:52:24', 'throughput': '1247.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:48:58,458 >> {'loss': 0.2596, 'grad_norm': 3.4422245025634766, 'learning_rate': 5.171521084025917e-06, 'epoch': 0.5474613686534217, 'num_input_tokens_seen': 32505840, 'completed': '54.75% (496 / 906)', 'remaining time': '1:52:05', 'throughput': '1222.53', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:49:11,565 >> {'loss': 0.4804, 'grad_norm': 5.574887752532959, 'learning_rate': 5.155128599403849e-06, 'epoch': 0.5485651214128036, 'num_input_tokens_seen': 32571376, 'completed': '54.86% (497 / 906)', 'remaining time': '1:51:46', 'throughput': '1249.95', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:49:24,748 >> {'loss': 0.5245, 'grad_norm': 4.954028129577637, 'learning_rate': 5.138740716918755e-06, 'epoch': 0.5496688741721855, 'num_input_tokens_seen': 32636912, 'completed': '54.97% (498 / 906)', 'remaining time': '1:51:26', 'throughput': '1242.87', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:49:37,970 >> {'loss': 0.5285, 'grad_norm': 6.411076068878174, 'learning_rate': 5.122357655258765e-06, 'epoch': 0.5507726269315674, 'num_input_tokens_seen': 32702448, 'completed': '55.08% (499 / 906)', 'remaining time': '1:51:08', 'throughput': '1239.17', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:49:51,114 >> {'loss': 0.6117, 'grad_norm': 6.448341369628906, 'learning_rate': 5.105979633047669e-06, 'epoch': 0.5518763796909493, 'num_input_tokens_seen': 32767984, 'completed': '55.19% (500 / 906)', 'remaining time': '1:50:49', 'throughput': '1246.49', 'gpu_mem_free': '30139MB'}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[INFO|trainer.py:3503] 2025-01-04 22:50:16,734 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-500
[INFO|configuration_utils.py:472] 2025-01-04 22:50:16,737 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-500/config.json
[INFO|configuration_utils.py:807] 2025-01-04 22:50:16,738 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-500/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-04 22:51:15,521 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-500/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-04 22:51:15,525 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-500/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-04 22:51:15,525 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-500/special_tokens_map.json
[WARNING|trainer.py:869] 2025-01-04 22:55:12,488 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 2000, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-04 22:55:26,384 >> {'loss': 0.5242, 'grad_norm': 5.298100471496582, 'learning_rate': 5.0896068688420146e-06, 'epoch': 0.5529801324503312, 'num_input_tokens_seen': 32833520, 'completed': '55.30% (501 / 906)', 'remaining time': '1:54:50', 'throughput': '48.87', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:55:39,715 >> {'loss': 0.6421, 'grad_norm': 5.838515281677246, 'learning_rate': 5.07323958112818e-06, 'epoch': 0.5540838852097131, 'num_input_tokens_seen': 32899056, 'completed': '55.41% (502 / 906)', 'remaining time': '1:54:30', 'throughput': '1229.04', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:55:53,152 >> {'loss': 0.4227, 'grad_norm': 4.797820091247559, 'learning_rate': 5.056877988319459e-06, 'epoch': 0.5551876379690949, 'num_input_tokens_seen': 32964592, 'completed': '55.52% (503 / 906)', 'remaining time': '1:54:10', 'throughput': '1219.30', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:56:06,299 >> {'loss': 0.4165, 'grad_norm': 4.745166301727295, 'learning_rate': 5.04052230875316e-06, 'epoch': 0.5562913907284768, 'num_input_tokens_seen': 33030128, 'completed': '55.63% (504 / 906)', 'remaining time': '1:53:50', 'throughput': '1246.15', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:56:19,670 >> {'loss': 0.256, 'grad_norm': 3.404788017272949, 'learning_rate': 5.024172760687671e-06, 'epoch': 0.5573951434878587, 'num_input_tokens_seen': 33095664, 'completed': '55.74% (505 / 906)', 'remaining time': '1:53:30', 'throughput': '1225.38', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:56:33,037 >> {'loss': 0.5258, 'grad_norm': 5.247468948364258, 'learning_rate': 5.007829562299567e-06, 'epoch': 0.5584988962472406, 'num_input_tokens_seen': 33161200, 'completed': '55.85% (506 / 906)', 'remaining time': '1:53:10', 'throughput': '1225.69', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:56:46,419 >> {'loss': 0.4969, 'grad_norm': 4.767642021179199, 'learning_rate': 4.991492931680684e-06, 'epoch': 0.5596026490066225, 'num_input_tokens_seen': 33226736, 'completed': '55.96% (507 / 906)', 'remaining time': '1:52:50', 'throughput': '1224.33', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:56:59,718 >> {'loss': 0.554, 'grad_norm': 5.1648101806640625, 'learning_rate': 4.975163086835216e-06, 'epoch': 0.5607064017660044, 'num_input_tokens_seen': 33292272, 'completed': '56.07% (508 / 906)', 'remaining time': '1:52:31', 'throughput': '1231.97', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:57:13,200 >> {'loss': 0.4574, 'grad_norm': 5.1125264167785645, 'learning_rate': 4.958840245676806e-06, 'epoch': 0.5618101545253863, 'num_input_tokens_seen': 33357808, 'completed': '56.18% (509 / 906)', 'remaining time': '1:52:11', 'throughput': '1215.25', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:57:26,590 >> {'loss': 0.5128, 'grad_norm': 5.249313831329346, 'learning_rate': 4.9425246260256345e-06, 'epoch': 0.5629139072847682, 'num_input_tokens_seen': 33423344, 'completed': '56.29% (510 / 906)', 'remaining time': '1:51:51', 'throughput': '1223.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:57:39,953 >> {'loss': 0.2696, 'grad_norm': 3.5619633197784424, 'learning_rate': 4.9262164456055165e-06, 'epoch': 0.5640176600441501, 'num_input_tokens_seen': 33488880, 'completed': '56.40% (511 / 906)', 'remaining time': '1:51:32', 'throughput': '1226.05', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:57:53,248 >> {'loss': 0.4366, 'grad_norm': 4.751221656799316, 'learning_rate': 4.909915922040989e-06, 'epoch': 0.565121412803532, 'num_input_tokens_seen': 33554416, 'completed': '56.51% (512 / 906)', 'remaining time': '1:51:12', 'throughput': '1232.38', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:58:06,598 >> {'loss': 0.3903, 'grad_norm': 4.820193767547607, 'learning_rate': 4.893623272854417e-06, 'epoch': 0.5662251655629139, 'num_input_tokens_seen': 33619952, 'completed': '56.62% (513 / 906)', 'remaining time': '1:50:52', 'throughput': '1227.24', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:58:19,919 >> {'loss': 0.6647, 'grad_norm': 6.389187335968018, 'learning_rate': 4.877338715463087e-06, 'epoch': 0.5673289183222958, 'num_input_tokens_seen': 33685488, 'completed': '56.73% (514 / 906)', 'remaining time': '1:50:32', 'throughput': '1229.90', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:58:33,207 >> {'loss': 0.4495, 'grad_norm': 5.3614301681518555, 'learning_rate': 4.861062467176302e-06, 'epoch': 0.5684326710816777, 'num_input_tokens_seen': 33751024, 'completed': '56.84% (515 / 906)', 'remaining time': '1:50:13', 'throughput': '1233.06', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:58:46,427 >> {'loss': 0.5787, 'grad_norm': 5.853780746459961, 'learning_rate': 4.844794745192479e-06, 'epoch': 0.5695364238410596, 'num_input_tokens_seen': 33816560, 'completed': '56.95% (516 / 906)', 'remaining time': '1:49:53', 'throughput': '1239.33', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:58:59,789 >> {'loss': 0.409, 'grad_norm': 4.756037712097168, 'learning_rate': 4.828535766596266e-06, 'epoch': 0.5706401766004415, 'num_input_tokens_seen': 33882096, 'completed': '57.06% (517 / 906)', 'remaining time': '1:49:33', 'throughput': '1226.11', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:59:13,195 >> {'loss': 0.4346, 'grad_norm': 4.552085876464844, 'learning_rate': 4.8122857483556285e-06, 'epoch': 0.5717439293598234, 'num_input_tokens_seen': 33947632, 'completed': '57.17% (518 / 906)', 'remaining time': '1:49:14', 'throughput': '1222.17', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:59:26,619 >> {'loss': 0.4542, 'grad_norm': 5.546624660491943, 'learning_rate': 4.796044907318961e-06, 'epoch': 0.5728476821192053, 'num_input_tokens_seen': 34013168, 'completed': '57.28% (519 / 906)', 'remaining time': '1:48:54', 'throughput': '1220.48', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 22:59:40,121 >> {'loss': 0.3123, 'grad_norm': 3.965635061264038, 'learning_rate': 4.779813460212197e-06, 'epoch': 0.5739514348785872, 'num_input_tokens_seen': 34078704, 'completed': '57.40% (520 / 906)', 'remaining time': '1:48:35', 'throughput': '1213.45', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 22:59:53,549 >> {'loss': 0.1869, 'grad_norm': 3.4151055812835693, 'learning_rate': 4.763591623635905e-06, 'epoch': 0.5750551876379691, 'num_input_tokens_seen': 34144240, 'completed': '57.51% (521 / 906)', 'remaining time': '1:48:16', 'throughput': '1220.13', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:00:06,876 >> {'loss': 0.2873, 'grad_norm': 3.900256395339966, 'learning_rate': 4.747379614062411e-06, 'epoch': 0.5761589403973509, 'num_input_tokens_seen': 34209776, 'completed': '57.62% (522 / 906)', 'remaining time': '1:47:56', 'throughput': '1229.39', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:00:20,011 >> {'loss': 0.3625, 'grad_norm': 4.467334270477295, 'learning_rate': 4.731177647832905e-06, 'epoch': 0.5772626931567328, 'num_input_tokens_seen': 34275312, 'completed': '57.73% (523 / 906)', 'remaining time': '1:47:37', 'throughput': '1247.36', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:00:33,363 >> {'loss': 0.4265, 'grad_norm': 4.834880828857422, 'learning_rate': 4.714985941154551e-06, 'epoch': 0.5783664459161147, 'num_input_tokens_seen': 34340848, 'completed': '57.84% (524 / 906)', 'remaining time': '1:47:17', 'throughput': '1227.10', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:00:46,573 >> {'loss': 0.5406, 'grad_norm': 5.462576866149902, 'learning_rate': 4.698804710097607e-06, 'epoch': 0.5794701986754967, 'num_input_tokens_seen': 34406384, 'completed': '57.95% (525 / 906)', 'remaining time': '1:46:58', 'throughput': '1240.26', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:00:59,719 >> {'loss': 0.553, 'grad_norm': 5.588657379150391, 'learning_rate': 4.682634170592537e-06, 'epoch': 0.5805739514348786, 'num_input_tokens_seen': 34471920, 'completed': '58.06% (526 / 906)', 'remaining time': '1:46:38', 'throughput': '1246.31', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:01:12,990 >> {'loss': 0.5007, 'grad_norm': 4.872191429138184, 'learning_rate': 4.6664745384271315e-06, 'epoch': 0.5816777041942605, 'num_input_tokens_seen': 34537456, 'completed': '58.17% (527 / 906)', 'remaining time': '1:46:19', 'throughput': '1234.59', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:01:26,203 >> {'loss': 0.6074, 'grad_norm': 5.454434394836426, 'learning_rate': 4.650326029243629e-06, 'epoch': 0.5827814569536424, 'num_input_tokens_seen': 34602992, 'completed': '58.28% (528 / 906)', 'remaining time': '1:45:59', 'throughput': '1239.96', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:01:39,483 >> {'loss': 0.2471, 'grad_norm': 3.7384860515594482, 'learning_rate': 4.634188858535839e-06, 'epoch': 0.5838852097130243, 'num_input_tokens_seen': 34668528, 'completed': '58.39% (529 / 906)', 'remaining time': '1:45:40', 'throughput': '1233.73', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:01:52,620 >> {'loss': 0.446, 'grad_norm': 5.02782678604126, 'learning_rate': 4.61806324164626e-06, 'epoch': 0.5849889624724062, 'num_input_tokens_seen': 34734064, 'completed': '58.50% (530 / 906)', 'remaining time': '1:45:21', 'throughput': '1247.21', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:02:05,925 >> {'loss': 0.4206, 'grad_norm': 4.343999862670898, 'learning_rate': 4.601949393763215e-06, 'epoch': 0.5860927152317881, 'num_input_tokens_seen': 34799600, 'completed': '58.61% (531 / 906)', 'remaining time': '1:45:01', 'throughput': '1231.38', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:02:19,011 >> {'loss': 0.6584, 'grad_norm': 6.126996040344238, 'learning_rate': 4.58584752991797e-06, 'epoch': 0.58719646799117, 'num_input_tokens_seen': 34865136, 'completed': '58.72% (532 / 906)', 'remaining time': '1:44:42', 'throughput': '1252.07', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:02:32,158 >> {'loss': 0.5208, 'grad_norm': 4.696587085723877, 'learning_rate': 4.56975786498188e-06, 'epoch': 0.5883002207505519, 'num_input_tokens_seen': 34930672, 'completed': '58.83% (533 / 906)', 'remaining time': '1:44:22', 'throughput': '1246.19', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:02:45,376 >> {'loss': 0.4503, 'grad_norm': 4.739029884338379, 'learning_rate': 4.553680613663504e-06, 'epoch': 0.5894039735099338, 'num_input_tokens_seen': 34996208, 'completed': '58.94% (534 / 906)', 'remaining time': '1:44:03', 'throughput': '1239.50', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:02:58,671 >> {'loss': 0.4225, 'grad_norm': 4.531418323516846, 'learning_rate': 4.537615990505744e-06, 'epoch': 0.5905077262693157, 'num_input_tokens_seen': 35061744, 'completed': '59.05% (535 / 906)', 'remaining time': '1:43:44', 'throughput': '1232.39', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:03:11,823 >> {'loss': 0.4602, 'grad_norm': 4.888784885406494, 'learning_rate': 4.521564209882995e-06, 'epoch': 0.5916114790286976, 'num_input_tokens_seen': 35127280, 'completed': '59.16% (536 / 906)', 'remaining time': '1:43:25', 'throughput': '1245.66', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:03:25,101 >> {'loss': 0.5887, 'grad_norm': 5.615579605102539, 'learning_rate': 4.505525485998267e-06, 'epoch': 0.5927152317880795, 'num_input_tokens_seen': 35192816, 'completed': '59.27% (537 / 906)', 'remaining time': '1:43:06', 'throughput': '1233.93', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:03:38,445 >> {'loss': 0.4209, 'grad_norm': 4.535574913024902, 'learning_rate': 4.489500032880342e-06, 'epoch': 0.5938189845474614, 'num_input_tokens_seen': 35258352, 'completed': '59.38% (538 / 906)', 'remaining time': '1:42:46', 'throughput': '1227.86', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:03:51,657 >> {'loss': 0.672, 'grad_norm': 5.854379177093506, 'learning_rate': 4.473488064380901e-06, 'epoch': 0.5949227373068433, 'num_input_tokens_seen': 35323888, 'completed': '59.49% (539 / 906)', 'remaining time': '1:42:27', 'throughput': '1240.05', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:04:05,102 >> {'loss': 0.4111, 'grad_norm': 4.801560878753662, 'learning_rate': 4.457489794171685e-06, 'epoch': 0.5960264900662252, 'num_input_tokens_seen': 35389424, 'completed': '59.60% (540 / 906)', 'remaining time': '1:42:08', 'throughput': '1218.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:04:18,563 >> {'loss': 0.2854, 'grad_norm': 4.031239032745361, 'learning_rate': 4.44150543574164e-06, 'epoch': 0.5971302428256071, 'num_input_tokens_seen': 35454960, 'completed': '59.71% (541 / 906)', 'remaining time': '1:41:49', 'throughput': '1217.12', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:04:31,876 >> {'loss': 0.403, 'grad_norm': 4.758004665374756, 'learning_rate': 4.4255352023940616e-06, 'epoch': 0.5982339955849889, 'num_input_tokens_seen': 35520496, 'completed': '59.82% (542 / 906)', 'remaining time': '1:41:30', 'throughput': '1230.66', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:04:45,171 >> {'loss': 0.5664, 'grad_norm': 6.281385898590088, 'learning_rate': 4.4095793072437554e-06, 'epoch': 0.5993377483443708, 'num_input_tokens_seen': 35586032, 'completed': '59.93% (543 / 906)', 'remaining time': '1:41:11', 'throughput': '1232.35', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:04:58,596 >> {'loss': 0.4567, 'grad_norm': 4.875401973724365, 'learning_rate': 4.393637963214191e-06, 'epoch': 0.6004415011037527, 'num_input_tokens_seen': 35651568, 'completed': '60.04% (544 / 906)', 'remaining time': '1:40:52', 'throughput': '1220.45', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:05:11,790 >> {'loss': 0.8247, 'grad_norm': 6.389124870300293, 'learning_rate': 4.37771138303466e-06, 'epoch': 0.6015452538631346, 'num_input_tokens_seen': 35717104, 'completed': '60.15% (545 / 906)', 'remaining time': '1:40:33', 'throughput': '1241.73', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:05:24,986 >> {'loss': 0.5074, 'grad_norm': 4.922704696655273, 'learning_rate': 4.3617997792374365e-06, 'epoch': 0.6026490066225165, 'num_input_tokens_seen': 35782640, 'completed': '60.26% (546 / 906)', 'remaining time': '1:40:14', 'throughput': '1241.57', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:05:38,258 >> {'loss': 0.5406, 'grad_norm': 5.83562707901001, 'learning_rate': 4.345903364154949e-06, 'epoch': 0.6037527593818984, 'num_input_tokens_seen': 35848176, 'completed': '60.38% (547 / 906)', 'remaining time': '1:39:55', 'throughput': '1234.49', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:05:51,555 >> {'loss': 0.5394, 'grad_norm': 5.12996244430542, 'learning_rate': 4.330022349916928e-06, 'epoch': 0.6048565121412803, 'num_input_tokens_seen': 35913712, 'completed': '60.49% (548 / 906)', 'remaining time': '1:39:36', 'throughput': '1232.18', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:06:04,854 >> {'loss': 0.6657, 'grad_norm': 5.372644424438477, 'learning_rate': 4.314156948447596e-06, 'epoch': 0.6059602649006622, 'num_input_tokens_seen': 35979248, 'completed': '60.60% (549 / 906)', 'remaining time': '1:39:17', 'throughput': '1231.97', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:06:18,021 >> {'loss': 0.5652, 'grad_norm': 5.526101112365723, 'learning_rate': 4.298307371462833e-06, 'epoch': 0.6070640176600441, 'num_input_tokens_seen': 36044784, 'completed': '60.71% (550 / 906)', 'remaining time': '1:38:58', 'throughput': '1244.30', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:06:31,373 >> {'loss': 0.1381, 'grad_norm': 2.293391227722168, 'learning_rate': 4.282473830467342e-06, 'epoch': 0.608167770419426, 'num_input_tokens_seen': 36110320, 'completed': '60.82% (551 / 906)', 'remaining time': '1:38:40', 'throughput': '1227.16', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:06:44,677 >> {'loss': 0.2672, 'grad_norm': 3.795881509780884, 'learning_rate': 4.26665653675184e-06, 'epoch': 0.609271523178808, 'num_input_tokens_seen': 36175856, 'completed': '60.93% (552 / 906)', 'remaining time': '1:38:21', 'throughput': '1231.47', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:06:58,023 >> {'loss': 0.2625, 'grad_norm': 3.400076389312744, 'learning_rate': 4.250855701390225e-06, 'epoch': 0.6103752759381899, 'num_input_tokens_seen': 36241392, 'completed': '61.04% (553 / 906)', 'remaining time': '1:38:02', 'throughput': '1227.58', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:07:11,221 >> {'loss': 0.5561, 'grad_norm': 5.294395446777344, 'learning_rate': 4.235071535236773e-06, 'epoch': 0.6114790286975718, 'num_input_tokens_seen': 36306928, 'completed': '61.15% (554 / 906)', 'remaining time': '1:37:43', 'throughput': '1241.40', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:07:24,512 >> {'loss': 0.287, 'grad_norm': 3.94831919670105, 'learning_rate': 4.219304248923316e-06, 'epoch': 0.6125827814569537, 'num_input_tokens_seen': 36372464, 'completed': '61.26% (555 / 906)', 'remaining time': '1:37:24', 'throughput': '1232.73', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:07:37,766 >> {'loss': 0.3303, 'grad_norm': 4.536540985107422, 'learning_rate': 4.203554052856431e-06, 'epoch': 0.6136865342163356, 'num_input_tokens_seen': 36438000, 'completed': '61.37% (556 / 906)', 'remaining time': '1:37:06', 'throughput': '1236.22', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:07:51,090 >> {'loss': 0.2098, 'grad_norm': 3.447150230407715, 'learning_rate': 4.187821157214638e-06, 'epoch': 0.6147902869757175, 'num_input_tokens_seen': 36503536, 'completed': '61.48% (557 / 906)', 'remaining time': '1:36:47', 'throughput': '1229.65', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:08:04,475 >> {'loss': 0.2808, 'grad_norm': 4.320523738861084, 'learning_rate': 4.1721057719455845e-06, 'epoch': 0.6158940397350994, 'num_input_tokens_seen': 36569072, 'completed': '61.59% (558 / 906)', 'remaining time': '1:36:28', 'throughput': '1224.01', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:08:17,531 >> {'loss': 0.6881, 'grad_norm': 6.25832986831665, 'learning_rate': 4.156408106763259e-06, 'epoch': 0.6169977924944813, 'num_input_tokens_seen': 36634608, 'completed': '61.70% (559 / 906)', 'remaining time': '1:36:09', 'throughput': '1254.93', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:08:30,633 >> {'loss': 0.5689, 'grad_norm': 5.774060249328613, 'learning_rate': 4.1407283711451795e-06, 'epoch': 0.6181015452538632, 'num_input_tokens_seen': 36700144, 'completed': '61.81% (560 / 906)', 'remaining time': '1:35:50', 'throughput': '1250.54', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:08:43,934 >> {'loss': 0.4055, 'grad_norm': 5.2640461921691895, 'learning_rate': 4.125066774329605e-06, 'epoch': 0.6192052980132451, 'num_input_tokens_seen': 36765680, 'completed': '61.92% (561 / 906)', 'remaining time': '1:35:32', 'throughput': '1231.75', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:08:57,195 >> {'loss': 0.3318, 'grad_norm': 4.976436614990234, 'learning_rate': 4.109423525312738e-06, 'epoch': 0.6203090507726269, 'num_input_tokens_seen': 36831216, 'completed': '62.03% (562 / 906)', 'remaining time': '1:35:13', 'throughput': '1235.54', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:09:10,582 >> {'loss': 0.3969, 'grad_norm': 5.786040306091309, 'learning_rate': 4.093798832845941e-06, 'epoch': 0.6214128035320088, 'num_input_tokens_seen': 36896752, 'completed': '62.14% (563 / 906)', 'remaining time': '1:34:55', 'throughput': '1223.84', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:09:23,900 >> {'loss': 0.4451, 'grad_norm': 5.763451099395752, 'learning_rate': 4.078192905432949e-06, 'epoch': 0.6225165562913907, 'num_input_tokens_seen': 36962288, 'completed': '62.25% (564 / 906)', 'remaining time': '1:34:36', 'throughput': '1230.25', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:09:37,309 >> {'loss': 0.3197, 'grad_norm': 4.722074031829834, 'learning_rate': 4.0626059513270885e-06, 'epoch': 0.6236203090507726, 'num_input_tokens_seen': 37027824, 'completed': '62.36% (565 / 906)', 'remaining time': '1:34:17', 'throughput': '1221.82', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:09:50,682 >> {'loss': 0.2824, 'grad_norm': 4.204012870788574, 'learning_rate': 4.047038178528494e-06, 'epoch': 0.6247240618101545, 'num_input_tokens_seen': 37093360, 'completed': '62.47% (566 / 906)', 'remaining time': '1:33:59', 'throughput': '1225.18', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:10:03,862 >> {'loss': 0.4326, 'grad_norm': 4.942794322967529, 'learning_rate': 4.0314897947813315e-06, 'epoch': 0.6258278145695364, 'num_input_tokens_seen': 37158896, 'completed': '62.58% (567 / 906)', 'remaining time': '1:33:40', 'throughput': '1243.03', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:10:17,128 >> {'loss': 0.3502, 'grad_norm': 4.9892730712890625, 'learning_rate': 4.015961007571036e-06, 'epoch': 0.6269315673289183, 'num_input_tokens_seen': 37224432, 'completed': '62.69% (568 / 906)', 'remaining time': '1:33:22', 'throughput': '1235.12', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:10:30,448 >> {'loss': 0.3456, 'grad_norm': 4.182072639465332, 'learning_rate': 4.000452024121534e-06, 'epoch': 0.6280353200883002, 'num_input_tokens_seen': 37289968, 'completed': '62.80% (569 / 906)', 'remaining time': '1:33:03', 'throughput': '1230.01', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:10:43,826 >> {'loss': 0.2242, 'grad_norm': 3.5467734336853027, 'learning_rate': 3.9849630513924844e-06, 'epoch': 0.6291390728476821, 'num_input_tokens_seen': 37355504, 'completed': '62.91% (570 / 906)', 'remaining time': '1:32:45', 'throughput': '1224.67', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:10:57,011 >> {'loss': 0.397, 'grad_norm': 4.3329033851623535, 'learning_rate': 3.9694942960765035e-06, 'epoch': 0.630242825607064, 'num_input_tokens_seen': 37421040, 'completed': '63.02% (571 / 906)', 'remaining time': '1:32:26', 'throughput': '1242.61', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:11:10,288 >> {'loss': 0.2883, 'grad_norm': 4.063337326049805, 'learning_rate': 3.954045964596425e-06, 'epoch': 0.6313465783664459, 'num_input_tokens_seen': 37486576, 'completed': '63.13% (572 / 906)', 'remaining time': '1:32:08', 'throughput': '1234.00', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:11:23,444 >> {'loss': 0.5777, 'grad_norm': 5.374694347381592, 'learning_rate': 3.938618263102534e-06, 'epoch': 0.6324503311258278, 'num_input_tokens_seen': 37552112, 'completed': '63.25% (573 / 906)', 'remaining time': '1:31:49', 'throughput': '1245.41', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:11:36,689 >> {'loss': 0.5059, 'grad_norm': 5.350125789642334, 'learning_rate': 3.923211397469818e-06, 'epoch': 0.6335540838852097, 'num_input_tokens_seen': 37617648, 'completed': '63.36% (574 / 906)', 'remaining time': '1:31:31', 'throughput': '1236.98', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:11:49,884 >> {'loss': 0.3395, 'grad_norm': 4.029745101928711, 'learning_rate': 3.9078255732952244e-06, 'epoch': 0.6346578366445916, 'num_input_tokens_seen': 37683184, 'completed': '63.47% (575 / 906)', 'remaining time': '1:31:12', 'throughput': '1241.72', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:12:03,242 >> {'loss': 0.3145, 'grad_norm': 4.23199987411499, 'learning_rate': 3.8924609958949035e-06, 'epoch': 0.6357615894039735, 'num_input_tokens_seen': 37748720, 'completed': '63.58% (576 / 906)', 'remaining time': '1:30:54', 'throughput': '1226.46', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:12:16,538 >> {'loss': 0.2616, 'grad_norm': 3.817701578140259, 'learning_rate': 3.877117870301488e-06, 'epoch': 0.6368653421633554, 'num_input_tokens_seen': 37814256, 'completed': '63.69% (577 / 906)', 'remaining time': '1:30:36', 'throughput': '1232.26', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:12:29,742 >> {'loss': 0.4336, 'grad_norm': 5.496271133422852, 'learning_rate': 3.861796401261341e-06, 'epoch': 0.6379690949227373, 'num_input_tokens_seen': 37879792, 'completed': '63.80% (578 / 906)', 'remaining time': '1:30:17', 'throughput': '1240.91', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:12:42,874 >> {'loss': 0.5434, 'grad_norm': 5.724914073944092, 'learning_rate': 3.846496793231834e-06, 'epoch': 0.6390728476821192, 'num_input_tokens_seen': 37945328, 'completed': '63.91% (579 / 906)', 'remaining time': '1:29:59', 'throughput': '1247.57', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:12:56,110 >> {'loss': 0.4194, 'grad_norm': 5.538677215576172, 'learning_rate': 3.8312192503786085e-06, 'epoch': 0.6401766004415012, 'num_input_tokens_seen': 38010864, 'completed': '64.02% (580 / 906)', 'remaining time': '1:29:40', 'throughput': '1237.82', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:13:09,507 >> {'loss': 0.3671, 'grad_norm': 4.735373020172119, 'learning_rate': 3.81596397657286e-06, 'epoch': 0.6412803532008831, 'num_input_tokens_seen': 38076400, 'completed': '64.13% (581 / 906)', 'remaining time': '1:29:22', 'throughput': '1222.99', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:13:22,824 >> {'loss': 0.2745, 'grad_norm': 4.600485324859619, 'learning_rate': 3.80073117538862e-06, 'epoch': 0.6423841059602649, 'num_input_tokens_seen': 38141936, 'completed': '64.24% (582 / 906)', 'remaining time': '1:29:04', 'throughput': '1230.37', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:13:36,119 >> {'loss': 0.5071, 'grad_norm': 5.746383190155029, 'learning_rate': 3.785521050100025e-06, 'epoch': 0.6434878587196468, 'num_input_tokens_seen': 38207472, 'completed': '64.35% (583 / 906)', 'remaining time': '1:28:46', 'throughput': '1232.30', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:13:49,308 >> {'loss': 0.4335, 'grad_norm': 5.705986976623535, 'learning_rate': 3.7703338036786195e-06, 'epoch': 0.6445916114790287, 'num_input_tokens_seen': 38273008, 'completed': '64.46% (584 / 906)', 'remaining time': '1:28:27', 'throughput': '1242.20', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:14:02,568 >> {'loss': 0.4688, 'grad_norm': 6.0024614334106445, 'learning_rate': 3.7551696387906365e-06, 'epoch': 0.6456953642384106, 'num_input_tokens_seen': 38338544, 'completed': '64.57% (585 / 906)', 'remaining time': '1:28:09', 'throughput': '1235.62', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:14:15,891 >> {'loss': 0.5509, 'grad_norm': 6.416266918182373, 'learning_rate': 3.7400287577942994e-06, 'epoch': 0.6467991169977925, 'num_input_tokens_seen': 38404080, 'completed': '64.68% (586 / 906)', 'remaining time': '1:27:51', 'throughput': '1229.74', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:14:29,229 >> {'loss': 0.4493, 'grad_norm': 5.125373840332031, 'learning_rate': 3.7249113627371203e-06, 'epoch': 0.6479028697571744, 'num_input_tokens_seen': 38469616, 'completed': '64.79% (587 / 906)', 'remaining time': '1:27:33', 'throughput': '1228.38', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:14:42,482 >> {'loss': 0.4476, 'grad_norm': 5.068411827087402, 'learning_rate': 3.7098176553532015e-06, 'epoch': 0.6490066225165563, 'num_input_tokens_seen': 38535152, 'completed': '64.90% (588 / 906)', 'remaining time': '1:27:14', 'throughput': '1236.32', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:14:55,640 >> {'loss': 0.5361, 'grad_norm': 5.842297554016113, 'learning_rate': 3.6947478370605516e-06, 'epoch': 0.6501103752759382, 'num_input_tokens_seen': 38600688, 'completed': '65.01% (589 / 906)', 'remaining time': '1:26:56', 'throughput': '1245.13', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:15:08,819 >> {'loss': 0.4917, 'grad_norm': 4.763930320739746, 'learning_rate': 3.6797021089583794e-06, 'epoch': 0.6512141280353201, 'num_input_tokens_seen': 38666224, 'completed': '65.12% (590 / 906)', 'remaining time': '1:26:38', 'throughput': '1243.19', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:15:22,196 >> {'loss': 0.1964, 'grad_norm': 3.0710020065307617, 'learning_rate': 3.66468067182443e-06, 'epoch': 0.652317880794702, 'num_input_tokens_seen': 38731760, 'completed': '65.23% (591 / 906)', 'remaining time': '1:26:20', 'throughput': '1224.77', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:15:35,531 >> {'loss': 0.3169, 'grad_norm': 4.4214887619018555, 'learning_rate': 3.649683726112299e-06, 'epoch': 0.6534216335540839, 'num_input_tokens_seen': 38797296, 'completed': '65.34% (592 / 906)', 'remaining time': '1:26:02', 'throughput': '1228.70', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:15:49,040 >> {'loss': 0.454, 'grad_norm': 5.3136396408081055, 'learning_rate': 3.6347114719487496e-06, 'epoch': 0.6545253863134658, 'num_input_tokens_seen': 38862832, 'completed': '65.45% (593 / 906)', 'remaining time': '1:25:44', 'throughput': '1212.80', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:16:02,335 >> {'loss': 0.5138, 'grad_norm': 5.324778079986572, 'learning_rate': 3.6197641091310553e-06, 'epoch': 0.6556291390728477, 'num_input_tokens_seen': 38928368, 'completed': '65.56% (594 / 906)', 'remaining time': '1:25:26', 'throughput': '1232.29', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:16:15,588 >> {'loss': 0.4474, 'grad_norm': 5.356673717498779, 'learning_rate': 3.6048418371243222e-06, 'epoch': 0.6567328918322296, 'num_input_tokens_seen': 38993904, 'completed': '65.67% (595 / 906)', 'remaining time': '1:25:08', 'throughput': '1236.31', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:16:28,773 >> {'loss': 0.6614, 'grad_norm': 5.669276237487793, 'learning_rate': 3.5899448550588335e-06, 'epoch': 0.6578366445916115, 'num_input_tokens_seen': 39059440, 'completed': '65.78% (596 / 906)', 'remaining time': '1:24:50', 'throughput': '1242.57', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:16:41,983 >> {'loss': 0.4255, 'grad_norm': 4.830669403076172, 'learning_rate': 3.5750733617273914e-06, 'epoch': 0.6589403973509934, 'num_input_tokens_seen': 39124976, 'completed': '65.89% (597 / 906)', 'remaining time': '1:24:32', 'throughput': '1240.30', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:16:55,240 >> {'loss': 0.2942, 'grad_norm': 5.008633613586426, 'learning_rate': 3.560227555582665e-06, 'epoch': 0.6600441501103753, 'num_input_tokens_seen': 39190508, 'completed': '66.00% (598 / 906)', 'remaining time': '1:24:13', 'throughput': '1235.82', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:17:08,522 >> {'loss': 0.2513, 'grad_norm': 3.8209006786346436, 'learning_rate': 3.5454076347345367e-06, 'epoch': 0.6611479028697572, 'num_input_tokens_seen': 39256044, 'completed': '66.11% (599 / 906)', 'remaining time': '1:23:55', 'throughput': '1233.56', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:17:21,763 >> {'loss': 0.3439, 'grad_norm': 4.039699554443359, 'learning_rate': 3.5306137969474663e-06, 'epoch': 0.6622516556291391, 'num_input_tokens_seen': 39321580, 'completed': '66.23% (600 / 906)', 'remaining time': '1:23:37', 'throughput': '1237.35', 'gpu_mem_free': '30139MB'}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[INFO|trainer.py:3503] 2025-01-04 23:17:47,223 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-600
[INFO|configuration_utils.py:472] 2025-01-04 23:17:47,226 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-600/config.json
[INFO|configuration_utils.py:807] 2025-01-04 23:17:47,227 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-600/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-04 23:18:43,538 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-600/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-04 23:18:43,542 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-600/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-04 23:18:43,542 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-600/special_tokens_map.json
[WARNING|trainer.py:869] 2025-01-04 23:22:38,545 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 2400, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-04 23:22:52,215 >> {'loss': 0.3721, 'grad_norm': 5.03836727142334, 'learning_rate': 3.515846239637846e-06, 'epoch': 0.6633554083885209, 'num_input_tokens_seen': 39387116, 'completed': '66.34% (601 / 906)', 'remaining time': '1:26:00', 'throughput': '49.58', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:23:05,490 >> {'loss': 0.5748, 'grad_norm': 5.898911476135254, 'learning_rate': 3.5011051598713707e-06, 'epoch': 0.6644591611479028, 'num_input_tokens_seen': 39452652, 'completed': '66.45% (602 / 906)', 'remaining time': '1:25:42', 'throughput': '1234.15', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:23:18,867 >> {'loss': 0.6649, 'grad_norm': 6.085241794586182, 'learning_rate': 3.4863907543604e-06, 'epoch': 0.6655629139072847, 'num_input_tokens_seen': 39518188, 'completed': '66.56% (603 / 906)', 'remaining time': '1:25:23', 'throughput': '1224.77', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:23:32,228 >> {'loss': 0.4869, 'grad_norm': 5.235047340393066, 'learning_rate': 3.4717032194613455e-06, 'epoch': 0.6666666666666666, 'num_input_tokens_seen': 39583724, 'completed': '66.67% (604 / 906)', 'remaining time': '1:25:04', 'throughput': '1226.28', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:23:45,772 >> {'loss': 0.2515, 'grad_norm': 3.5258564949035645, 'learning_rate': 3.45704275117204e-06, 'epoch': 0.6677704194260485, 'num_input_tokens_seen': 39649260, 'completed': '66.78% (605 / 906)', 'remaining time': '1:24:46', 'throughput': '1209.69', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:23:59,362 >> {'loss': 0.3673, 'grad_norm': 4.411483287811279, 'learning_rate': 3.4424095451291273e-06, 'epoch': 0.6688741721854304, 'num_input_tokens_seen': 39714796, 'completed': '66.89% (606 / 906)', 'remaining time': '1:24:27', 'throughput': '1205.55', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:24:12,635 >> {'loss': 0.3262, 'grad_norm': 4.131246566772461, 'learning_rate': 3.4278037966054505e-06, 'epoch': 0.6699779249448123, 'num_input_tokens_seen': 39780332, 'completed': '67.00% (607 / 906)', 'remaining time': '1:24:08', 'throughput': '1234.43', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:24:25,783 >> {'loss': 0.6029, 'grad_norm': 6.033511161804199, 'learning_rate': 3.4132257005074424e-06, 'epoch': 0.6710816777041942, 'num_input_tokens_seen': 39845868, 'completed': '67.11% (608 / 906)', 'remaining time': '1:23:50', 'throughput': '1246.14', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:24:39,159 >> {'loss': 0.356, 'grad_norm': 4.950568675994873, 'learning_rate': 3.3986754513725308e-06, 'epoch': 0.6721854304635762, 'num_input_tokens_seen': 39911404, 'completed': '67.22% (609 / 906)', 'remaining time': '1:23:31', 'throughput': '1224.89', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:24:52,506 >> {'loss': 0.431, 'grad_norm': 5.005128860473633, 'learning_rate': 3.3841532433665425e-06, 'epoch': 0.673289183222958, 'num_input_tokens_seen': 39976940, 'completed': '67.33% (610 / 906)', 'remaining time': '1:23:13', 'throughput': '1227.51', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:25:05,645 >> {'loss': 0.7201, 'grad_norm': 6.317650318145752, 'learning_rate': 3.369659270281106e-06, 'epoch': 0.67439293598234, 'num_input_tokens_seen': 40042476, 'completed': '67.44% (611 / 906)', 'remaining time': '1:22:54', 'throughput': '1247.00', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:25:18,893 >> {'loss': 0.4634, 'grad_norm': 4.954858779907227, 'learning_rate': 3.3551937255310656e-06, 'epoch': 0.6754966887417219, 'num_input_tokens_seen': 40108012, 'completed': '67.55% (612 / 906)', 'remaining time': '1:22:35', 'throughput': '1236.72', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:25:32,168 >> {'loss': 0.4785, 'grad_norm': 5.401790142059326, 'learning_rate': 3.3407568021519086e-06, 'epoch': 0.6766004415011038, 'num_input_tokens_seen': 40173548, 'completed': '67.66% (613 / 906)', 'remaining time': '1:22:17', 'throughput': '1234.13', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:25:45,607 >> {'loss': 0.2134, 'grad_norm': 3.7082345485687256, 'learning_rate': 3.326348692797185e-06, 'epoch': 0.6777041942604857, 'num_input_tokens_seen': 40239084, 'completed': '67.77% (614 / 906)', 'remaining time': '1:21:58', 'throughput': '1219.16', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:25:58,871 >> {'loss': 0.4226, 'grad_norm': 4.993045806884766, 'learning_rate': 3.3119695897359318e-06, 'epoch': 0.6788079470198676, 'num_input_tokens_seen': 40304620, 'completed': '67.88% (615 / 906)', 'remaining time': '1:21:40', 'throughput': '1235.22', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:26:12,126 >> {'loss': 0.5581, 'grad_norm': 7.229109764099121, 'learning_rate': 3.2976196848501164e-06, 'epoch': 0.6799116997792495, 'num_input_tokens_seen': 40370156, 'completed': '67.99% (616 / 906)', 'remaining time': '1:21:21', 'throughput': '1236.03', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:26:25,418 >> {'loss': 0.3896, 'grad_norm': 5.062275409698486, 'learning_rate': 3.2832991696320647e-06, 'epoch': 0.6810154525386314, 'num_input_tokens_seen': 40435692, 'completed': '68.10% (617 / 906)', 'remaining time': '1:21:03', 'throughput': '1232.67', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:26:38,701 >> {'loss': 0.3345, 'grad_norm': 4.842869281768799, 'learning_rate': 3.2690082351819176e-06, 'epoch': 0.6821192052980133, 'num_input_tokens_seen': 40501228, 'completed': '68.21% (618 / 906)', 'remaining time': '1:20:44', 'throughput': '1233.46', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:26:51,890 >> {'loss': 0.3487, 'grad_norm': 4.732944011688232, 'learning_rate': 3.254747072205072e-06, 'epoch': 0.6832229580573952, 'num_input_tokens_seen': 40566764, 'completed': '68.32% (619 / 906)', 'remaining time': '1:20:26', 'throughput': '1242.23', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:27:05,205 >> {'loss': 0.3981, 'grad_norm': 5.033801078796387, 'learning_rate': 3.2405158710096437e-06, 'epoch': 0.6843267108167771, 'num_input_tokens_seen': 40632300, 'completed': '68.43% (620 / 906)', 'remaining time': '1:20:07', 'throughput': '1230.51', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:27:18,605 >> {'loss': 0.4127, 'grad_norm': 4.653602600097656, 'learning_rate': 3.2263148215039188e-06, 'epoch': 0.6854304635761589, 'num_input_tokens_seen': 40697836, 'completed': '68.54% (621 / 906)', 'remaining time': '1:19:49', 'throughput': '1222.73', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:27:31,961 >> {'loss': 0.4875, 'grad_norm': 5.868643760681152, 'learning_rate': 3.2121441131938257e-06, 'epoch': 0.6865342163355408, 'num_input_tokens_seen': 40763372, 'completed': '68.65% (622 / 906)', 'remaining time': '1:19:31', 'throughput': '1226.69', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:27:45,219 >> {'loss': 0.3392, 'grad_norm': 4.774325847625732, 'learning_rate': 3.198003935180406e-06, 'epoch': 0.6876379690949227, 'num_input_tokens_seen': 40828908, 'completed': '68.76% (623 / 906)', 'remaining time': '1:19:12', 'throughput': '1235.77', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:27:58,551 >> {'loss': 0.2351, 'grad_norm': 4.566282749176025, 'learning_rate': 3.183894476157288e-06, 'epoch': 0.6887417218543046, 'num_input_tokens_seen': 40894444, 'completed': '68.87% (624 / 906)', 'remaining time': '1:18:54', 'throughput': '1228.91', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:28:11,779 >> {'loss': 0.5228, 'grad_norm': 6.035506725311279, 'learning_rate': 3.1698159244081728e-06, 'epoch': 0.6898454746136865, 'num_input_tokens_seen': 40959980, 'completed': '68.98% (625 / 906)', 'remaining time': '1:18:35', 'throughput': '1238.60', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:28:25,001 >> {'loss': 0.4605, 'grad_norm': 5.202334880828857, 'learning_rate': 3.1557684678043145e-06, 'epoch': 0.6909492273730684, 'num_input_tokens_seen': 41025516, 'completed': '69.09% (626 / 906)', 'remaining time': '1:18:17', 'throughput': '1239.11', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:28:38,272 >> {'loss': 0.4955, 'grad_norm': 5.597423553466797, 'learning_rate': 3.1417522938020227e-06, 'epoch': 0.6920529801324503, 'num_input_tokens_seen': 41091052, 'completed': '69.21% (627 / 906)', 'remaining time': '1:17:59', 'throughput': '1234.56', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:28:51,608 >> {'loss': 0.3477, 'grad_norm': 4.580325603485107, 'learning_rate': 3.127767589440154e-06, 'epoch': 0.6931567328918322, 'num_input_tokens_seen': 41156588, 'completed': '69.32% (628 / 906)', 'remaining time': '1:17:40', 'throughput': '1228.56', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:29:04,963 >> {'loss': 0.4553, 'grad_norm': 5.499457836151123, 'learning_rate': 3.1138145413376187e-06, 'epoch': 0.6942604856512141, 'num_input_tokens_seen': 41222124, 'completed': '69.43% (629 / 906)', 'remaining time': '1:17:22', 'throughput': '1226.80', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:29:18,332 >> {'loss': 0.36, 'grad_norm': 4.949913024902344, 'learning_rate': 3.0998933356908933e-06, 'epoch': 0.695364238410596, 'num_input_tokens_seen': 41287660, 'completed': '69.54% (630 / 906)', 'remaining time': '1:17:04', 'throughput': '1225.59', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:29:31,549 >> {'loss': 0.298, 'grad_norm': 4.248452663421631, 'learning_rate': 3.086004158271526e-06, 'epoch': 0.6964679911699779, 'num_input_tokens_seen': 41353196, 'completed': '69.65% (631 / 906)', 'remaining time': '1:16:46', 'throughput': '1239.53', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:29:44,831 >> {'loss': 0.2398, 'grad_norm': 3.7569615840911865, 'learning_rate': 3.072147194423668e-06, 'epoch': 0.6975717439293598, 'num_input_tokens_seen': 41418732, 'completed': '69.76% (632 / 906)', 'remaining time': '1:16:27', 'throughput': '1233.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:29:58,370 >> {'loss': 0.3882, 'grad_norm': 4.620909690856934, 'learning_rate': 3.058322629061598e-06, 'epoch': 0.6986754966887417, 'num_input_tokens_seen': 41484268, 'completed': '69.87% (633 / 906)', 'remaining time': '1:16:09', 'throughput': '1210.12', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:30:11,822 >> {'loss': 0.5345, 'grad_norm': 6.98662805557251, 'learning_rate': 3.044530646667251e-06, 'epoch': 0.6997792494481236, 'num_input_tokens_seen': 41549804, 'completed': '69.98% (634 / 906)', 'remaining time': '1:15:51', 'throughput': '1217.93', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:30:25,288 >> {'loss': 0.4899, 'grad_norm': 5.646053791046143, 'learning_rate': 3.0307714312877588e-06, 'epoch': 0.7008830022075055, 'num_input_tokens_seen': 41615340, 'completed': '70.09% (635 / 906)', 'remaining time': '1:15:33', 'throughput': '1216.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:30:38,828 >> {'loss': 0.4995, 'grad_norm': 11.439247131347656, 'learning_rate': 3.0170451665329936e-06, 'epoch': 0.7019867549668874, 'num_input_tokens_seen': 41680876, 'completed': '70.20% (636 / 906)', 'remaining time': '1:15:15', 'throughput': '1210.05', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:30:52,240 >> {'loss': 0.4077, 'grad_norm': 5.286317825317383, 'learning_rate': 3.0033520355731182e-06, 'epoch': 0.7030905077262694, 'num_input_tokens_seen': 41746412, 'completed': '70.31% (637 / 906)', 'remaining time': '1:14:57', 'throughput': '1221.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:31:05,524 >> {'loss': 0.5231, 'grad_norm': 6.521766662597656, 'learning_rate': 2.9896922211361423e-06, 'epoch': 0.7041942604856513, 'num_input_tokens_seen': 41811948, 'completed': '70.42% (638 / 906)', 'remaining time': '1:14:39', 'throughput': '1233.41', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:31:18,770 >> {'loss': 0.3481, 'grad_norm': 4.848811149597168, 'learning_rate': 2.9760659055054826e-06, 'epoch': 0.7052980132450332, 'num_input_tokens_seen': 41877484, 'completed': '70.53% (639 / 906)', 'remaining time': '1:14:20', 'throughput': '1236.86', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:31:32,013 >> {'loss': 0.3645, 'grad_norm': 4.405468940734863, 'learning_rate': 2.962473270517528e-06, 'epoch': 0.7064017660044151, 'num_input_tokens_seen': 41943020, 'completed': '70.64% (640 / 906)', 'remaining time': '1:14:02', 'throughput': '1237.20', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:31:45,448 >> {'loss': 0.2403, 'grad_norm': 3.9719290733337402, 'learning_rate': 2.94891449755922e-06, 'epoch': 0.7075055187637969, 'num_input_tokens_seen': 42008556, 'completed': '70.75% (641 / 906)', 'remaining time': '1:13:44', 'throughput': '1219.50', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:31:58,868 >> {'loss': 0.4094, 'grad_norm': 5.281460762023926, 'learning_rate': 2.9353897675656267e-06, 'epoch': 0.7086092715231788, 'num_input_tokens_seen': 42074092, 'completed': '70.86% (642 / 906)', 'remaining time': '1:13:26', 'throughput': '1220.87', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:32:12,196 >> {'loss': 0.2112, 'grad_norm': 3.758775234222412, 'learning_rate': 2.9218992610175324e-06, 'epoch': 0.7097130242825607, 'num_input_tokens_seen': 42139628, 'completed': '70.97% (643 / 906)', 'remaining time': '1:13:08', 'throughput': '1229.29', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:32:25,517 >> {'loss': 0.6355, 'grad_norm': 6.650103569030762, 'learning_rate': 2.9084431579390204e-06, 'epoch': 0.7108167770419426, 'num_input_tokens_seen': 42205164, 'completed': '71.08% (644 / 906)', 'remaining time': '1:12:50', 'throughput': '1229.90', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:32:38,755 >> {'loss': 0.5132, 'grad_norm': 5.46085786819458, 'learning_rate': 2.8950216378950824e-06, 'epoch': 0.7119205298013245, 'num_input_tokens_seen': 42270700, 'completed': '71.19% (645 / 906)', 'remaining time': '1:12:32', 'throughput': '1237.68', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:32:51,959 >> {'loss': 0.3778, 'grad_norm': 5.055593967437744, 'learning_rate': 2.8816348799892134e-06, 'epoch': 0.7130242825607064, 'num_input_tokens_seen': 42336236, 'completed': '71.30% (646 / 906)', 'remaining time': '1:12:14', 'throughput': '1240.84', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:33:05,414 >> {'loss': 0.2851, 'grad_norm': 3.933264970779419, 'learning_rate': 2.868283062861028e-06, 'epoch': 0.7141280353200883, 'num_input_tokens_seen': 42401772, 'completed': '71.41% (647 / 906)', 'remaining time': '1:11:56', 'throughput': '1217.70', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:33:18,773 >> {'loss': 0.4315, 'grad_norm': 5.225867748260498, 'learning_rate': 2.854966364683872e-06, 'epoch': 0.7152317880794702, 'num_input_tokens_seen': 42467308, 'completed': '71.52% (648 / 906)', 'remaining time': '1:11:38', 'throughput': '1226.44', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:33:32,078 >> {'loss': 0.3846, 'grad_norm': 4.521208763122559, 'learning_rate': 2.8416849631624453e-06, 'epoch': 0.7163355408388521, 'num_input_tokens_seen': 42532844, 'completed': '71.63% (649 / 906)', 'remaining time': '1:11:20', 'throughput': '1231.42', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:33:45,330 >> {'loss': 0.5581, 'grad_norm': 4.635922908782959, 'learning_rate': 2.8284390355304325e-06, 'epoch': 0.717439293598234, 'num_input_tokens_seen': 42598380, 'completed': '71.74% (650 / 906)', 'remaining time': '1:11:02', 'throughput': '1236.31', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:33:58,649 >> {'loss': 0.4154, 'grad_norm': 4.856995105743408, 'learning_rate': 2.8152287585481384e-06, 'epoch': 0.7185430463576159, 'num_input_tokens_seen': 42663916, 'completed': '71.85% (651 / 906)', 'remaining time': '1:10:44', 'throughput': '1230.10', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:34:11,957 >> {'loss': 0.4506, 'grad_norm': 4.650574684143066, 'learning_rate': 2.802054308500125e-06, 'epoch': 0.7196467991169978, 'num_input_tokens_seen': 42729452, 'completed': '71.96% (652 / 906)', 'remaining time': '1:10:26', 'throughput': '1231.11', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:34:25,256 >> {'loss': 0.406, 'grad_norm': 5.974058628082275, 'learning_rate': 2.7889158611928647e-06, 'epoch': 0.7207505518763797, 'num_input_tokens_seen': 42794988, 'completed': '72.08% (653 / 906)', 'remaining time': '1:10:08', 'throughput': '1232.03', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:34:38,635 >> {'loss': 0.2127, 'grad_norm': 3.3415307998657227, 'learning_rate': 2.775813591952385e-06, 'epoch': 0.7218543046357616, 'num_input_tokens_seen': 42860524, 'completed': '72.19% (654 / 906)', 'remaining time': '1:09:50', 'throughput': '1224.62', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:34:51,836 >> {'loss': 0.4521, 'grad_norm': 5.096837997436523, 'learning_rate': 2.7627476756219416e-06, 'epoch': 0.7229580573951435, 'num_input_tokens_seen': 42926060, 'completed': '72.30% (655 / 906)', 'remaining time': '1:09:32', 'throughput': '1241.05', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:35:05,071 >> {'loss': 0.488, 'grad_norm': 5.722452640533447, 'learning_rate': 2.7497182865596785e-06, 'epoch': 0.7240618101545254, 'num_input_tokens_seen': 42991596, 'completed': '72.41% (656 / 906)', 'remaining time': '1:09:14', 'throughput': '1237.98', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:35:18,221 >> {'loss': 0.5543, 'grad_norm': 5.530818462371826, 'learning_rate': 2.7367255986362995e-06, 'epoch': 0.7251655629139073, 'num_input_tokens_seen': 43057132, 'completed': '72.52% (657 / 906)', 'remaining time': '1:08:56', 'throughput': '1245.94', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:35:31,676 >> {'loss': 0.2923, 'grad_norm': 4.10623836517334, 'learning_rate': 2.7237697852327465e-06, 'epoch': 0.7262693156732892, 'num_input_tokens_seen': 43122668, 'completed': '72.63% (658 / 906)', 'remaining time': '1:08:39', 'throughput': '1217.67', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:35:44,964 >> {'loss': 0.4849, 'grad_norm': 5.076680660247803, 'learning_rate': 2.7108510192378956e-06, 'epoch': 0.7273730684326711, 'num_input_tokens_seen': 43188204, 'completed': '72.74% (659 / 906)', 'remaining time': '1:08:21', 'throughput': '1232.95', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:35:58,270 >> {'loss': 0.504, 'grad_norm': 5.36618185043335, 'learning_rate': 2.697969473046239e-06, 'epoch': 0.7284768211920529, 'num_input_tokens_seen': 43253740, 'completed': '72.85% (660 / 906)', 'remaining time': '1:08:03', 'throughput': '1231.35', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:36:11,567 >> {'loss': 0.2422, 'grad_norm': 3.818769693374634, 'learning_rate': 2.685125318555595e-06, 'epoch': 0.7295805739514348, 'num_input_tokens_seen': 43319276, 'completed': '72.96% (661 / 906)', 'remaining time': '1:07:45', 'throughput': '1232.16', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:36:24,842 >> {'loss': 0.3608, 'grad_norm': 4.558277606964111, 'learning_rate': 2.672318727164803e-06, 'epoch': 0.7306843267108167, 'num_input_tokens_seen': 43384812, 'completed': '73.07% (662 / 906)', 'remaining time': '1:07:27', 'throughput': '1234.19', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:36:38,363 >> {'loss': 0.1414, 'grad_norm': 3.284147262573242, 'learning_rate': 2.659549869771442e-06, 'epoch': 0.7317880794701986, 'num_input_tokens_seen': 43450348, 'completed': '73.18% (663 / 906)', 'remaining time': '1:07:10', 'throughput': '1211.74', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:36:51,748 >> {'loss': 0.3462, 'grad_norm': 4.764499187469482, 'learning_rate': 2.646818916769551e-06, 'epoch': 0.7328918322295805, 'num_input_tokens_seen': 43515884, 'completed': '73.29% (664 / 906)', 'remaining time': '1:06:52', 'throughput': '1224.11', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:37:05,157 >> {'loss': 0.584, 'grad_norm': 6.250286102294922, 'learning_rate': 2.6341260380473522e-06, 'epoch': 0.7339955849889624, 'num_input_tokens_seen': 43581420, 'completed': '73.40% (665 / 906)', 'remaining time': '1:06:34', 'throughput': '1221.80', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:37:18,463 >> {'loss': 0.4575, 'grad_norm': 5.290203094482422, 'learning_rate': 2.621471402984991e-06, 'epoch': 0.7350993377483444, 'num_input_tokens_seen': 43646956, 'completed': '73.51% (666 / 906)', 'remaining time': '1:06:16', 'throughput': '1231.35', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:37:31,604 >> {'loss': 0.3306, 'grad_norm': 4.591466903686523, 'learning_rate': 2.60885518045226e-06, 'epoch': 0.7362030905077263, 'num_input_tokens_seen': 43712492, 'completed': '73.62% (667 / 906)', 'remaining time': '1:05:59', 'throughput': '1246.78', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:37:44,933 >> {'loss': 0.1541, 'grad_norm': 3.301421642303467, 'learning_rate': 2.5962775388063653e-06, 'epoch': 0.7373068432671082, 'num_input_tokens_seen': 43778028, 'completed': '73.73% (668 / 906)', 'remaining time': '1:05:41', 'throughput': '1229.23', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:37:58,157 >> {'loss': 0.3604, 'grad_norm': 4.60337495803833, 'learning_rate': 2.5837386458896665e-06, 'epoch': 0.7384105960264901, 'num_input_tokens_seen': 43843564, 'completed': '73.84% (669 / 906)', 'remaining time': '1:05:23', 'throughput': '1238.91', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:38:11,412 >> {'loss': 0.5413, 'grad_norm': 5.7256879806518555, 'learning_rate': 2.5712386690274405e-06, 'epoch': 0.739514348785872, 'num_input_tokens_seen': 43909100, 'completed': '73.95% (670 / 906)', 'remaining time': '1:05:05', 'throughput': '1236.08', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:38:24,730 >> {'loss': 0.3497, 'grad_norm': 4.792141914367676, 'learning_rate': 2.55877777502565e-06, 'epoch': 0.7406181015452539, 'num_input_tokens_seen': 43974636, 'completed': '74.06% (671 / 906)', 'remaining time': '1:04:48', 'throughput': '1230.20', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:38:38,121 >> {'loss': 0.3359, 'grad_norm': 4.639016628265381, 'learning_rate': 2.5463561301687122e-06, 'epoch': 0.7417218543046358, 'num_input_tokens_seen': 44040172, 'completed': '74.17% (672 / 906)', 'remaining time': '1:04:30', 'throughput': '1223.55', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:38:51,292 >> {'loss': 0.6212, 'grad_norm': 6.675306797027588, 'learning_rate': 2.533973900217292e-06, 'epoch': 0.7428256070640177, 'num_input_tokens_seen': 44105708, 'completed': '74.28% (673 / 906)', 'remaining time': '1:04:12', 'throughput': '1243.97', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:39:04,723 >> {'loss': 0.3184, 'grad_norm': 4.351767063140869, 'learning_rate': 2.521631250406076e-06, 'epoch': 0.7439293598233996, 'num_input_tokens_seen': 44171244, 'completed': '74.39% (674 / 906)', 'remaining time': '1:03:55', 'throughput': '1219.78', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:39:17,958 >> {'loss': 0.3898, 'grad_norm': 5.042402744293213, 'learning_rate': 2.5093283454415753e-06, 'epoch': 0.7450331125827815, 'num_input_tokens_seen': 44236780, 'completed': '74.50% (675 / 906)', 'remaining time': '1:03:37', 'throughput': '1238.02', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:39:31,210 >> {'loss': 0.5072, 'grad_norm': 5.825693130493164, 'learning_rate': 2.4970653494999233e-06, 'epoch': 0.7461368653421634, 'num_input_tokens_seen': 44302316, 'completed': '74.61% (676 / 906)', 'remaining time': '1:03:19', 'throughput': '1236.26', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:39:44,567 >> {'loss': 0.2332, 'grad_norm': 3.8188884258270264, 'learning_rate': 2.484842426224692e-06, 'epoch': 0.7472406181015453, 'num_input_tokens_seen': 44367852, 'completed': '74.72% (677 / 906)', 'remaining time': '1:03:02', 'throughput': '1226.66', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:39:57,854 >> {'loss': 0.307, 'grad_norm': 4.3476033210754395, 'learning_rate': 2.4726597387247e-06, 'epoch': 0.7483443708609272, 'num_input_tokens_seen': 44433388, 'completed': '74.83% (678 / 906)', 'remaining time': '1:02:44', 'throughput': '1233.11', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:40:11,274 >> {'loss': 0.3315, 'grad_norm': 5.082565784454346, 'learning_rate': 2.4605174495718426e-06, 'epoch': 0.7494481236203091, 'num_input_tokens_seen': 44498924, 'completed': '74.94% (679 / 906)', 'remaining time': '1:02:27', 'throughput': '1220.87', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:40:24,585 >> {'loss': 0.4357, 'grad_norm': 5.1927032470703125, 'learning_rate': 2.4484157207989172e-06, 'epoch': 0.7505518763796909, 'num_input_tokens_seen': 44564460, 'completed': '75.06% (680 / 906)', 'remaining time': '1:02:09', 'throughput': '1230.85', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:40:37,834 >> {'loss': 0.3239, 'grad_norm': 4.175013542175293, 'learning_rate': 2.4363547138974615e-06, 'epoch': 0.7516556291390728, 'num_input_tokens_seen': 44629996, 'completed': '75.17% (681 / 906)', 'remaining time': '1:01:52', 'throughput': '1236.55', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:40:51,107 >> {'loss': 0.2688, 'grad_norm': 3.897489070892334, 'learning_rate': 2.4243345898156036e-06, 'epoch': 0.7527593818984547, 'num_input_tokens_seen': 44695532, 'completed': '75.28% (682 / 906)', 'remaining time': '1:01:34', 'throughput': '1234.40', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:41:04,428 >> {'loss': 0.2775, 'grad_norm': 3.912888288497925, 'learning_rate': 2.4123555089559084e-06, 'epoch': 0.7538631346578366, 'num_input_tokens_seen': 44761068, 'completed': '75.39% (683 / 906)', 'remaining time': '1:01:16', 'throughput': '1229.94', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:41:17,795 >> {'loss': 0.1913, 'grad_norm': 3.3684606552124023, 'learning_rate': 2.4004176311732407e-06, 'epoch': 0.7549668874172185, 'num_input_tokens_seen': 44826604, 'completed': '75.50% (684 / 906)', 'remaining time': '1:00:59', 'throughput': '1225.70', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:41:31,115 >> {'loss': 0.1278, 'grad_norm': 2.597721815109253, 'learning_rate': 2.388521115772631e-06, 'epoch': 0.7560706401766004, 'num_input_tokens_seen': 44892140, 'completed': '75.61% (685 / 906)', 'remaining time': '1:00:41', 'throughput': '1230.08', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:41:44,377 >> {'loss': 0.4902, 'grad_norm': 5.179257869720459, 'learning_rate': 2.3766661215071473e-06, 'epoch': 0.7571743929359823, 'num_input_tokens_seen': 44957676, 'completed': '75.72% (686 / 906)', 'remaining time': '1:00:24', 'throughput': '1235.37', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:41:57,620 >> {'loss': 0.5226, 'grad_norm': 5.601065635681152, 'learning_rate': 2.364852806575782e-06, 'epoch': 0.7582781456953642, 'num_input_tokens_seen': 45023212, 'completed': '75.83% (687 / 906)', 'remaining time': '1:00:06', 'throughput': '1237.18', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:42:11,054 >> {'loss': 0.3309, 'grad_norm': 4.68391752243042, 'learning_rate': 2.353081328621335e-06, 'epoch': 0.7593818984547461, 'num_input_tokens_seen': 45088748, 'completed': '75.94% (688 / 906)', 'remaining time': '0:59:49', 'throughput': '1219.58', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:42:24,385 >> {'loss': 0.3436, 'grad_norm': 4.620096683502197, 'learning_rate': 2.3413518447283145e-06, 'epoch': 0.760485651214128, 'num_input_tokens_seen': 45154284, 'completed': '76.05% (689 / 906)', 'remaining time': '0:59:32', 'throughput': '1229.08', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:42:37,659 >> {'loss': 0.2027, 'grad_norm': 3.8575119972229004, 'learning_rate': 2.329664511420835e-06, 'epoch': 0.7615894039735099, 'num_input_tokens_seen': 45219820, 'completed': '76.16% (690 / 906)', 'remaining time': '0:59:14', 'throughput': '1234.30', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:42:50,789 >> {'loss': 0.5122, 'grad_norm': 5.794271945953369, 'learning_rate': 2.3180194846605367e-06, 'epoch': 0.7626931567328918, 'num_input_tokens_seen': 45285356, 'completed': '76.27% (691 / 906)', 'remaining time': '0:58:57', 'throughput': '1247.79', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:43:04,040 >> {'loss': 0.1909, 'grad_norm': 3.642765998840332, 'learning_rate': 2.3064169198444988e-06, 'epoch': 0.7637969094922737, 'num_input_tokens_seen': 45350892, 'completed': '76.38% (692 / 906)', 'remaining time': '0:58:39', 'throughput': '1236.40', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:43:17,361 >> {'loss': 0.3171, 'grad_norm': 4.802554130554199, 'learning_rate': 2.2948569718031665e-06, 'epoch': 0.7649006622516556, 'num_input_tokens_seen': 45416428, 'completed': '76.49% (693 / 906)', 'remaining time': '0:58:22', 'throughput': '1229.93', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:43:30,688 >> {'loss': 0.2115, 'grad_norm': 3.871217727661133, 'learning_rate': 2.283339794798286e-06, 'epoch': 0.7660044150110376, 'num_input_tokens_seen': 45481964, 'completed': '76.60% (694 / 906)', 'remaining time': '0:58:04', 'throughput': '1229.42', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:43:43,947 >> {'loss': 0.2803, 'grad_norm': 4.514727592468262, 'learning_rate': 2.2718655425208443e-06, 'epoch': 0.7671081677704195, 'num_input_tokens_seen': 45547500, 'completed': '76.71% (695 / 906)', 'remaining time': '0:57:47', 'throughput': '1235.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:43:57,149 >> {'loss': 0.3749, 'grad_norm': 4.732895851135254, 'learning_rate': 2.26043436808902e-06, 'epoch': 0.7682119205298014, 'num_input_tokens_seen': 45613036, 'completed': '76.82% (696 / 906)', 'remaining time': '0:57:30', 'throughput': '1241.04', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:44:10,598 >> {'loss': 0.2932, 'grad_norm': 4.663212776184082, 'learning_rate': 2.2490464240461386e-06, 'epoch': 0.7693156732891833, 'num_input_tokens_seen': 45678572, 'completed': '76.93% (697 / 906)', 'remaining time': '0:57:12', 'throughput': '1218.26', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:44:23,898 >> {'loss': 0.2619, 'grad_norm': 4.421245574951172, 'learning_rate': 2.2377018623586392e-06, 'epoch': 0.7704194260485652, 'num_input_tokens_seen': 45744108, 'completed': '77.04% (698 / 906)', 'remaining time': '0:56:55', 'throughput': '1231.84', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:44:37,136 >> {'loss': 0.351, 'grad_norm': 4.790342807769775, 'learning_rate': 2.2264008344140444e-06, 'epoch': 0.7715231788079471, 'num_input_tokens_seen': 45809644, 'completed': '77.15% (699 / 906)', 'remaining time': '0:56:38', 'throughput': '1237.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:44:50,450 >> {'loss': 0.307, 'grad_norm': 4.666688919067383, 'learning_rate': 2.2151434910189397e-06, 'epoch': 0.7726269315673289, 'num_input_tokens_seen': 45875180, 'completed': '77.26% (700 / 906)', 'remaining time': '0:56:20', 'throughput': '1230.57', 'gpu_mem_free': '30131MB'}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[INFO|trainer.py:3503] 2025-01-04 23:45:16,019 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-700
[INFO|configuration_utils.py:472] 2025-01-04 23:45:16,022 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-700/config.json
[INFO|configuration_utils.py:807] 2025-01-04 23:45:16,023 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-700/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-04 23:46:12,508 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-700/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-04 23:46:12,511 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-700/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-04 23:46:12,512 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-700/special_tokens_map.json
[WARNING|trainer.py:869] 2025-01-04 23:50:07,599 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 2800, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-04 23:50:21,389 >> {'loss': 0.5548, 'grad_norm': 6.7703704833984375, 'learning_rate': 2.2039299823969623e-06, 'epoch': 0.7737306843267108, 'num_input_tokens_seen': 45940716, 'completed': '77.37% (701 / 906)', 'remaining time': '0:57:36', 'throughput': '49.51', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:50:35,193 >> {'loss': 0.2819, 'grad_norm': 4.884865760803223, 'learning_rate': 2.1927604581867974e-06, 'epoch': 0.7748344370860927, 'num_input_tokens_seen': 46006252, 'completed': '77.48% (702 / 906)', 'remaining time': '0:57:18', 'throughput': '1186.90', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:50:48,973 >> {'loss': 0.2164, 'grad_norm': 3.6003012657165527, 'learning_rate': 2.1816350674401804e-06, 'epoch': 0.7759381898454746, 'num_input_tokens_seen': 46071788, 'completed': '77.59% (703 / 906)', 'remaining time': '0:57:00', 'throughput': '1188.95', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:51:02,670 >> {'loss': 0.2567, 'grad_norm': 4.006162166595459, 'learning_rate': 2.1705539586199037e-06, 'epoch': 0.7770419426048565, 'num_input_tokens_seen': 46137324, 'completed': '77.70% (704 / 906)', 'remaining time': '0:56:43', 'throughput': '1196.17', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:51:16,226 >> {'loss': 0.4782, 'grad_norm': 5.422812461853027, 'learning_rate': 2.159517279597844e-06, 'epoch': 0.7781456953642384, 'num_input_tokens_seen': 46202860, 'completed': '77.81% (705 / 906)', 'remaining time': '0:56:25', 'throughput': '1208.64', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:51:29,743 >> {'loss': 0.2962, 'grad_norm': 5.053647041320801, 'learning_rate': 2.148525177652982e-06, 'epoch': 0.7792494481236203, 'num_input_tokens_seen': 46268396, 'completed': '77.92% (706 / 906)', 'remaining time': '0:56:07', 'throughput': '1212.15', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:51:43,485 >> {'loss': 0.3387, 'grad_norm': 4.703584671020508, 'learning_rate': 2.1375777994694415e-06, 'epoch': 0.7803532008830022, 'num_input_tokens_seen': 46333932, 'completed': '78.04% (707 / 906)', 'remaining time': '0:55:49', 'throughput': '1192.27', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:51:57,052 >> {'loss': 0.7101, 'grad_norm': 6.285885810852051, 'learning_rate': 2.1266752911345293e-06, 'epoch': 0.7814569536423841, 'num_input_tokens_seen': 46399468, 'completed': '78.15% (708 / 906)', 'remaining time': '0:55:31', 'throughput': '1207.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:52:10,646 >> {'loss': 0.4221, 'grad_norm': 5.295046329498291, 'learning_rate': 2.1158177981367832e-06, 'epoch': 0.782560706401766, 'num_input_tokens_seen': 46465004, 'completed': '78.26% (709 / 906)', 'remaining time': '0:55:14', 'throughput': '1205.24', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:52:24,299 >> {'loss': 0.499, 'grad_norm': 5.126113414764404, 'learning_rate': 2.1050054653640382e-06, 'epoch': 0.7836644591611479, 'num_input_tokens_seen': 46530540, 'completed': '78.37% (710 / 906)', 'remaining time': '0:54:56', 'throughput': '1199.99', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:52:37,999 >> {'loss': 0.3448, 'grad_norm': 4.821897983551025, 'learning_rate': 2.0942384371014858e-06, 'epoch': 0.7847682119205298, 'num_input_tokens_seen': 46596076, 'completed': '78.48% (711 / 906)', 'remaining time': '0:54:38', 'throughput': '1195.96', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:52:51,565 >> {'loss': 0.5352, 'grad_norm': 5.19854211807251, 'learning_rate': 2.083516857029757e-06, 'epoch': 0.7858719646799117, 'num_input_tokens_seen': 46661612, 'completed': '78.59% (712 / 906)', 'remaining time': '0:54:21', 'throughput': '1207.67', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:53:05,337 >> {'loss': 0.3713, 'grad_norm': 4.652998924255371, 'learning_rate': 2.072840868222989e-06, 'epoch': 0.7869757174392936, 'num_input_tokens_seen': 46727148, 'completed': '78.70% (713 / 906)', 'remaining time': '0:54:03', 'throughput': '1189.69', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:53:19,037 >> {'loss': 0.3579, 'grad_norm': 4.630868434906006, 'learning_rate': 2.0622106131469346e-06, 'epoch': 0.7880794701986755, 'num_input_tokens_seen': 46792684, 'completed': '78.81% (714 / 906)', 'remaining time': '0:53:45', 'throughput': '1195.88', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:53:32,622 >> {'loss': 0.6049, 'grad_norm': 5.569841384887695, 'learning_rate': 2.0516262336570504e-06, 'epoch': 0.7891832229580574, 'num_input_tokens_seen': 46858220, 'completed': '78.92% (715 / 906)', 'remaining time': '0:53:28', 'throughput': '1206.09', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:53:46,426 >> {'loss': 0.2668, 'grad_norm': 4.281322479248047, 'learning_rate': 2.0410878709966055e-06, 'epoch': 0.7902869757174393, 'num_input_tokens_seen': 46923756, 'completed': '79.03% (716 / 906)', 'remaining time': '0:53:10', 'throughput': '1186.86', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:54:00,198 >> {'loss': 0.5742, 'grad_norm': 6.267445087432861, 'learning_rate': 2.0305956657947993e-06, 'epoch': 0.7913907284768212, 'num_input_tokens_seen': 46989292, 'completed': '79.14% (717 / 906)', 'remaining time': '0:52:53', 'throughput': '1189.69', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:54:13,811 >> {'loss': 0.6532, 'grad_norm': 6.611531734466553, 'learning_rate': 2.0201497580648804e-06, 'epoch': 0.7924944812362031, 'num_input_tokens_seen': 47054828, 'completed': '79.25% (718 / 906)', 'remaining time': '0:52:35', 'throughput': '1203.52', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:54:27,547 >> {'loss': 0.4038, 'grad_norm': 5.3773040771484375, 'learning_rate': 2.0097502872022808e-06, 'epoch': 0.7935982339955849, 'num_input_tokens_seen': 47120364, 'completed': '79.36% (719 / 906)', 'remaining time': '0:52:17', 'throughput': '1192.82', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:54:41,273 >> {'loss': 0.4642, 'grad_norm': 5.617459774017334, 'learning_rate': 1.999397391982758e-06, 'epoch': 0.7947019867549668, 'num_input_tokens_seen': 47185900, 'completed': '79.47% (720 / 906)', 'remaining time': '0:52:00', 'throughput': '1193.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:54:55,092 >> {'loss': 0.2961, 'grad_norm': 4.279394149780273, 'learning_rate': 1.98909121056054e-06, 'epoch': 0.7958057395143487, 'num_input_tokens_seen': 47251436, 'completed': '79.58% (721 / 906)', 'remaining time': '0:51:42', 'throughput': '1185.61', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:55:08,868 >> {'loss': 0.298, 'grad_norm': 4.948429107666016, 'learning_rate': 1.97883188046648e-06, 'epoch': 0.7969094922737306, 'num_input_tokens_seen': 47316972, 'completed': '79.69% (722 / 906)', 'remaining time': '0:51:25', 'throughput': '1189.29', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:55:22,632 >> {'loss': 0.406, 'grad_norm': 4.897294521331787, 'learning_rate': 1.9686195386062253e-06, 'epoch': 0.7980132450331126, 'num_input_tokens_seen': 47382508, 'completed': '79.80% (723 / 906)', 'remaining time': '0:51:07', 'throughput': '1190.35', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:55:36,465 >> {'loss': 0.1767, 'grad_norm': 3.2944583892822266, 'learning_rate': 1.958454321258391e-06, 'epoch': 0.7991169977924945, 'num_input_tokens_seen': 47448044, 'completed': '79.91% (724 / 906)', 'remaining time': '0:50:50', 'throughput': '1184.43', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:55:50,082 >> {'loss': 0.5323, 'grad_norm': 5.691059589385986, 'learning_rate': 1.948336364072736e-06, 'epoch': 0.8002207505518764, 'num_input_tokens_seen': 47513580, 'completed': '80.02% (725 / 906)', 'remaining time': '0:50:32', 'throughput': '1203.20', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:56:03,872 >> {'loss': 0.2576, 'grad_norm': 3.72227144241333, 'learning_rate': 1.9382658020683572e-06, 'epoch': 0.8013245033112583, 'num_input_tokens_seen': 47579116, 'completed': '80.13% (726 / 906)', 'remaining time': '0:50:15', 'throughput': '1188.08', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:56:17,545 >> {'loss': 0.4782, 'grad_norm': 5.053169250488281, 'learning_rate': 1.928242769631884e-06, 'epoch': 0.8024282560706402, 'num_input_tokens_seen': 47644652, 'completed': '80.24% (727 / 906)', 'remaining time': '0:49:57', 'throughput': '1198.32', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:56:31,409 >> {'loss': 0.2233, 'grad_norm': 3.867647171020508, 'learning_rate': 1.918267400515691e-06, 'epoch': 0.8035320088300221, 'num_input_tokens_seen': 47710188, 'completed': '80.35% (728 / 906)', 'remaining time': '0:49:40', 'throughput': '1181.79', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:56:45,099 >> {'loss': 0.5753, 'grad_norm': 6.746331691741943, 'learning_rate': 1.9083398278361077e-06, 'epoch': 0.804635761589404, 'num_input_tokens_seen': 47775724, 'completed': '80.46% (729 / 906)', 'remaining time': '0:49:22', 'throughput': '1196.79', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:56:58,848 >> {'loss': 0.4902, 'grad_norm': 5.674907207489014, 'learning_rate': 1.8984601840716443e-06, 'epoch': 0.8057395143487859, 'num_input_tokens_seen': 47841260, 'completed': '80.57% (730 / 906)', 'remaining time': '0:49:05', 'throughput': '1191.60', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:57:12,495 >> {'loss': 0.5091, 'grad_norm': 5.640385150909424, 'learning_rate': 1.8886286010612226e-06, 'epoch': 0.8068432671081678, 'num_input_tokens_seen': 47906796, 'completed': '80.68% (731 / 906)', 'remaining time': '0:48:47', 'throughput': '1200.58', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:57:26,155 >> {'loss': 0.2676, 'grad_norm': 4.219995975494385, 'learning_rate': 1.8788452100024185e-06, 'epoch': 0.8079470198675497, 'num_input_tokens_seen': 47972332, 'completed': '80.79% (732 / 906)', 'remaining time': '0:48:30', 'throughput': '1199.45', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:57:39,881 >> {'loss': 0.3313, 'grad_norm': 4.815901279449463, 'learning_rate': 1.8691101414497104e-06, 'epoch': 0.8090507726269316, 'num_input_tokens_seen': 48037868, 'completed': '80.91% (733 / 906)', 'remaining time': '0:48:12', 'throughput': '1193.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:57:53,572 >> {'loss': 0.4606, 'grad_norm': 5.200927734375, 'learning_rate': 1.8594235253127373e-06, 'epoch': 0.8101545253863135, 'num_input_tokens_seen': 48103404, 'completed': '81.02% (734 / 906)', 'remaining time': '0:47:55', 'throughput': '1196.68', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:58:07,213 >> {'loss': 0.2862, 'grad_norm': 4.116132736206055, 'learning_rate': 1.8497854908545632e-06, 'epoch': 0.8112582781456954, 'num_input_tokens_seen': 48168940, 'completed': '81.13% (735 / 906)', 'remaining time': '0:47:38', 'throughput': '1201.06', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:58:20,975 >> {'loss': 0.248, 'grad_norm': 4.191470623016357, 'learning_rate': 1.840196166689956e-06, 'epoch': 0.8123620309050773, 'num_input_tokens_seen': 48234476, 'completed': '81.24% (736 / 906)', 'remaining time': '0:47:20', 'throughput': '1190.56', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:58:34,747 >> {'loss': 0.3043, 'grad_norm': 4.818159103393555, 'learning_rate': 1.8306556807836673e-06, 'epoch': 0.8134657836644592, 'num_input_tokens_seen': 48300012, 'completed': '81.35% (737 / 906)', 'remaining time': '0:47:03', 'throughput': '1189.62', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:58:48,416 >> {'loss': 0.3915, 'grad_norm': 5.291959762573242, 'learning_rate': 1.8211641604487276e-06, 'epoch': 0.8145695364238411, 'num_input_tokens_seen': 48365548, 'completed': '81.46% (738 / 906)', 'remaining time': '0:46:45', 'throughput': '1198.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:59:02,195 >> {'loss': 0.2789, 'grad_norm': 3.9926469326019287, 'learning_rate': 1.811721732344745e-06, 'epoch': 0.8156732891832229, 'num_input_tokens_seen': 48431084, 'completed': '81.57% (739 / 906)', 'remaining time': '0:46:28', 'throughput': '1189.03', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:59:15,795 >> {'loss': 0.2372, 'grad_norm': 3.875976085662842, 'learning_rate': 1.8023285224762182e-06, 'epoch': 0.8167770419426048, 'num_input_tokens_seen': 48496620, 'completed': '81.68% (740 / 906)', 'remaining time': '0:46:11', 'throughput': '1204.70', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-04 23:59:29,346 >> {'loss': 0.3044, 'grad_norm': 4.194228649139404, 'learning_rate': 1.792984656190851e-06, 'epoch': 0.8178807947019867, 'num_input_tokens_seen': 48562156, 'completed': '81.79% (741 / 906)', 'remaining time': '0:45:53', 'throughput': '1209.08', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:59:42,835 >> {'loss': 0.2313, 'grad_norm': 3.9738593101501465, 'learning_rate': 1.7836902581778814e-06, 'epoch': 0.8189845474613686, 'num_input_tokens_seen': 48627688, 'completed': '81.90% (742 / 906)', 'remaining time': '0:45:36', 'throughput': '1214.54', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-04 23:59:56,338 >> {'loss': 0.3658, 'grad_norm': 5.357481479644775, 'learning_rate': 1.7744454524664206e-06, 'epoch': 0.8200883002207505, 'num_input_tokens_seen': 48693224, 'completed': '82.01% (743 / 906)', 'remaining time': '0:45:18', 'throughput': '1213.42', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:00:09,851 >> {'loss': 0.3554, 'grad_norm': 4.926805019378662, 'learning_rate': 1.7652503624237943e-06, 'epoch': 0.8211920529801324, 'num_input_tokens_seen': 48758760, 'completed': '82.12% (744 / 906)', 'remaining time': '0:45:01', 'throughput': '1212.44', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:00:23,306 >> {'loss': 0.3042, 'grad_norm': 4.7101149559021, 'learning_rate': 1.7561051107538957e-06, 'epoch': 0.8222958057395143, 'num_input_tokens_seen': 48824296, 'completed': '82.23% (745 / 906)', 'remaining time': '0:44:44', 'throughput': '1217.70', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:00:36,752 >> {'loss': 0.9089, 'grad_norm': 7.876413822174072, 'learning_rate': 1.7470098194955502e-06, 'epoch': 0.8233995584988962, 'num_input_tokens_seen': 48889832, 'completed': '82.34% (746 / 906)', 'remaining time': '0:44:26', 'throughput': '1218.49', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:00:50,186 >> {'loss': 0.7701, 'grad_norm': 7.759825706481934, 'learning_rate': 1.737964610020888e-06, 'epoch': 0.8245033112582781, 'num_input_tokens_seen': 48955368, 'completed': '82.45% (747 / 906)', 'remaining time': '0:44:09', 'throughput': '1219.55', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:01:03,588 >> {'loss': 0.473, 'grad_norm': 5.320918083190918, 'learning_rate': 1.7289696030337217e-06, 'epoch': 0.82560706401766, 'num_input_tokens_seen': 49020904, 'completed': '82.56% (748 / 906)', 'remaining time': '0:43:52', 'throughput': '1222.50', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:01:17,018 >> {'loss': 0.279, 'grad_norm': 4.480899810791016, 'learning_rate': 1.7200249185679373e-06, 'epoch': 0.826710816777042, 'num_input_tokens_seen': 49086440, 'completed': '82.67% (749 / 906)', 'remaining time': '0:43:34', 'throughput': '1219.98', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:01:30,514 >> {'loss': 0.2524, 'grad_norm': 3.9195127487182617, 'learning_rate': 1.7111306759858915e-06, 'epoch': 0.8278145695364238, 'num_input_tokens_seen': 49151976, 'completed': '82.78% (750 / 906)', 'remaining time': '0:43:17', 'throughput': '1214.03', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:01:43,990 >> {'loss': 0.3206, 'grad_norm': 4.489572525024414, 'learning_rate': 1.7022869939768189e-06, 'epoch': 0.8289183222958058, 'num_input_tokens_seen': 49217512, 'completed': '82.89% (751 / 906)', 'remaining time': '0:43:00', 'throughput': '1215.73', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:01:57,497 >> {'loss': 0.3519, 'grad_norm': 4.9796247482299805, 'learning_rate': 1.6934939905552483e-06, 'epoch': 0.8300220750551877, 'num_input_tokens_seen': 49283048, 'completed': '83.00% (752 / 906)', 'remaining time': '0:42:42', 'throughput': '1213.02', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:02:11,118 >> {'loss': 0.2959, 'grad_norm': 4.599273204803467, 'learning_rate': 1.6847517830594304e-06, 'epoch': 0.8311258278145696, 'num_input_tokens_seen': 49348584, 'completed': '83.11% (753 / 906)', 'remaining time': '0:42:25', 'throughput': '1202.81', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:02:24,618 >> {'loss': 0.3448, 'grad_norm': 4.5784502029418945, 'learning_rate': 1.676060488149765e-06, 'epoch': 0.8322295805739515, 'num_input_tokens_seen': 49414120, 'completed': '83.22% (754 / 906)', 'remaining time': '0:42:08', 'throughput': '1213.64', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:02:38,116 >> {'loss': 0.2993, 'grad_norm': 4.545161247253418, 'learning_rate': 1.6674202218072528e-06, 'epoch': 0.8333333333333334, 'num_input_tokens_seen': 49479656, 'completed': '83.33% (755 / 906)', 'remaining time': '0:41:51', 'throughput': '1213.83', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:02:51,647 >> {'loss': 0.3496, 'grad_norm': 5.198947429656982, 'learning_rate': 1.6588310993319411e-06, 'epoch': 0.8344370860927153, 'num_input_tokens_seen': 49545192, 'completed': '83.44% (756 / 906)', 'remaining time': '0:41:33', 'throughput': '1210.86', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:03:05,195 >> {'loss': 0.6534, 'grad_norm': 6.070314407348633, 'learning_rate': 1.6502932353413911e-06, 'epoch': 0.8355408388520972, 'num_input_tokens_seen': 49610728, 'completed': '83.55% (757 / 906)', 'remaining time': '0:41:16', 'throughput': '1209.30', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:03:18,759 >> {'loss': 0.2371, 'grad_norm': 3.886892557144165, 'learning_rate': 1.641806743769142e-06, 'epoch': 0.8366445916114791, 'num_input_tokens_seen': 49676264, 'completed': '83.66% (758 / 906)', 'remaining time': '0:40:59', 'throughput': '1207.92', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:03:32,320 >> {'loss': 0.3792, 'grad_norm': 4.761082649230957, 'learning_rate': 1.633371737863194e-06, 'epoch': 0.8377483443708609, 'num_input_tokens_seen': 49741800, 'completed': '83.77% (759 / 906)', 'remaining time': '0:40:42', 'throughput': '1208.17', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:03:45,868 >> {'loss': 0.2798, 'grad_norm': 4.743200778961182, 'learning_rate': 1.6249883301844977e-06, 'epoch': 0.8388520971302428, 'num_input_tokens_seen': 49807336, 'completed': '83.89% (760 / 906)', 'remaining time': '0:40:24', 'throughput': '1209.32', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:03:59,420 >> {'loss': 0.3047, 'grad_norm': 4.598114967346191, 'learning_rate': 1.616656632605451e-06, 'epoch': 0.8399558498896247, 'num_input_tokens_seen': 49872872, 'completed': '84.00% (761 / 906)', 'remaining time': '0:40:07', 'throughput': '1208.97', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:04:12,848 >> {'loss': 0.2891, 'grad_norm': 4.357061386108398, 'learning_rate': 1.6083767563084056e-06, 'epoch': 0.8410596026490066, 'num_input_tokens_seen': 49938408, 'completed': '84.11% (762 / 906)', 'remaining time': '0:39:50', 'throughput': '1220.15', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:04:26,256 >> {'loss': 0.3881, 'grad_norm': 5.4613938331604, 'learning_rate': 1.6001488117841855e-06, 'epoch': 0.8421633554083885, 'num_input_tokens_seen': 50003944, 'completed': '84.22% (763 / 906)', 'remaining time': '0:39:33', 'throughput': '1221.96', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:04:39,748 >> {'loss': 0.4353, 'grad_norm': 5.408591270446777, 'learning_rate': 1.5919729088306093e-06, 'epoch': 0.8432671081677704, 'num_input_tokens_seen': 50069480, 'completed': '84.33% (764 / 906)', 'remaining time': '0:39:16', 'throughput': '1214.38', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:04:53,205 >> {'loss': 0.3202, 'grad_norm': 4.4144697189331055, 'learning_rate': 1.5838491565510275e-06, 'epoch': 0.8443708609271523, 'num_input_tokens_seen': 50135016, 'completed': '84.44% (765 / 906)', 'remaining time': '0:38:59', 'throughput': '1217.49', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:05:06,744 >> {'loss': 0.3601, 'grad_norm': 4.777880668640137, 'learning_rate': 1.5757776633528654e-06, 'epoch': 0.8454746136865342, 'num_input_tokens_seen': 50200552, 'completed': '84.55% (766 / 906)', 'remaining time': '0:38:41', 'throughput': '1210.10', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:05:20,198 >> {'loss': 0.4695, 'grad_norm': 5.162795066833496, 'learning_rate': 1.5677585369461796e-06, 'epoch': 0.8465783664459161, 'num_input_tokens_seen': 50266088, 'completed': '84.66% (767 / 906)', 'remaining time': '0:38:24', 'throughput': '1217.83', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:05:33,632 >> {'loss': 0.4612, 'grad_norm': 4.954742431640625, 'learning_rate': 1.5597918843422132e-06, 'epoch': 0.847682119205298, 'num_input_tokens_seen': 50331624, 'completed': '84.77% (768 / 906)', 'remaining time': '0:38:07', 'throughput': '1219.54', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:05:47,125 >> {'loss': 0.3481, 'grad_norm': 4.867616653442383, 'learning_rate': 1.5518778118519754e-06, 'epoch': 0.8487858719646799, 'num_input_tokens_seen': 50397160, 'completed': '84.88% (769 / 906)', 'remaining time': '0:37:50', 'throughput': '1214.27', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:06:00,638 >> {'loss': 0.3699, 'grad_norm': 5.606635093688965, 'learning_rate': 1.5440164250848205e-06, 'epoch': 0.8498896247240618, 'num_input_tokens_seen': 50462696, 'completed': '84.99% (770 / 906)', 'remaining time': '0:37:33', 'throughput': '1212.48', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:06:14,264 >> {'loss': 0.2831, 'grad_norm': 4.597039699554443, 'learning_rate': 1.5362078289470369e-06, 'epoch': 0.8509933774834437, 'num_input_tokens_seen': 50528232, 'completed': '85.10% (771 / 906)', 'remaining time': '0:37:16', 'throughput': '1202.37', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:06:27,729 >> {'loss': 0.3296, 'grad_norm': 4.647115707397461, 'learning_rate': 1.5284521276404498e-06, 'epoch': 0.8520971302428256, 'num_input_tokens_seen': 50593768, 'completed': '85.21% (772 / 906)', 'remaining time': '0:36:59', 'throughput': '1216.84', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:06:41,184 >> {'loss': 0.8206, 'grad_norm': 7.4029083251953125, 'learning_rate': 1.520749424661026e-06, 'epoch': 0.8532008830022075, 'num_input_tokens_seen': 50659304, 'completed': '85.32% (773 / 906)', 'remaining time': '0:36:42', 'throughput': '1217.67', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:06:54,664 >> {'loss': 0.5007, 'grad_norm': 5.793678283691406, 'learning_rate': 1.513099822797498e-06, 'epoch': 0.8543046357615894, 'num_input_tokens_seen': 50724840, 'completed': '85.43% (774 / 906)', 'remaining time': '0:36:24', 'throughput': '1215.45', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:07:08,195 >> {'loss': 0.3584, 'grad_norm': 5.196356296539307, 'learning_rate': 1.5055034241299933e-06, 'epoch': 0.8554083885209713, 'num_input_tokens_seen': 50790376, 'completed': '85.54% (775 / 906)', 'remaining time': '0:36:07', 'throughput': '1210.82', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:07:21,702 >> {'loss': 0.2379, 'grad_norm': 3.680781841278076, 'learning_rate': 1.4979603300286655e-06, 'epoch': 0.8565121412803532, 'num_input_tokens_seen': 50855912, 'completed': '85.65% (776 / 906)', 'remaining time': '0:35:50', 'throughput': '1213.01', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:07:35,299 >> {'loss': 0.4034, 'grad_norm': 5.659000873565674, 'learning_rate': 1.490470641152345e-06, 'epoch': 0.8576158940397351, 'num_input_tokens_seen': 50921448, 'completed': '85.76% (777 / 906)', 'remaining time': '0:35:33', 'throughput': '1204.98', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:07:48,884 >> {'loss': 0.3384, 'grad_norm': 4.814398288726807, 'learning_rate': 1.4830344574471986e-06, 'epoch': 0.8587196467991169, 'num_input_tokens_seen': 50986984, 'completed': '85.87% (778 / 906)', 'remaining time': '0:35:16', 'throughput': '1206.07', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:08:02,402 >> {'loss': 0.2315, 'grad_norm': 3.8748183250427246, 'learning_rate': 1.475651878145391e-06, 'epoch': 0.8598233995584988, 'num_input_tokens_seen': 51052520, 'completed': '85.98% (779 / 906)', 'remaining time': '0:34:59', 'throughput': '1211.97', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:08:15,902 >> {'loss': 0.39, 'grad_norm': 4.726593017578125, 'learning_rate': 1.4683230017637653e-06, 'epoch': 0.8609271523178808, 'num_input_tokens_seen': 51118056, 'completed': '86.09% (780 / 906)', 'remaining time': '0:34:42', 'throughput': '1213.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:08:29,469 >> {'loss': 0.3703, 'grad_norm': 4.92621374130249, 'learning_rate': 1.4610479261025247e-06, 'epoch': 0.8620309050772627, 'num_input_tokens_seen': 51183592, 'completed': '86.20% (781 / 906)', 'remaining time': '0:34:25', 'throughput': '1207.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:08:43,088 >> {'loss': 0.1459, 'grad_norm': 2.963569164276123, 'learning_rate': 1.4538267482439264e-06, 'epoch': 0.8631346578366446, 'num_input_tokens_seen': 51249128, 'completed': '86.31% (782 / 906)', 'remaining time': '0:34:08', 'throughput': '1203.00', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:08:56,619 >> {'loss': 0.2845, 'grad_norm': 4.367040634155273, 'learning_rate': 1.4466595645509917e-06, 'epoch': 0.8642384105960265, 'num_input_tokens_seen': 51314664, 'completed': '86.42% (783 / 906)', 'remaining time': '0:33:51', 'throughput': '1210.90', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:09:10,118 >> {'loss': 0.3871, 'grad_norm': 6.132776260375977, 'learning_rate': 1.4395464706662155e-06, 'epoch': 0.8653421633554084, 'num_input_tokens_seen': 51380200, 'completed': '86.53% (784 / 906)', 'remaining time': '0:33:34', 'throughput': '1213.75', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:09:23,782 >> {'loss': 0.2819, 'grad_norm': 4.9763593673706055, 'learning_rate': 1.4324875615102896e-06, 'epoch': 0.8664459161147903, 'num_input_tokens_seen': 51445736, 'completed': '86.64% (785 / 906)', 'remaining time': '0:33:17', 'throughput': '1199.06', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:09:37,410 >> {'loss': 0.2911, 'grad_norm': 4.710178375244141, 'learning_rate': 1.4254829312808405e-06, 'epoch': 0.8675496688741722, 'num_input_tokens_seen': 51511272, 'completed': '86.75% (786 / 906)', 'remaining time': '0:33:00', 'throughput': '1202.23', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:09:50,979 >> {'loss': 0.3404, 'grad_norm': 5.016818523406982, 'learning_rate': 1.4185326734511667e-06, 'epoch': 0.8686534216335541, 'num_input_tokens_seen': 51576808, 'completed': '86.87% (787 / 906)', 'remaining time': '0:32:43', 'throughput': '1207.42', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:10:04,668 >> {'loss': 0.255, 'grad_norm': 4.105848789215088, 'learning_rate': 1.4116368807689968e-06, 'epoch': 0.869757174392936, 'num_input_tokens_seen': 51642344, 'completed': '86.98% (788 / 906)', 'remaining time': '0:32:26', 'throughput': '1196.91', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:10:18,258 >> {'loss': 0.2043, 'grad_norm': 3.7043731212615967, 'learning_rate': 1.4047956452552458e-06, 'epoch': 0.8708609271523179, 'num_input_tokens_seen': 51707880, 'completed': '87.09% (789 / 906)', 'remaining time': '0:32:10', 'throughput': '1205.55', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:10:31,638 >> {'loss': 0.302, 'grad_norm': 4.634637832641602, 'learning_rate': 1.3980090582027943e-06, 'epoch': 0.8719646799116998, 'num_input_tokens_seen': 51773416, 'completed': '87.20% (790 / 906)', 'remaining time': '0:31:53', 'throughput': '1224.53', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:10:44,954 >> {'loss': 0.3963, 'grad_norm': 5.685799598693848, 'learning_rate': 1.3912772101752628e-06, 'epoch': 0.8730684326710817, 'num_input_tokens_seen': 51838952, 'completed': '87.31% (791 / 906)', 'remaining time': '0:31:36', 'throughput': '1230.38', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:10:58,533 >> {'loss': 0.4956, 'grad_norm': 6.1849212646484375, 'learning_rate': 1.384600191005809e-06, 'epoch': 0.8741721854304636, 'num_input_tokens_seen': 51904488, 'completed': '87.42% (792 / 906)', 'remaining time': '0:31:19', 'throughput': '1206.54', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:11:12,039 >> {'loss': 0.3066, 'grad_norm': 4.928698539733887, 'learning_rate': 1.3779780897959266e-06, 'epoch': 0.8752759381898455, 'num_input_tokens_seen': 51970024, 'completed': '87.53% (793 / 906)', 'remaining time': '0:31:02', 'throughput': '1213.15', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:11:25,742 >> {'loss': 0.2387, 'grad_norm': 4.356123447418213, 'learning_rate': 1.3714109949142568e-06, 'epoch': 0.8763796909492274, 'num_input_tokens_seen': 52035560, 'completed': '87.64% (794 / 906)', 'remaining time': '0:30:45', 'throughput': '1195.63', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:11:39,289 >> {'loss': 0.1518, 'grad_norm': 3.3618569374084473, 'learning_rate': 1.3648989939954065e-06, 'epoch': 0.8774834437086093, 'num_input_tokens_seen': 52101096, 'completed': '87.75% (795 / 906)', 'remaining time': '0:30:28', 'throughput': '1209.41', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:11:52,856 >> {'loss': 0.3637, 'grad_norm': 5.220973968505859, 'learning_rate': 1.3584421739387831e-06, 'epoch': 0.8785871964679912, 'num_input_tokens_seen': 52166632, 'completed': '87.86% (796 / 906)', 'remaining time': '0:30:11', 'throughput': '1207.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:12:06,438 >> {'loss': 0.1974, 'grad_norm': 3.5688204765319824, 'learning_rate': 1.3520406209074333e-06, 'epoch': 0.8796909492273731, 'num_input_tokens_seen': 52232168, 'completed': '87.97% (797 / 906)', 'remaining time': '0:29:54', 'throughput': '1206.29', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:12:19,913 >> {'loss': 0.3774, 'grad_norm': 5.112353324890137, 'learning_rate': 1.3456944203268918e-06, 'epoch': 0.8807947019867549, 'num_input_tokens_seen': 52297704, 'completed': '88.08% (798 / 906)', 'remaining time': '0:29:37', 'throughput': '1215.92', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:12:33,401 >> {'loss': 0.1356, 'grad_norm': 2.9946765899658203, 'learning_rate': 1.3394036568840423e-06, 'epoch': 0.8818984547461368, 'num_input_tokens_seen': 52363240, 'completed': '88.19% (799 / 906)', 'remaining time': '0:29:21', 'throughput': '1214.63', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:12:46,752 >> {'loss': 0.5675, 'grad_norm': 6.761628150939941, 'learning_rate': 1.3331684145259897e-06, 'epoch': 0.8830022075055187, 'num_input_tokens_seen': 52428776, 'completed': '88.30% (800 / 906)', 'remaining time': '0:29:04', 'throughput': '1227.26', 'gpu_mem_free': '30139MB'}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[INFO|trainer.py:3503] 2025-01-05 00:13:12,893 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-800
[INFO|configuration_utils.py:472] 2025-01-05 00:13:12,896 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-800/config.json
[INFO|configuration_utils.py:807] 2025-01-05 00:13:12,898 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-800/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-05 00:14:09,445 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-800/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-05 00:14:09,449 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-800/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-05 00:14:09,449 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-800/special_tokens_map.json
[WARNING|trainer.py:869] 2025-01-05 00:18:04,261 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 3200, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-05 00:18:18,197 >> {'loss': 0.1852, 'grad_norm': 3.472043991088867, 'learning_rate': 1.3269887764589338e-06, 'epoch': 0.8841059602649006, 'num_input_tokens_seen': 52494312, 'completed': '88.41% (801 / 906)', 'remaining time': '0:29:29', 'throughput': '49.43', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:18:31,554 >> {'loss': 0.2197, 'grad_norm': 4.0303497314453125, 'learning_rate': 1.3208648251470662e-06, 'epoch': 0.8852097130242825, 'num_input_tokens_seen': 52559848, 'completed': '88.52% (802 / 906)', 'remaining time': '0:29:11', 'throughput': '1226.62', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:18:44,785 >> {'loss': 0.5266, 'grad_norm': 6.038300514221191, 'learning_rate': 1.314796642311465e-06, 'epoch': 0.8863134657836644, 'num_input_tokens_seen': 52625384, 'completed': '88.63% (803 / 906)', 'remaining time': '0:28:54', 'throughput': '1238.25', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:18:58,192 >> {'loss': 0.2284, 'grad_norm': 4.095331192016602, 'learning_rate': 1.3087843089290072e-06, 'epoch': 0.8874172185430463, 'num_input_tokens_seen': 52690920, 'completed': '88.74% (804 / 906)', 'remaining time': '0:28:37', 'throughput': '1222.05', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:19:11,573 >> {'loss': 0.1985, 'grad_norm': 3.6835544109344482, 'learning_rate': 1.3028279052312836e-06, 'epoch': 0.8885209713024282, 'num_input_tokens_seen': 52756456, 'completed': '88.85% (805 / 906)', 'remaining time': '0:28:19', 'throughput': '1224.45', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:19:24,724 >> {'loss': 0.5688, 'grad_norm': 6.688685417175293, 'learning_rate': 1.2969275107035344e-06, 'epoch': 0.8896247240618101, 'num_input_tokens_seen': 52821992, 'completed': '88.96% (806 / 906)', 'remaining time': '0:28:02', 'throughput': '1245.88', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:19:38,090 >> {'loss': 0.2624, 'grad_norm': 4.240151882171631, 'learning_rate': 1.291083204083584e-06, 'epoch': 0.890728476821192, 'num_input_tokens_seen': 52887528, 'completed': '89.07% (807 / 906)', 'remaining time': '0:27:45', 'throughput': '1225.79', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:19:51,356 >> {'loss': 0.2818, 'grad_norm': 4.761137008666992, 'learning_rate': 1.2852950633607922e-06, 'epoch': 0.891832229580574, 'num_input_tokens_seen': 52953064, 'completed': '89.18% (808 / 906)', 'remaining time': '0:27:28', 'throughput': '1235.05', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:20:04,609 >> {'loss': 0.4492, 'grad_norm': 5.820546627044678, 'learning_rate': 1.2795631657750113e-06, 'epoch': 0.8929359823399559, 'num_input_tokens_seen': 53018600, 'completed': '89.29% (809 / 906)', 'remaining time': '0:27:10', 'throughput': '1236.16', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:20:18,150 >> {'loss': 0.1956, 'grad_norm': 3.614879846572876, 'learning_rate': 1.2738875878155593e-06, 'epoch': 0.8940397350993378, 'num_input_tokens_seen': 53084136, 'completed': '89.40% (810 / 906)', 'remaining time': '0:26:53', 'throughput': '1209.97', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:20:31,528 >> {'loss': 0.5272, 'grad_norm': 6.228170394897461, 'learning_rate': 1.268268405220195e-06, 'epoch': 0.8951434878587197, 'num_input_tokens_seen': 53149672, 'completed': '89.51% (811 / 906)', 'remaining time': '0:26:36', 'throughput': '1224.75', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:20:44,921 >> {'loss': 0.3557, 'grad_norm': 5.280664920806885, 'learning_rate': 1.2627056929741096e-06, 'epoch': 0.8962472406181016, 'num_input_tokens_seen': 53215208, 'completed': '89.62% (812 / 906)', 'remaining time': '0:26:19', 'throughput': '1223.29', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:20:58,471 >> {'loss': 0.2463, 'grad_norm': 4.466437816619873, 'learning_rate': 1.257199525308927e-06, 'epoch': 0.8973509933774835, 'num_input_tokens_seen': 53280744, 'completed': '89.74% (813 / 906)', 'remaining time': '0:26:02', 'throughput': '1209.11', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:21:11,985 >> {'loss': 0.342, 'grad_norm': 4.941433429718018, 'learning_rate': 1.2517499757017098e-06, 'epoch': 0.8984547461368654, 'num_input_tokens_seen': 53346280, 'completed': '89.85% (814 / 906)', 'remaining time': '0:25:44', 'throughput': '1212.45', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:21:25,541 >> {'loss': 0.2186, 'grad_norm': 3.8996658325195312, 'learning_rate': 1.2463571168739825e-06, 'epoch': 0.8995584988962473, 'num_input_tokens_seen': 53411816, 'completed': '89.96% (815 / 906)', 'remaining time': '0:25:27', 'throughput': '1208.61', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:21:38,992 >> {'loss': 0.4572, 'grad_norm': 5.6412811279296875, 'learning_rate': 1.2410210207907579e-06, 'epoch': 0.9006622516556292, 'num_input_tokens_seen': 53477352, 'completed': '90.07% (816 / 906)', 'remaining time': '0:25:10', 'throughput': '1218.05', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:21:52,463 >> {'loss': 0.2967, 'grad_norm': 4.444023132324219, 'learning_rate': 1.2357417586595803e-06, 'epoch': 0.9017660044150111, 'num_input_tokens_seen': 53542888, 'completed': '90.18% (817 / 906)', 'remaining time': '0:24:53', 'throughput': '1216.21', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:22:05,918 >> {'loss': 0.3782, 'grad_norm': 5.247828483581543, 'learning_rate': 1.23051940092957e-06, 'epoch': 0.9028697571743929, 'num_input_tokens_seen': 53608424, 'completed': '90.29% (818 / 906)', 'remaining time': '0:24:36', 'throughput': '1217.69', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:22:19,263 >> {'loss': 0.2586, 'grad_norm': 4.472630977630615, 'learning_rate': 1.2253540172904894e-06, 'epoch': 0.9039735099337748, 'num_input_tokens_seen': 53673960, 'completed': '90.40% (819 / 906)', 'remaining time': '0:24:19', 'throughput': '1227.70', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:22:32,627 >> {'loss': 0.4628, 'grad_norm': 6.463611602783203, 'learning_rate': 1.2202456766718092e-06, 'epoch': 0.9050772626931567, 'num_input_tokens_seen': 53739496, 'completed': '90.51% (820 / 906)', 'remaining time': '0:24:02', 'throughput': '1225.98', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:22:45,965 >> {'loss': 0.2325, 'grad_norm': 4.195343017578125, 'learning_rate': 1.2151944472417888e-06, 'epoch': 0.9061810154525386, 'num_input_tokens_seen': 53805032, 'completed': '90.62% (821 / 906)', 'remaining time': '0:23:44', 'throughput': '1228.35', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:22:59,380 >> {'loss': 0.351, 'grad_norm': 4.714290618896484, 'learning_rate': 1.2102003964065693e-06, 'epoch': 0.9072847682119205, 'num_input_tokens_seen': 53870568, 'completed': '90.73% (822 / 906)', 'remaining time': '0:23:27', 'throughput': '1221.33', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:23:12,835 >> {'loss': 0.3268, 'grad_norm': 5.385950565338135, 'learning_rate': 1.205263590809268e-06, 'epoch': 0.9083885209713024, 'num_input_tokens_seen': 53936104, 'completed': '90.84% (823 / 906)', 'remaining time': '0:23:10', 'throughput': '1217.69', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:23:26,239 >> {'loss': 0.4413, 'grad_norm': 5.244440078735352, 'learning_rate': 1.200384096329096e-06, 'epoch': 0.9094922737306843, 'num_input_tokens_seen': 54001640, 'completed': '90.95% (824 / 906)', 'remaining time': '0:22:53', 'throughput': '1222.33', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:23:39,569 >> {'loss': 0.4869, 'grad_norm': 6.079404354095459, 'learning_rate': 1.1955619780804757e-06, 'epoch': 0.9105960264900662, 'num_input_tokens_seen': 54067176, 'completed': '91.06% (825 / 906)', 'remaining time': '0:22:36', 'throughput': '1229.14', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:23:52,996 >> {'loss': 0.3709, 'grad_norm': 4.891379356384277, 'learning_rate': 1.190797300412174e-06, 'epoch': 0.9116997792494481, 'num_input_tokens_seen': 54132712, 'completed': '91.17% (826 / 906)', 'remaining time': '0:22:19', 'throughput': '1220.24', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:24:06,480 >> {'loss': 0.2608, 'grad_norm': 4.037010192871094, 'learning_rate': 1.1860901269064366e-06, 'epoch': 0.91280353200883, 'num_input_tokens_seen': 54198248, 'completed': '91.28% (827 / 906)', 'remaining time': '0:22:02', 'throughput': '1215.03', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:24:19,975 >> {'loss': 0.1963, 'grad_norm': 3.740389347076416, 'learning_rate': 1.1814405203781503e-06, 'epoch': 0.9139072847682119, 'num_input_tokens_seen': 54263784, 'completed': '91.39% (828 / 906)', 'remaining time': '0:21:45', 'throughput': '1214.13', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:24:33,479 >> {'loss': 0.2375, 'grad_norm': 4.0328779220581055, 'learning_rate': 1.1768485428739963e-06, 'epoch': 0.9150110375275938, 'num_input_tokens_seen': 54329320, 'completed': '91.50% (829 / 906)', 'remaining time': '0:21:28', 'throughput': '1213.24', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:24:46,949 >> {'loss': 0.5289, 'grad_norm': 6.169814109802246, 'learning_rate': 1.1723142556716265e-06, 'epoch': 0.9161147902869757, 'num_input_tokens_seen': 54394856, 'completed': '91.61% (830 / 906)', 'remaining time': '0:21:11', 'throughput': '1216.30', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:25:00,438 >> {'loss': 0.2342, 'grad_norm': 4.388846397399902, 'learning_rate': 1.167837719278844e-06, 'epoch': 0.9172185430463576, 'num_input_tokens_seen': 54460392, 'completed': '91.72% (831 / 906)', 'remaining time': '0:20:54', 'throughput': '1214.61', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:25:13,994 >> {'loss': 0.3133, 'grad_norm': 4.988517761230469, 'learning_rate': 1.1634189934327954e-06, 'epoch': 0.9183222958057395, 'num_input_tokens_seen': 54525928, 'completed': '91.83% (832 / 906)', 'remaining time': '0:20:37', 'throughput': '1208.67', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:25:27,512 >> {'loss': 0.3411, 'grad_norm': 5.121459484100342, 'learning_rate': 1.1590581370991758e-06, 'epoch': 0.9194260485651214, 'num_input_tokens_seen': 54591464, 'completed': '91.94% (833 / 906)', 'remaining time': '0:20:20', 'throughput': '1212.03', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:25:41,185 >> {'loss': 0.2751, 'grad_norm': 4.2386345863342285, 'learning_rate': 1.1547552084714394e-06, 'epoch': 0.9205298013245033, 'num_input_tokens_seen': 54657000, 'completed': '92.05% (834 / 906)', 'remaining time': '0:20:03', 'throughput': '1198.28', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:25:54,637 >> {'loss': 0.2063, 'grad_norm': 3.978631019592285, 'learning_rate': 1.1505102649700243e-06, 'epoch': 0.9216335540838853, 'num_input_tokens_seen': 54722536, 'completed': '92.16% (835 / 906)', 'remaining time': '0:19:46', 'throughput': '1217.90', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:26:08,114 >> {'loss': 0.4821, 'grad_norm': 5.66132926940918, 'learning_rate': 1.1463233632415866e-06, 'epoch': 0.9227373068432672, 'num_input_tokens_seen': 54788072, 'completed': '92.27% (836 / 906)', 'remaining time': '0:19:29', 'throughput': '1215.70', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:26:21,605 >> {'loss': 0.4443, 'grad_norm': 5.7866692543029785, 'learning_rate': 1.1421945591582428e-06, 'epoch': 0.9238410596026491, 'num_input_tokens_seen': 54853608, 'completed': '92.38% (837 / 906)', 'remaining time': '0:19:12', 'throughput': '1214.46', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:26:35,078 >> {'loss': 0.4269, 'grad_norm': 5.27299690246582, 'learning_rate': 1.1381239078168262e-06, 'epoch': 0.9249448123620309, 'num_input_tokens_seen': 54919144, 'completed': '92.49% (838 / 906)', 'remaining time': '0:18:55', 'throughput': '1216.04', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:26:48,662 >> {'loss': 0.3032, 'grad_norm': 4.533444404602051, 'learning_rate': 1.1341114635381506e-06, 'epoch': 0.9260485651214128, 'num_input_tokens_seen': 54984680, 'completed': '92.60% (839 / 906)', 'remaining time': '0:18:38', 'throughput': '1206.19', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:27:02,121 >> {'loss': 0.6928, 'grad_norm': 7.473748207092285, 'learning_rate': 1.1301572798662849e-06, 'epoch': 0.9271523178807947, 'num_input_tokens_seen': 55050216, 'completed': '92.72% (840 / 906)', 'remaining time': '0:18:21', 'throughput': '1217.31', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:27:15,573 >> {'loss': 0.303, 'grad_norm': 4.378798007965088, 'learning_rate': 1.1262614095678395e-06, 'epoch': 0.9282560706401766, 'num_input_tokens_seen': 55115752, 'completed': '92.83% (841 / 906)', 'remaining time': '0:18:04', 'throughput': '1217.96', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:27:29,045 >> {'loss': 0.5211, 'grad_norm': 5.152125835418701, 'learning_rate': 1.1224239046312627e-06, 'epoch': 0.9293598233995585, 'num_input_tokens_seen': 55181288, 'completed': '92.94% (842 / 906)', 'remaining time': '0:17:47', 'throughput': '1216.11', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:27:42,690 >> {'loss': 0.2154, 'grad_norm': 4.320618152618408, 'learning_rate': 1.1186448162661444e-06, 'epoch': 0.9304635761589404, 'num_input_tokens_seen': 55246824, 'completed': '93.05% (843 / 906)', 'remaining time': '0:17:30', 'throughput': '1200.75', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:27:56,179 >> {'loss': 0.2749, 'grad_norm': 5.254551887512207, 'learning_rate': 1.1149241949025365e-06, 'epoch': 0.9315673289183223, 'num_input_tokens_seen': 55312360, 'completed': '93.16% (844 / 906)', 'remaining time': '0:17:13', 'throughput': '1214.66', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:28:09,654 >> {'loss': 0.4807, 'grad_norm': 5.376104354858398, 'learning_rate': 1.1112620901902775e-06, 'epoch': 0.9326710816777042, 'num_input_tokens_seen': 55377896, 'completed': '93.27% (845 / 906)', 'remaining time': '0:16:56', 'throughput': '1215.83', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:28:23,244 >> {'loss': 0.2671, 'grad_norm': 3.8265469074249268, 'learning_rate': 1.1076585509983285e-06, 'epoch': 0.9337748344370861, 'num_input_tokens_seen': 55443432, 'completed': '93.38% (846 / 906)', 'remaining time': '0:16:40', 'throughput': '1205.66', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:28:36,812 >> {'loss': 0.2879, 'grad_norm': 4.158047199249268, 'learning_rate': 1.104113625414124e-06, 'epoch': 0.934878587196468, 'num_input_tokens_seen': 55508968, 'completed': '93.49% (847 / 906)', 'remaining time': '0:16:23', 'throughput': '1207.51', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:28:50,183 >> {'loss': 0.4776, 'grad_norm': 5.733949661254883, 'learning_rate': 1.1006273607429305e-06, 'epoch': 0.9359823399558499, 'num_input_tokens_seen': 55574504, 'completed': '93.60% (848 / 906)', 'remaining time': '0:16:06', 'throughput': '1225.33', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:29:03,654 >> {'loss': 0.4113, 'grad_norm': 5.860391616821289, 'learning_rate': 1.0971998035072123e-06, 'epoch': 0.9370860927152318, 'num_input_tokens_seen': 55640040, 'completed': '93.71% (849 / 906)', 'remaining time': '0:15:49', 'throughput': '1216.27', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:29:17,171 >> {'loss': 0.3053, 'grad_norm': 4.877603530883789, 'learning_rate': 1.0938309994460127e-06, 'epoch': 0.9381898454746137, 'num_input_tokens_seen': 55705576, 'completed': '93.82% (850 / 906)', 'remaining time': '0:15:32', 'throughput': '1212.05', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:29:30,560 >> {'loss': 0.2688, 'grad_norm': 4.361584663391113, 'learning_rate': 1.090520993514343e-06, 'epoch': 0.9392935982339956, 'num_input_tokens_seen': 55771112, 'completed': '93.93% (851 / 906)', 'remaining time': '0:15:15', 'throughput': '1223.73', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:29:44,003 >> {'loss': 0.2407, 'grad_norm': 4.158669471740723, 'learning_rate': 1.0872698298825822e-06, 'epoch': 0.9403973509933775, 'num_input_tokens_seen': 55836648, 'completed': '94.04% (852 / 906)', 'remaining time': '0:14:58', 'throughput': '1218.76', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:29:57,601 >> {'loss': 0.1298, 'grad_norm': 3.3524842262268066, 'learning_rate': 1.08407755193589e-06, 'epoch': 0.9415011037527594, 'num_input_tokens_seen': 55902184, 'completed': '94.15% (853 / 906)', 'remaining time': '0:14:41', 'throughput': '1204.87', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:30:11,063 >> {'loss': 0.5958, 'grad_norm': 7.272517204284668, 'learning_rate': 1.0809442022736238e-06, 'epoch': 0.9426048565121413, 'num_input_tokens_seen': 55967720, 'completed': '94.26% (854 / 906)', 'remaining time': '0:14:25', 'throughput': '1217.06', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:30:24,502 >> {'loss': 0.2352, 'grad_norm': 4.141610622406006, 'learning_rate': 1.0778698227087736e-06, 'epoch': 0.9437086092715232, 'num_input_tokens_seen': 56033256, 'completed': '94.37% (855 / 906)', 'remaining time': '0:14:08', 'throughput': '1219.21', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:30:37,922 >> {'loss': 0.5763, 'grad_norm': 6.14682674407959, 'learning_rate': 1.0748544542674028e-06, 'epoch': 0.9448123620309051, 'num_input_tokens_seen': 56098792, 'completed': '94.48% (856 / 906)', 'remaining time': '0:13:51', 'throughput': '1220.81', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:30:51,499 >> {'loss': 0.2543, 'grad_norm': 4.349286079406738, 'learning_rate': 1.0718981371881004e-06, 'epoch': 0.9459161147902869, 'num_input_tokens_seen': 56164328, 'completed': '94.59% (857 / 906)', 'remaining time': '0:13:34', 'throughput': '1206.73', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:31:05,107 >> {'loss': 0.3312, 'grad_norm': 5.373385906219482, 'learning_rate': 1.0690009109214443e-06, 'epoch': 0.9470198675496688, 'num_input_tokens_seen': 56229864, 'completed': '94.70% (858 / 906)', 'remaining time': '0:13:17', 'throughput': '1204.01', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:31:18,550 >> {'loss': 0.4741, 'grad_norm': 5.748344421386719, 'learning_rate': 1.0661628141294758e-06, 'epoch': 0.9481236203090507, 'num_input_tokens_seen': 56295400, 'completed': '94.81% (859 / 906)', 'remaining time': '0:13:01', 'throughput': '1218.80', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:31:32,117 >> {'loss': 0.4531, 'grad_norm': 5.545693397521973, 'learning_rate': 1.0633838846851817e-06, 'epoch': 0.9492273730684326, 'num_input_tokens_seen': 56360936, 'completed': '94.92% (860 / 906)', 'remaining time': '0:12:44', 'throughput': '1207.65', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:31:45,788 >> {'loss': 0.3835, 'grad_norm': 5.274410247802734, 'learning_rate': 1.0606641596719908e-06, 'epoch': 0.9503311258278145, 'num_input_tokens_seen': 56426472, 'completed': '95.03% (861 / 906)', 'remaining time': '0:12:27', 'throughput': '1198.45', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:31:59,416 >> {'loss': 0.2417, 'grad_norm': 4.2041826248168945, 'learning_rate': 1.0580036753832781e-06, 'epoch': 0.9514348785871964, 'num_input_tokens_seen': 56492008, 'completed': '95.14% (862 / 906)', 'remaining time': '0:12:10', 'throughput': '1202.19', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:32:12,812 >> {'loss': 0.4734, 'grad_norm': 5.932213306427002, 'learning_rate': 1.0554024673218808e-06, 'epoch': 0.9525386313465783, 'num_input_tokens_seen': 56557544, 'completed': '95.25% (863 / 906)', 'remaining time': '0:11:54', 'throughput': '1223.12', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:32:26,278 >> {'loss': 0.5387, 'grad_norm': 5.501502990722656, 'learning_rate': 1.0528605701996232e-06, 'epoch': 0.9536423841059603, 'num_input_tokens_seen': 56623080, 'completed': '95.36% (864 / 906)', 'remaining time': '0:11:37', 'throughput': '1216.64', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:32:39,832 >> {'loss': 0.3111, 'grad_norm': 4.631185054779053, 'learning_rate': 1.0503780179368569e-06, 'epoch': 0.9547461368653422, 'num_input_tokens_seen': 56688616, 'completed': '95.47% (865 / 906)', 'remaining time': '0:11:20', 'throughput': '1208.84', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:32:53,332 >> {'loss': 0.4427, 'grad_norm': 5.278607368469238, 'learning_rate': 1.047954843662004e-06, 'epoch': 0.9558498896247241, 'num_input_tokens_seen': 56754152, 'completed': '95.58% (866 / 906)', 'remaining time': '0:11:03', 'throughput': '1213.57', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:33:06,881 >> {'loss': 0.1917, 'grad_norm': 3.5461151599884033, 'learning_rate': 1.0455910797111182e-06, 'epoch': 0.956953642384106, 'num_input_tokens_seen': 56819688, 'completed': '95.70% (867 / 906)', 'remaining time': '0:10:47', 'throughput': '1209.22', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:33:20,384 >> {'loss': 0.1873, 'grad_norm': 3.519580125808716, 'learning_rate': 1.043286757627451e-06, 'epoch': 0.9580573951434879, 'num_input_tokens_seen': 56885224, 'completed': '95.81% (868 / 906)', 'remaining time': '0:10:30', 'throughput': '1213.37', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:33:33,845 >> {'loss': 0.5259, 'grad_norm': 6.359206199645996, 'learning_rate': 1.0410419081610324e-06, 'epoch': 0.9591611479028698, 'num_input_tokens_seen': 56950760, 'completed': '95.92% (869 / 906)', 'remaining time': '0:10:13', 'throughput': '1217.17', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:33:47,349 >> {'loss': 0.3572, 'grad_norm': 5.1878557205200195, 'learning_rate': 1.0388565612682591e-06, 'epoch': 0.9602649006622517, 'num_input_tokens_seen': 57016296, 'completed': '96.03% (870 / 906)', 'remaining time': '0:09:56', 'throughput': '1213.28', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:34:00,719 >> {'loss': 0.2851, 'grad_norm': 5.292546272277832, 'learning_rate': 1.0367307461114976e-06, 'epoch': 0.9613686534216336, 'num_input_tokens_seen': 57081832, 'completed': '96.14% (871 / 906)', 'remaining time': '0:09:40', 'throughput': '1225.47', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:34:14,187 >> {'loss': 0.2781, 'grad_norm': 4.41323709487915, 'learning_rate': 1.0346644910586912e-06, 'epoch': 0.9624724061810155, 'num_input_tokens_seen': 57147368, 'completed': '96.25% (872 / 906)', 'remaining time': '0:09:23', 'throughput': '1216.46', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:34:27,725 >> {'loss': 0.378, 'grad_norm': 5.156498908996582, 'learning_rate': 1.0326578236829837e-06, 'epoch': 0.9635761589403974, 'num_input_tokens_seen': 57212904, 'completed': '96.36% (873 / 906)', 'remaining time': '0:09:06', 'throughput': '1210.27', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:34:41,194 >> {'loss': 0.2814, 'grad_norm': 4.593717575073242, 'learning_rate': 1.0307107707623509e-06, 'epoch': 0.9646799116997793, 'num_input_tokens_seen': 57278440, 'completed': '96.47% (874 / 906)', 'remaining time': '0:08:50', 'throughput': '1216.43', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:34:54,631 >> {'loss': 0.3716, 'grad_norm': 4.606041431427002, 'learning_rate': 1.0288233582792448e-06, 'epoch': 0.9657836644591612, 'num_input_tokens_seen': 57343976, 'completed': '96.58% (875 / 906)', 'remaining time': '0:08:33', 'throughput': '1219.25', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:35:08,044 >> {'loss': 0.3239, 'grad_norm': 4.7908124923706055, 'learning_rate': 1.0269956114202435e-06, 'epoch': 0.9668874172185431, 'num_input_tokens_seen': 57409512, 'completed': '96.69% (876 / 906)', 'remaining time': '0:08:16', 'throughput': '1221.55', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:35:21,386 >> {'loss': 0.4617, 'grad_norm': 5.706357002258301, 'learning_rate': 1.0252275545757185e-06, 'epoch': 0.9679911699779249, 'num_input_tokens_seen': 57475048, 'completed': '96.80% (877 / 906)', 'remaining time': '0:08:00', 'throughput': '1227.97', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:35:34,867 >> {'loss': 0.5561, 'grad_norm': 6.367629051208496, 'learning_rate': 1.0235192113395068e-06, 'epoch': 0.9690949227373068, 'num_input_tokens_seen': 57540584, 'completed': '96.91% (878 / 906)', 'remaining time': '0:07:43', 'throughput': '1215.32', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:35:48,333 >> {'loss': 0.2952, 'grad_norm': 4.593899250030518, 'learning_rate': 1.0218706045085982e-06, 'epoch': 0.9701986754966887, 'num_input_tokens_seen': 57606120, 'completed': '97.02% (879 / 906)', 'remaining time': '0:07:26', 'throughput': '1216.75', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:36:01,888 >> {'loss': 0.3793, 'grad_norm': 6.838583946228027, 'learning_rate': 1.0202817560828287e-06, 'epoch': 0.9713024282560706, 'num_input_tokens_seen': 57671656, 'completed': '97.13% (880 / 906)', 'remaining time': '0:07:10', 'throughput': '1208.71', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:36:15,362 >> {'loss': 0.1679, 'grad_norm': 3.928147315979004, 'learning_rate': 1.0187526872645888e-06, 'epoch': 0.9724061810154525, 'num_input_tokens_seen': 57737192, 'completed': '97.24% (881 / 906)', 'remaining time': '0:06:53', 'throughput': '1215.94', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:36:28,972 >> {'loss': 0.2663, 'grad_norm': 4.4882330894470215, 'learning_rate': 1.0172834184585406e-06, 'epoch': 0.9735099337748344, 'num_input_tokens_seen': 57802728, 'completed': '97.35% (882 / 906)', 'remaining time': '0:06:36', 'throughput': '1203.80', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:36:42,398 >> {'loss': 0.1723, 'grad_norm': 3.8216326236724854, 'learning_rate': 1.0158739692713428e-06, 'epoch': 0.9746136865342163, 'num_input_tokens_seen': 57868264, 'completed': '97.46% (883 / 906)', 'remaining time': '0:06:20', 'throughput': '1220.36', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:36:55,886 >> {'loss': 0.1924, 'grad_norm': 4.116147041320801, 'learning_rate': 1.0145243585113936e-06, 'epoch': 0.9757174392935982, 'num_input_tokens_seen': 57933800, 'completed': '97.57% (884 / 906)', 'remaining time': '0:06:03', 'throughput': '1214.65', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:37:09,452 >> {'loss': 0.2267, 'grad_norm': 4.032740592956543, 'learning_rate': 1.0132346041885756e-06, 'epoch': 0.9768211920529801, 'num_input_tokens_seen': 57999336, 'completed': '97.68% (885 / 906)', 'remaining time': '0:05:47', 'throughput': '1207.79', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:37:22,872 >> {'loss': 0.4259, 'grad_norm': 6.308236122131348, 'learning_rate': 1.0120047235140178e-06, 'epoch': 0.977924944812362, 'num_input_tokens_seen': 58064872, 'completed': '97.79% (886 / 906)', 'remaining time': '0:05:30', 'throughput': '1220.80', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:37:36,327 >> {'loss': 0.4391, 'grad_norm': 6.122150421142578, 'learning_rate': 1.0108347328998642e-06, 'epoch': 0.9790286975717439, 'num_input_tokens_seen': 58130408, 'completed': '97.90% (887 / 906)', 'remaining time': '0:05:13', 'throughput': '1217.68', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:37:49,854 >> {'loss': 0.2207, 'grad_norm': 3.8909456729888916, 'learning_rate': 1.0097246479590569e-06, 'epoch': 0.9801324503311258, 'num_input_tokens_seen': 58195944, 'completed': '98.01% (888 / 906)', 'remaining time': '0:04:57', 'throughput': '1211.23', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:38:03,143 >> {'loss': 0.3534, 'grad_norm': 4.869685173034668, 'learning_rate': 1.008674483505126e-06, 'epoch': 0.9812362030905077, 'num_input_tokens_seen': 58261480, 'completed': '98.12% (889 / 906)', 'remaining time': '0:04:40', 'throughput': '1232.92', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:38:16,595 >> {'loss': 0.3576, 'grad_norm': 5.161973476409912, 'learning_rate': 1.0076842535519936e-06, 'epoch': 0.9823399558498896, 'num_input_tokens_seen': 58327016, 'completed': '98.23% (890 / 906)', 'remaining time': '0:04:24', 'throughput': '1217.95', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:38:29,903 >> {'loss': 0.5, 'grad_norm': 6.185498237609863, 'learning_rate': 1.0067539713137842e-06, 'epoch': 0.9834437086092715, 'num_input_tokens_seen': 58392552, 'completed': '98.34% (891 / 906)', 'remaining time': '0:04:07', 'throughput': '1231.16', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:38:43,301 >> {'loss': 0.2778, 'grad_norm': 4.712432384490967, 'learning_rate': 1.0058836492046506e-06, 'epoch': 0.9845474613686535, 'num_input_tokens_seen': 58458088, 'completed': '98.45% (892 / 906)', 'remaining time': '0:03:51', 'throughput': '1222.85', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:38:56,873 >> {'loss': 0.2725, 'grad_norm': 4.5431389808654785, 'learning_rate': 1.0050732988386082e-06, 'epoch': 0.9856512141280354, 'num_input_tokens_seen': 58523624, 'completed': '98.57% (893 / 906)', 'remaining time': '0:03:34', 'throughput': '1207.17', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:39:10,239 >> {'loss': 0.3557, 'grad_norm': 5.538437843322754, 'learning_rate': 1.0043229310293782e-06, 'epoch': 0.9867549668874173, 'num_input_tokens_seen': 58589160, 'completed': '98.68% (894 / 906)', 'remaining time': '0:03:17', 'throughput': '1225.82', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:39:23,597 >> {'loss': 0.3684, 'grad_norm': 5.204857349395752, 'learning_rate': 1.0036325557902454e-06, 'epoch': 0.9878587196467992, 'num_input_tokens_seen': 58654696, 'completed': '98.79% (895 / 906)', 'remaining time': '0:03:01', 'throughput': '1226.52', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:39:37,355 >> {'loss': 0.1827, 'grad_norm': 3.494764804840088, 'learning_rate': 1.0030021823339229e-06, 'epoch': 0.9889624724061811, 'num_input_tokens_seen': 58720232, 'completed': '98.90% (896 / 906)', 'remaining time': '0:02:44', 'throughput': '1190.91', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:39:50,773 >> {'loss': 0.429, 'grad_norm': 5.376889705657959, 'learning_rate': 1.0024318190724313e-06, 'epoch': 0.9900662251655629, 'num_input_tokens_seen': 58785768, 'completed': '99.01% (897 / 906)', 'remaining time': '0:02:28', 'throughput': '1220.98', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:40:04,204 >> {'loss': 0.326, 'grad_norm': 4.6836838722229, 'learning_rate': 1.0019214736169832e-06, 'epoch': 0.9911699779249448, 'num_input_tokens_seen': 58851304, 'completed': '99.12% (898 / 906)', 'remaining time': '0:02:11', 'throughput': '1219.89', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:40:17,620 >> {'loss': 0.2861, 'grad_norm': 4.479922771453857, 'learning_rate': 1.0014711527778844e-06, 'epoch': 0.9922737306843267, 'num_input_tokens_seen': 58916840, 'completed': '99.23% (899 / 906)', 'remaining time': '0:01:55', 'throughput': '1221.18', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:40:31,065 >> {'loss': 0.3314, 'grad_norm': 4.647156715393066, 'learning_rate': 1.0010808625644427e-06, 'epoch': 0.9933774834437086, 'num_input_tokens_seen': 58982376, 'completed': '99.34% (900 / 906)', 'remaining time': '0:01:38', 'throughput': '1218.61', 'gpu_mem_free': '30139MB'}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[INFO|trainer.py:3503] 2025-01-05 00:40:57,038 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-900
[INFO|configuration_utils.py:472] 2025-01-05 00:40:57,040 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-900/config.json
[INFO|configuration_utils.py:807] 2025-01-05 00:40:57,042 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-900/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-05 00:41:54,713 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-900/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-05 00:41:54,716 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-900/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-05 00:41:54,717 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-900/special_tokens_map.json
[WARNING|trainer.py:869] 2025-01-05 00:45:49,734 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 3600, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
/scratch3/workspace/ctpham_umass_edu-ft/envs/prolong-final/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[INFO|trainer.py:175] 2025-01-05 00:46:03,332 >> {'loss': 0.5011, 'grad_norm': 6.767654895782471, 'learning_rate': 1.000750608184886e-06, 'epoch': 0.9944812362030905, 'num_input_tokens_seen': 59047912, 'completed': '99.45% (901 / 906)', 'remaining time': '0:01:24', 'throughput': '49.31', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:46:16,772 >> {'loss': 0.2887, 'grad_norm': 4.77541446685791, 'learning_rate': 1.0004803940462948e-06, 'epoch': 0.9955849889624724, 'num_input_tokens_seen': 59113448, 'completed': '99.56% (902 / 906)', 'remaining time': '0:01:07', 'throughput': '1219.05', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:46:30,163 >> {'loss': 0.2941, 'grad_norm': 4.733397483825684, 'learning_rate': 1.0002702237545419e-06, 'epoch': 0.9966887417218543, 'num_input_tokens_seen': 59178984, 'completed': '99.67% (903 / 906)', 'remaining time': '0:00:50', 'throughput': '1223.53', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:46:43,629 >> {'loss': 0.2109, 'grad_norm': 4.308443069458008, 'learning_rate': 1.0001201001142449e-06, 'epoch': 0.9977924944812362, 'num_input_tokens_seen': 59244520, 'completed': '99.78% (904 / 906)', 'remaining time': '0:00:33', 'throughput': '1216.66', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:175] 2025-01-05 00:46:57,321 >> {'loss': 0.3868, 'grad_norm': 5.600368022918701, 'learning_rate': 1.000030025128729e-06, 'epoch': 0.9988962472406181, 'num_input_tokens_seen': 59310056, 'completed': '99.89% (905 / 906)', 'remaining time': '0:00:16', 'throughput': '1196.61', 'gpu_mem_free': '30131MB'}
[INFO|trainer.py:175] 2025-01-05 00:47:10,887 >> {'loss': 0.4341, 'grad_norm': 5.8607497215271, 'learning_rate': 1.0000000000000002e-06, 'epoch': 1.0, 'num_input_tokens_seen': 59375592, 'completed': '100.00% (906 / 906)', 'remaining time': '0:00:00', 'throughput': '1207.77', 'gpu_mem_free': '30139MB'}
[INFO|trainer.py:3503] 2025-01-05 00:47:37,040 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-906
[INFO|configuration_utils.py:472] 2025-01-05 00:47:37,043 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-906/config.json
[INFO|configuration_utils.py:807] 2025-01-05 00:47:37,044 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-906/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-05 00:48:35,231 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-906/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-05 00:48:35,235 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-906/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-05 00:48:35,236 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/checkpoint-906/special_tokens_map.json
[WARNING|trainer.py:869] 2025-01-05 00:52:30,011 >> Save streaming dataset state: {'epoch': 0, 'sample_in_epoch': 3624, 'num_canonical_nodes': 1, 'shuffle_seed': 42, 'initial_physical_nodes': 1}
[INFO|trainer.py:2394] 2025-01-05 00:52:30,601 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


[INFO|trainer.py:175] 2025-01-05 00:52:30,604 >> {'train_runtime': 15557.945, 'train_samples_per_second': 0.233, 'train_steps_per_second': 0.058, 'train_loss': 0.4684678308715094, 'epoch': 1.0, 'num_input_tokens_seen': 59375592, 'completed': '100.00% (906 / 906)', 'remaining time': '0:00:00', 'throughput': '0.00', 'gpu_mem_free': '21115MB'}
[INFO|trainer.py:3503] 2025-01-05 00:52:56,161 >> Saving model checkpoint to /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_
[INFO|configuration_utils.py:472] 2025-01-05 00:52:56,168 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/config.json
[INFO|configuration_utils.py:807] 2025-01-05 00:52:56,170 >> Configuration saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/generation_config.json
[INFO|modeling_utils.py:2807] 2025-01-05 00:53:56,574 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2684] 2025-01-05 00:53:56,578 >> tokenizer config file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/tokenizer_config.json
[INFO|tokenization_utils_base.py:2693] 2025-01-05 00:53:56,579 >> Special tokens file saved in /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/special_tokens_map.json
***** train metrics *****
  epoch                    =        1.0
  num_input_tokens_seen    =   59375592
  train_loss               =     0.4685
  train_runtime            = 4:19:17.94
  train_samples_per_second =      0.233
  train_steps_per_second   =      0.058
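
Note on the train metrics above: the figures can be cross-checked against each other with a little arithmetic. The sketch below is not part of the training run. It assumes the `h:mm:ss.xx` runtime string is the raw `train_runtime` in seconds with the fractional part truncated to two digits, and that `train_steps_per_second` is the final global step count (906) divided by the runtime. The helper names `format_runtime` and `steps_per_second` are hypothetical and chosen only for illustration.

```python
from datetime import timedelta


def format_runtime(seconds: float) -> str:
    """Render seconds as h:mm:ss.xx, truncating (not rounding) the fraction.

    Truncation is an assumption made so the result matches the logged string.
    """
    whole = int(seconds)
    hundredths = int((seconds - whole) * 100)
    return f"{timedelta(seconds=whole)}.{hundredths:02d}"


def steps_per_second(global_steps: int, runtime_seconds: float) -> float:
    """Average optimizer steps per wall-clock second over the whole run."""
    return global_steps / runtime_seconds


if __name__ == "__main__":
    runtime = 15557.945  # 'train_runtime' from the log, in seconds
    steps = 906          # 'completed': 906 / 906
    print(format_runtime(runtime))
    print(round(steps_per_second(steps, runtime), 3))
```

Under these assumptions, both values agree with the logged `train_runtime` and `train_steps_per_second` entries.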
[rank1]:[W105 00:53:59.672684990 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[rank3]:[W105 00:53:59.674433943 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[rank2]:[W105 00:53:59.676267421 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
wandb: - 0.015 MB of 0.015 MB uploaded
wandb: \ 0.015 MB of 0.021 MB uploaded
wandb: | 0.344 MB of 0.344 MB uploaded
wandb: 
wandb: Run history:
wandb:                 train/epoch ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:           train/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:             train/grad_norm █▄▃▂▂▂▂▂▂▂▁▂▂▃▂▂▂▂▂▂▁▂▂▂▁▁▁▁▂▁▂▂▁▂▂▂▂▂▃▁
wandb:         train/learning_rate ▂▄███████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
wandb:                  train/loss █▆▆▄▄▃▃▃▄▃▃▆▃█▄▇▅▃▄▄▂▃▃▃▂▂▂▁▃▂▃▃▁▂▃▂▂▄▃▁
wandb: train/num_input_tokens_seen ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: 
wandb: Run summary:
wandb:                  total_flos 6.684140179701105e+17
wandb:                 train/epoch 1.0
wandb:           train/global_step 906
wandb:             train/grad_norm 5.86075
wandb:         train/learning_rate 0.0
wandb:                  train/loss 0.4341
wandb: train/num_input_tokens_seen 59375592
wandb:                  train_loss 0.46847
wandb:               train_runtime 15557.945
wandb:    train_samples_per_second 0.233
wandb:      train_steps_per_second 0.058
wandb: 
wandb: 🚀 View run wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_ at: https://wandb.ai/chtmp223/prolong/runs/95wq5z4x
wandb: ⭐️ View project at: https://wandb.ai/chtmp223/prolong
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: /scratch3/workspace/ctpham_umass_edu-models/wp_hparam_prolong-512K-base_bsz-16_lr-1e-5_epochs-1_/wandb/run-20250104_203314-95wq5z4x/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
[rank0]:[W105 00:54:08.269053270 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())