[2024-12-19 19:29:19,864] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:29:21,447] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2024-12-19 19:29:21,447] [INFO] [runner.py:571:main] cmd = /vol3/ctr/.conda/envs/llava-rlhf/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMywgNCwgNSwgNl19 --master_addr=127.0.0.1 --master_port=29504 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed /vol3/home/ctr/llava-rlhf/LLaVA/scripts/zero3_offload.json --model_name_or_path /vol3/home/ctr/llava-rlhf/models/llava-RLAIF-V-7B --version v1 --data_path /vol3/home/ctr/llava-rlhf/datasets/aokvqa/llava_sft_data_value_aokvqa_sft_12_19.json --image_folder /vol3/home/ctr/llava-rlhf/datasets/coco/ --vision_tower /vol3/home/ctr/llava-rlhf/models/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir /vol3/home/ctr/llava-rlhf/models/llava-v1.5-7b-sft-prm-v1 --num_train_epochs 1 --per_device_train_batch_size 8 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 1 --learning_rate 1e-5 --weight_decay 0.05 --warmup_ratio 0.1 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 8 --lazy_preprocess True --report_to wandb [2024-12-19 19:29:23,682] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:29:25,306] [INFO] [launch.py:138:main] 0 NCCL_TIMEOUT=360 [2024-12-19 19:29:25,306] [INFO] [launch.py:138:main] 0 NCCL_IB_TIMEOUT=360 [2024-12-19 19:29:25,306] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [3, 4, 5, 6]} [2024-12-19 19:29:25,306] [INFO] 
[launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0 [2024-12-19 19:29:25,306] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]}) [2024-12-19 19:29:25,306] [INFO] [launch.py:163:main] dist_world_size=4 [2024-12-19 19:29:25,306] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=3,4,5,6 [2024-12-19 19:29:29,264] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:29:29,274] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:29:29,291] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:29:29,296] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:29:30,409] [INFO] [comm.py:637:init_distributed] cdb=None [2024-12-19 19:29:30,410] [INFO] [comm.py:637:init_distributed] cdb=None [2024-12-19 19:29:30,412] [INFO] [comm.py:637:init_distributed] cdb=None [2024-12-19 19:29:30,465] [INFO] [comm.py:637:init_distributed] cdb=None [2024-12-19 19:29:30,465] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information. wandb: Currently logged in as: 894699297roy (894699297roy-wuhan-university). Use `wandb login --relogin` to force relogin wandb: - Waiting for wandb.init()... wandb: \ Waiting for wandb.init()... You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. You are attempting to use Flash Attention 2.0 with a model not initialized on GPU.
Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. wandb: | Waiting for wandb.init()... wandb: / Waiting for wandb.init()... wandb: Tracking run with wandb version 0.18.5 wandb: Run data is saved locally in /vol3/home/ctr/llava-rlhf/LLaVA/wandb/run-20241219_192932-47os2r1a wandb: Run `wandb offline` to turn off syncing. wandb: Syncing run decent-forest-5 wandb: ⭐️ View project at https://wandb.ai/894699297roy-wuhan-university/llava_prm_sft wandb: 🚀 View run at https://wandb.ai/894699297roy-wuhan-university/llava_prm_sft/runs/47os2r1a You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [2024-12-19 19:29:46,503] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 299, num_elems = 6.76B Loading checkpoint shards: 0%| | 0/3 [00:00) value label: tensor([[0.4668], [0.4668], [1.0000], [0.5000], [0.8008], [0.8008], [0.8008], [0.2002]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[-2.2500], [-0.4258], [-1.8281], [ 0.0398], [-1.9531], [-0.6875], [-0.8359], [-0.1357]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.6016], [0.6016], [0.4668], [0.4668], [0.7500], [0.2500]], device='cuda:3', dtype=torch.bfloat16) predicted value: tensor([[-1.3281], [-2.7188], [-0.9570], [-2.0781], [-1.4922], [-1.1797], [ 0.0474], [ 0.5508]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.8008], [0.4668], [0.2002], [0.8008], [0.8320], [0.3340], [0.2500]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[-1.3516], [-2.1406], [-1.7109], [-0.8008], [-0.7109], [-2.3594], [-1.1953], [-1.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.2500], [0.2500], [0.2500], [0.2002], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 
1.078125 loss: 0.91796875 loss: 0.75 loss: 1.0390625 predicted value: tensor([[-1.6484], [-1.5000], [-1.6016], [-0.9492], [-1.1484], [-1.9453], [-1.1250], [-0.6680]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.3750], [0.8008], [0.8008], [0.6680], [0.4004], [0.3340]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[-1.4375], [-2.6406], [-1.2812], [-0.8359], [-2.3281], [-0.6797], [-1.9141], [-1.8828]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [1.0000], [0.2002], [0.8008], [0.5000], [0.2002]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[-1.9844], [-0.4082], [-1.9766], [-2.0781], [-0.0801], [-0.7500], [-0.6836], [-1.3281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[-1.4297], [-0.4902], [-2.3750], [ 0.2305], [-1.5156], [-1.8203], [-0.9336], [-1.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.2500], [0.5000], [0.8320], [0.2002], [0.2002]], device='cuda:3', dtype=torch.bfloat16) value label: tensor([[0.4668], [0.8008], [0.8320], [1.0000], [0.3750], [1.0000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 1.265625 loss: 0.96875 loss: 0.84375 loss: 0.96875 predicted value: tensor([[-0.5352], [-1.7656], [-1.7422], [-0.1025], [-1.3359], [-1.0312], [ 0.4688], [-0.4336]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.6680], [0.8008], [0.0204], [1.0000], [0.2002], [0.1670]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[-0.3379], [-1.3125], [-0.5117], [-1.1641], [-1.8203], [-0.3223], [-1.2969], [-0.8203]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[-2.4375], [-0.5820], [-0.5977], [-0.6133], [-0.4297], [-2.1562], [ 0.4570], [-0.8320]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[-1.3750], 
[-1.4766], [-2.2812], [-1.7969], [-1.1641], [-0.5156], [ 0.1621], [-1.9531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.6016], [1.0000], [0.2002], [0.5000], [0.3750], [0.4004]], device='cuda:3', dtype=torch.bfloat16) value label: tensor([[0.5547], [0.6680], [0.8320], [0.6016], [0.6016], [0.6016], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) value label: tensor([[1.0000], [0.2500], [0.8008], [0.6680], [0.2500], [1.0000], [0.0400], [0.5000]], device='cuda:2', dtype=torch.bfloat16) loss: 0.5859375loss: 1.03125 loss: 0.56640625 loss: 0.7734375 predicted value: tensor([[-0.6250], [-1.0938], [-1.1562], [-0.3184], [-1.2578], [-1.4219], [-1.2188], [-0.4883]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[-0.5664], [-2.0938], [-0.5391], [-2.4219], [-0.7695], [-0.1436], [-0.9570], [-1.9609]], device='cuda:1', dtype=torch.bfloat16, grad_fn=)value label: tensor([[0.2500], [1.0000], [1.0000], [0.6016], [0.8008], [1.0000], [0.5000], [0.5000]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[-2.2656], [-1.9688], [ 0.9492], [-0.7227], [-1.0781], [-1.2344], [-0.9922], [-1.8047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [1.0000], [0.2500], [0.2002], [0.4004], [0.6680], [0.2002]], device='cuda:1', dtype=torch.bfloat16) value label: tensor([[0.5547], [0.8008], [1.0000], [0.4668], [0.3340], [0.4004], [0.3340], [0.2002]], device='cuda:3', dtype=torch.bfloat16) predicted value: tensor([[-1.4922], [-2.7188], [ 0.1055], [-1.6016], [-0.4844], [-0.2520], [-0.0432], [-0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.2002], [0.3750], [0.6680], [0.5000], [0.6016], [0.5000], [0.3340]], device='cuda:0', dtype=torch.bfloat16) /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer 
recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) loss: 0.59375 loss: 0.859375 loss: 0.7734375 loss: 0.85546875 0%| | 1/983 [00:39<10:46:38, 39.51s/it] {'loss': 3.4678, 'learning_rate': 0.0, 'epoch': 0.0} 0%| | 1/983 [00:39<10:46:38, 39.51s/it]predicted value: tensor([[1.2734], [2.2969], [2.2500], [1.9375], [2.9062], [1.9922], [0.4746], [2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[1.1250], [3.6719], [2.2344], [2.9688], [2.5781], [1.7656], [0.2754], [3.2188]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5547], [0.3340], [0.8008], [0.4668], [0.6016], [0.5000], [0.1670]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[1.9141], [1.7266], [2.8906], [2.6406], [3.4688], [2.3594], [3.9688], [1.9297]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[ 1.7578], [ 1.9062], [-1.3594], [ 1.8047], [ 2.7969], [ 2.7500], [ 2.4062], [ 3.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.7500], [0.8008], [0.4668], [0.3340], [0.1670]], device='cuda:3', dtype=torch.bfloat16) value label: tensor([[1.0000], [0.4668], [0.5547], [0.5703], [1.0000], [1.0000], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) value label: tensor([[1.0000], [1.0000], [0.3750], [0.4668], [0.5000], [0.0400], [0.2500], [0.4004]], device='cuda:2', dtype=torch.bfloat16) loss: 1.0 loss: 0.86328125loss: 1.2578125 loss: 0.75 predicted value: tensor([[2.0938], [1.9766], [1.3281], [1.7969], [2.7344], [2.0781], [2.1406], [3.0781]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[1.9141], [1.6328], [0.4668], [2.0000], [0.0140], [3.5156], [2.0469], [2.6719]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [0.3750], [0.2500], [0.6016], [0.2002], [0.2002]], device='cuda:2', dtype=torch.bfloat16) value label: tensor([[0.5547], [0.4668], 
[0.6680], [0.8320], [1.0000], [0.6680], [0.0400], [0.2500]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[1.7344], [2.1562], [3.1094], [2.2031], [2.2500], [1.7031], [3.3438], [2.7969]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[2.6719], [1.7266], [0.8281], [2.2500], [0.9414], [2.9531], [1.5469], [3.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.2500], [0.5000], [0.6016], [0.4004], [0.6016], [0.2002]], device='cuda:0', dtype=torch.bfloat16) value label: tensor([[1.0000], [0.7500], [0.8008], [0.4668], [0.8008], [0.6016], [0.5000], [0.2002]], device='cuda:3', dtype=torch.bfloat16) loss: 0.81640625 loss: 0.73828125loss: 0.90625 loss: 0.734375 predicted value: tensor([[-0.5078], [ 2.7188], [ 2.2812], [ 3.7031], [ 3.3594], [ 2.4062], [ 2.8594], [ 2.4531]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[1.3047], [2.2344], [1.7812], [1.9766], [3.2344], [1.7734], [2.6719], [3.2812]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.8008], [0.7500], [0.5000], [0.4004], [0.1670]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[ 2.0781], [-0.2344], [ 2.6875], [ 2.6875], [-0.1201], [ 2.9844], [ 2.6719], [ 1.3047]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.4668], [1.0000], [1.0000], [0.6016], [0.6016], [0.4004], [0.2500]], device='cuda:3', dtype=torch.bfloat16) predicted value: tensor([[1.5547], [1.5234], [1.9531], [1.7734], [2.0312], [2.9219], [2.7344], [2.6719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.4668], [0.8008], [0.7500], [1.0000], [0.6016], [0.6016], [0.2002]], device='cuda:0', dtype=torch.bfloat16) value label: tensor([[1.0000], [1.0000], [0.6680], [0.6016], [0.3340], [0.5547], [0.3340], [0.2500]], device='cuda:1', dtype=torch.bfloat16) loss: 0.88671875 loss: 0.66796875 loss: 
1.2890625 loss: 0.66015625 predicted value: tensor([[2.3281], [2.8125], [0.6094], [2.8438], [2.5938], [3.5938], [2.6094], [3.0938]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.8555], [0.4668], [0.6680], [0.2500], [0.4004], [0.4004], [0.4004]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[2.8281], [1.2344], [2.9375], [2.7500], [1.2500], [2.1562], [0.6719], [2.0469]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.5547], [1.0000], [0.6016], [0.3750], [0.4004], [0.2002]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[1.6406], [0.7617], [2.4062], [3.0000], [2.8281], [2.2969], [3.3750], [3.4531]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[2.8125], [1.6016], [2.7344], [1.8438], [2.4219], [2.9219], [1.4766], [1.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [1.0000], [0.3340], [1.0000], [0.5000], [1.0000], [0.1670]], device='cuda:3', dtype=torch.bfloat16) value label: tensor([[0.2500], [0.4648], [0.8320], [0.4668], [0.3750], [0.8008], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.6015625 loss: 1.265625 loss: 1.0234375 loss: 0.79296875 0%| | 2/983 [01:03<8:17:54, 30.45s/it] {'loss': 3.5635, 'learning_rate': 1.5084420062289415e-06, 'epoch': 0.0} 0%| | 2/983 [01:03<8:17:54, 30.45s/it]predicted value: tensor([[1.5391], [1.9688], [1.4062], [2.9375], [1.4766], [2.6406], [3.2188], [2.1406]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[1.6016], [1.6328], [2.8594], [1.2891], [2.6562], [2.1562], [2.0156], [2.8750]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [0.2002], [0.2002], [0.5000], [0.5000], [0.2500], [0.5000]], device='cuda:2', dtype=torch.bfloat16) value label: tensor([[0.4668], [0.8320], [0.8008], [1.0000], [0.7500], [0.3340], [0.2002], [0.2500]], device='cuda:1', 
dtype=torch.bfloat16) predicted value: tensor([[1.5625], [3.6719], [2.5938], [2.0000], [2.2188], [3.9219], [2.3750], [2.8281]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[1.6484], [3.2188], [3.1719], [3.0781], [3.4219], [2.6562], [2.7812], [2.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.2500], [0.6016], [0.2500], [0.5000], [0.6016], [0.2002]], device='cuda:3', dtype=torch.bfloat16) value label: tensor([[0.5547], [0.2002], [0.3750], [0.0400], [0.6016], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.90234375 loss: 1.609375 loss: 1.328125 loss: 0.73046875 predicted value: tensor([[3.5000], [0.8672], [3.1719], [2.3594], [3.0156], [3.9062], [0.0175], [2.9688]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[1.8750], [1.8203], [1.8984], [2.3125], [2.9219], [3.1250], [1.7266], [1.5781]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [1.0000], [0.8320], [1.0000], [0.3340], [0.4004], [0.2500]], device='cuda:2', dtype=torch.bfloat16) value label: tensor([[0.8320], [0.6680], [0.5547], [1.0000], [0.5000], [0.3340], [0.2500], [0.2500]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[2.6250], [1.4297], [2.3594], [1.3672], [0.8750], [0.0356], [3.8906], [1.8359]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)predicted value: tensor([[1.9375], [0.6445], [1.8438], [3.3750], [3.0156], [2.2969], [1.5078], [2.2500]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6016], [1.0000], [1.0000], [0.3340], [0.6016], [0.5000], [0.1670]], device='cuda:3', dtype=torch.bfloat16) value label: tensor([[1.0000], [0.4668], [1.0000], [1.0000], [0.5000], [0.3340], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 1.1796875 loss: 0.64453125loss: 0.71875 loss: 0.73828125 predicted value: tensor([[1.0234], [2.0781], [2.1875], [1.5312], [3.0625], 
[2.3750], [2.8906], [2.2344]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[ 0.0796], [ 2.0312], [ 2.6250], [ 2.2969], [ 2.6875], [ 2.0625], [-1.5703], [ 2.3281]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.3340], [1.0000], [0.4668], [0.2500], [0.2500], [0.4004], [0.2500]], device='cuda:1', dtype=torch.bfloat16) value label: tensor([[1.0000], [1.0000], [0.4668], [0.4668], [0.2002], [0.6016], [0.4004], [0.2002]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[ 2.5625], [ 1.8984], [ 3.0938], [ 0.2539], [ 1.9453], [ 3.4844], [ 1.2422], [-0.7734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=)predicted value: tensor([[2.2812], [1.7422], [2.2031], [1.5312], [2.8594], [1.8359], [2.4531], [2.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.4668], [0.3750], [0.6016], [0.2002], [0.4004], [0.2002]], device='cuda:3', dtype=torch.bfloat16) value label: tensor([[0.3340], [0.3340], [0.6016], [0.4668], [0.8008], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.83984375 loss: 0.78125 loss: 0.8828125 loss: 0.83203125 predicted value: tensor([[2.0312], [2.3281], [1.9141], [2.3594], [2.4688], [2.9375], [3.1719], [2.3906]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[-0.1484], [ 2.3438], [ 2.4844], [ 3.2188], [ 3.0625], [ 2.9844], [ 2.3594], [-0.3789]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.6680], [0.7500], [0.2500], [0.3340], [0.2500]], device='cuda:1', dtype=torch.bfloat16) value label: tensor([[0.4668], [0.4668], [0.4668], [0.3750], [0.3340], [1.0000], [0.4004], [0.1670]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[2.6250], [1.4688], [1.8203], [2.4688], [2.7344], [1.3438], [2.8906], [2.9844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)predicted value: tensor([[ 1.2344], [ 1.4766], [ 1.2734], 
[-1.3594], [ 2.3438], [ 2.9844], [ 2.0156], [ 2.1875]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6016], [0.5000], [0.5000], [0.2002], [1.0000], [1.0000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) value label: tensor([[0.3750], [0.4668], [0.7148], [0.6680], [0.2500], [0.8008], [0.2002], [0.4004]], device='cuda:3', dtype=torch.bfloat16) loss: 0.68359375 loss: 0.84765625 loss: 0.95703125 loss: 0.98828125 0%| | 3/983 [01:27<7:28:42, 27.47s/it] {'loss': 3.666, 'learning_rate': 2.390824014385461e-06, 'epoch': 0.0} 0%| | 3/983 [01:27<7:28:42, 27.47s/it]predicted value: tensor([[0.9453], [1.2969], [1.9531], [2.2344], [1.8672], [3.0312], [0.0579], [2.2656]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [1.0000], [0.2002], [0.6680], [0.6680], [0.4004], [0.7500]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[2.7344], [2.2969], [1.8516], [0.7031], [2.8906], [2.6562], [2.2812], [2.4531]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.2002], [0.3750], [0.7500], [1.0000], [0.3340], [0.1670]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[ 2.3750], [ 1.0703], [ 0.7422], [ 1.7891], [-0.0430], [ 3.1094], [ 3.2500], [ 1.2734]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[2.2656], [1.5078], [1.4609], [2.6719], [2.5625], [2.2812], [2.3594], [2.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.2002], [0.8008], [1.0000], [1.0000], [1.0000], [0.2002]], device='cuda:3', dtype=torch.bfloat16) value label: tensor([[0.4668], [0.6680], [0.4668], [0.5547], [0.2002], [1.0000], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.76171875 loss: 0.82421875 loss: 0.5234375 loss: 0.4765625 predicted value: tensor([[1.2969], [2.5625], [0.9102], [0.5625], [1.8281], [2.0781], [1.8125], [3.5625]], device='cuda:1', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.8008], [0.2500], [0.5000], [0.4004], [0.4004]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[2.3594], [0.6445], [2.6719], [2.2969], [2.8281], [2.3125], [1.5859], [2.5312]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6680], [0.3750], [0.4668], [0.6016], [0.4004], [0.5000], [0.2500]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[2.5000], [1.8594], [1.4375], [1.4531], [1.2891], [2.3438], [2.1562], [2.9688]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [0.8008], [0.6016], [0.8008], [1.0000], [1.0000], [0.2500]], device='cuda:3', dtype=torch.bfloat16) predicted value: tensor([[ 1.1250], [ 1.5234], [ 2.4062], [ 1.8047], [-0.1992], [ 2.5312], [ 1.6328], [ 2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8008], [1.0000], [0.6680], [0.4668], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.83984375 loss: 0.478515625 loss: 0.52734375 loss: 0.62890625 predicted value: tensor([[ 1.6094], [ 1.8906], [ 2.4688], [ 2.7344], [-0.1226], [ 0.4609], [ 2.8750], [ 2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[ 2.0469], [ 2.3438], [ 0.9297], [ 2.2188], [ 2.9531], [-0.9336], [ 2.7656], [ 3.5312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [1.0000], [0.2500], [0.2852], [0.4004], [0.2002]], device='cuda:2', dtype=torch.bfloat16) value label: tensor([[0.5547], [0.4668], [0.6680], [0.4668], [0.5000], [0.5000], [0.6016], [0.2500]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[ 0.8750], [ 2.1562], [ 2.6406], [ 2.4062], [ 2.0000], [ 2.2031], [-0.6328], [ 2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8008], [0.8008], [0.5000], [0.8008], [1.0000], 
[0.2002]], device='cuda:0', dtype=torch.bfloat16) predicted value: tensor([[2.0938], [0.7852], [2.2812], [2.1875], [2.1094], [3.0625], [2.2344], [2.7188]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.8320], [0.6016], [0.6016], [1.0000], [0.0400], [0.1670]], device='cuda:3', dtype=torch.bfloat16) loss: 0.59375 loss: 0.640625 loss: 1.015625 loss: 0.73828125 predicted value: tensor([[3.4375], [1.9531], [2.6406], [2.4531], [2.1094], [2.3438], [2.7656], [1.8047]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[2.0469], [1.6328], [1.5312], [2.5625], [2.8125], [2.2031], [1.5625], [3.7812]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.8320], [1.0000], [0.1670], [0.8320], [0.4277], [0.5000], [0.5000]], device='cuda:2', dtype=torch.bfloat16) value label: tensor([[0.4668], [0.3750], [0.8320], [0.6016], [0.4668], [0.3340], [0.5000], [0.4004]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[2.4375], [1.7656], [1.5391], [1.6875], [1.8359], [2.4688], [2.4219], [1.7578]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[ 1.5547], [ 1.4375], [ 1.7422], [ 2.2500], [ 1.5859], [ 1.9141], [-0.7617], [-0.4785]], device='cuda:0', dtype=torch.bfloat16, grad_fn=)value label: tensor([[1.0000], [0.5547], [1.0000], [0.8008], [0.6680], [0.7500], [0.3340], [0.5000]], device='cuda:3', dtype=torch.bfloat16) value label: tensor([[0.8320], [0.8008], [0.2500], [0.7500], [0.8008], [0.4004], [0.5000], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.93359375 loss: 0.46484375 loss: 0.33984375 loss: 0.94140625 0%| | 4/983 [01:51<7:02:22, 25.89s/it] {'loss': 2.6821, 'learning_rate': 3.016884012457883e-06, 'epoch': 0.0} 0%| | 4/983 [01:51<7:02:22, 25.89s/it]predicted value: tensor([[1.4062], [2.6719], [3.4062], [1.8672], [0.9453], [2.3281], [1.6250], [1.0312]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], 
[0.3750], [1.0000], [0.8008], [0.6016], [0.2500], [0.3340], [0.2500]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[ 1.1406], [ 2.2031], [ 1.6719], [ 1.5078], [ 1.3359], [-0.9414], [ 2.3750], [ 2.0156]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.4668], [1.0000], [0.4668], [0.4668], [0.4004], [0.2500]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[1.1562], [1.5703], [2.7344], [1.1250], [1.1797], [0.9883], [0.7109], [1.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.3750], [0.7500], [0.7500], [0.6016], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) predicted value: tensor([[ 1.4297], [ 2.1719], [ 2.2969], [ 1.5625], [ 1.5547], [ 2.1094], [ 1.6250], [-0.1279]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8008], [0.4668], [0.3340], [0.5000], [0.2852], [0.5000]], device='cuda:3', dtype=torch.bfloat16) loss: 0.427734375 loss: 0.275390625 loss: 0.4140625 loss: 0.61328125 predicted value: tensor([[1.0000], [0.6289], [2.6875], [0.8125], [1.2031], [2.1250], [2.6250], [2.7969]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3750], [0.4668], [0.4668], [0.7500], [0.6016], [0.2002]], device='cuda:1', dtype=torch.bfloat16) predicted value: tensor([[1.5078], [0.7266], [1.6328], [2.4219], [1.2422], [2.2031], [1.6484], [2.2500]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3750], [1.0000], [0.3750], [1.0000], [0.7500], [0.4004], [0.2002]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[0.0469], [0.5977], [0.4199], [1.5547], [1.3438], [1.7812], [1.5938], [2.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[1.0391], [1.3203], [2.0469], [1.7578], [0.5664], [1.2422], [1.5312], [2.2344]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[1.0000], [0.2500], [0.4668], [0.2500], [0.3340], [0.4277], [0.3340], [0.2500]], device='cuda:3', dtype=torch.bfloat16) value label: tensor([[0.2500], [1.0000], [0.3145], [0.2500], [0.7148], [0.7500], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.435546875 loss: 0.255859375 loss: 0.375 loss: 0.59765625 predicted value: tensor([[ 1.1562], [-0.0262], [ 1.3125], [ 1.7109], [ 0.3496], [ 2.5625], [ 2.0625], [ 2.0000]], device='cuda:1', dtype=torch.bfloat16, grad_fn=) predicted value: tensor([[0.9375], [1.5547], [1.8594], [1.9609], [1.6562], [2.2969], [2.0469], [1.5234]], device='cuda:2', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.8008], [1.0000], [0.4668], [0.6680], [0.5000], [0.2002]], device='cuda:1', dtype=torch.bfloat16) value label: tensor([[0.5547], [0.3340], [0.3340], [0.3340], [0.2500], [0.4668], [0.2500], [0.2500]], device='cuda:2', dtype=torch.bfloat16) predicted value: tensor([[1.5078], [2.4062], [1.6328], [2.3594], [1.3203], [1.5469], [2.0781], [1.0156]], device='cuda:3', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.2500], [0.4668], [0.2002], [0.7500], [0.3340], [0.0625], [0.1670]], device='cuda:3', dtype=torch.bfloat16) predicted value: tensor([[0.6250], [1.1406], [1.3516], [0.9805], [2.0625], [2.3125], [1.5156], [0.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.3340], [0.5000], [0.6016], [0.5000], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) [2024-12-19 19:34:42,478] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:34:44,215] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. 
[2024-12-19 19:34:44,216] [INFO] [runner.py:571:main] cmd = /vol3/ctr/.conda/envs/llava-rlhf/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMywgNCwgNSwgNl19 --master_addr=127.0.0.1 --master_port=29504 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed /vol3/home/ctr/llava-rlhf/LLaVA/scripts/zero3_offload.json --model_name_or_path /vol3/home/ctr/llava-rlhf/models/llava-RLAIF-V-7B --version v1 --data_path /vol3/home/ctr/llava-rlhf/datasets/aokvqa/llava_sft_data_value_aokvqa_sft_12_19.json --image_folder /vol3/home/ctr/llava-rlhf/datasets/coco/ --vision_tower /vol3/home/ctr/llava-rlhf/models/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir /vol3/home/ctr/llava-rlhf/models/llava-v1.5-7b-sft-prm-v1 --num_train_epochs 1 --per_device_train_batch_size 16 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 1 --learning_rate 1e-5 --weight_decay 0.05 --warmup_ratio 0.1 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 8 --lazy_preprocess True --report_to wandb [2024-12-19 19:34:46,522] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:34:48,111] [INFO] [launch.py:138:main] 0 NCCL_TIMEOUT=360 [2024-12-19 19:34:48,111] [INFO] [launch.py:138:main] 0 NCCL_IB_TIMEOUT=360 [2024-12-19 19:34:48,111] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [3, 4, 5, 6]} [2024-12-19 19:34:48,111] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0 [2024-12-19 19:34:48,111] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]}) [2024-12-19 19:34:48,111] [INFO] [launch.py:163:main] dist_world_size=4 [2024-12-19
19:34:48,111] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=3,4,5,6 [2024-12-19 19:34:52,026] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:34:52,070] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:34:52,117] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:34:52,263] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-12-19 19:34:53,135] [INFO] [comm.py:637:init_distributed] cdb=None [2024-12-19 19:34:53,187] [INFO] [comm.py:637:init_distributed] cdb=None [2024-12-19 19:34:53,213] [INFO] [comm.py:637:init_distributed] cdb=None [2024-12-19 19:34:53,475] [INFO] [comm.py:637:init_distributed] cdb=None [2024-12-19 19:34:53,475] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information. You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. wandb: Currently logged in as: 894699297roy (894699297roy-wuhan-university). Use `wandb login --relogin` to force relogin wandb: Tracking run with wandb version 0.18.5 wandb: Run data is saved locally in /vol3/home/ctr/llava-rlhf/LLaVA/wandb/run-20241219_193455-wrrxxq6h wandb: Run `wandb offline` to turn off syncing. 
wandb: Syncing run upbeat-violet-6 wandb: ⭐️ View project at https://wandb.ai/894699297roy-wuhan-university/llava_prm_sft wandb: 🚀 View run at https://wandb.ai/894699297roy-wuhan-university/llava_prm_sft/runs/wrrxxq6h You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [2024-12-19 19:35:07,260] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 299, num_elems = 6.76B Loading checkpoint shards: 0%| | 0/3 [00:00) value label: tensor([[1.0000], [0.8555], [0.7500], [0.8008], [0.3750], [0.8008], [0.6680], [0.4668], [0.6016], [0.0204], [0.8008], [0.8320], [0.3340], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.1748046875 loss: 0.1748046875loss: 0.373046875 loss: 0.1689453125 predicted value: tensor([[ 0.9844], [ 1.2891], [ 3.1094], [ 1.5312], [ 0.4609], [ 0.5977], [ 0.5469], [ 1.3828], [ 2.5156], [ 0.1523], [ 1.0391], [ 1.3516], [ 1.2500], [-0.0630], [ 0.0066], [ 0.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.2500], [0.4668], [0.6680], [0.8008], [0.3340], [0.2002], [0.2002], [0.8008], [0.6016], [0.4004], [0.3340], [0.0400], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.205078125 loss: 0.30859375 loss: 0.1298828125 loss: 0.255859375 predicted value: tensor([[ 1.5000], [ 0.8203], [ 0.8242], [ 0.9141], [ 2.6250], [ 1.4453], [-1.3281], [ 0.5117], [ 0.2559], [-0.7734], [ 0.7031], [-0.6523], [ 0.6406], [ 1.1562], [ 0.8711], [ 2.0000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.2500], [0.4668], [0.4668], [0.8008], [0.5000], [1.0000], [0.6016], [0.3750], [0.4668], [0.2002], [1.0000], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.2099609375 loss: 0.296875loss: 0.1318359375 loss: 0.228515625 predicted value: tensor([[ 1.2734], [ 
1.7031], [-0.0564], [ 1.6094], [ 2.0938], [ 1.1875], [ 0.8867], [ 1.4219], [ 1.4922], [ 1.3047], [ 1.0312], [ 1.6562], [ 0.6602], [ 1.6016], [ 1.5156], [ 0.5977]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.4668], [0.4668], [0.8320], [1.0000], [1.0000], [0.3750], [0.8008], [0.5000], [0.0400], [0.6016], [0.6680], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:1330: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. 
(Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) loss: 0.2041015625 loss: 0.375 loss: 0.1943359375 loss: 0.185546875 0%| | 1/492 [00:45<6:14:20, 45.74s/it] {'loss': 0.9043, 'learning_rate': 0.0, 'epoch': 0.0} 0%| | 1/492 [00:45<6:14:20, 45.74s/it]predicted value: tensor([[-0.4531], [-2.0781], [-2.3906], [-2.0781], [-2.0625], [-2.2188], [-1.6562], [-0.7539], [-1.5391], [-2.7656], [-1.0547], [-1.3281], [-1.4844], [-1.0781], [-1.6719], [-2.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4648], [0.4668], [0.4668], [0.8320], [0.4668], [0.6016], [0.6680], [0.2500], [0.3340], [0.2852], [1.0000], [0.5000], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 1.203125 loss: 1.2578125loss: 1.171875 loss: 1.4765625 predicted value: tensor([[-1.1016], [-1.0000], [-1.5859], [-2.0000], [-2.6562], [-1.8203], [-1.2188], [-1.5234], [-0.8359], [-0.8789], [-1.7969], [-1.6641], [-2.3750], [-1.7578], [-2.3750], [-0.6641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3340], [0.8008], [1.0000], [0.6016], [0.6016], [1.0000], [0.5000], [0.6680], [0.2500], [1.0000], [0.4004], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 1.421875 loss: 1.2578125 loss: 1.4453125 loss: 1.3671875 predicted value: tensor([[-2.2656], [-1.2188], [-1.3438], [-1.3672], [-2.0469], [-1.8203], [-2.2500], [-2.1719], [-2.1875], [-2.2656], [-2.2344], [ 0.4727], [-2.2188], [-1.5703], [-1.6250], [-1.6797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [1.0000], [0.6680], [0.5000], [0.2002], [0.8008], [0.2002], [0.6680], [0.6016], [0.7500], [1.0000], [1.0000], [0.1670], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 1.5 loss: 1.4453125 loss: 1.40625 loss: 1.3359375 predicted value: tensor([[-0.6719], [-1.4219], [-3.0625], 
[-2.3438], [-0.7188], [-1.6797], [-2.0938], [-1.9219], [-1.8906], [-1.8047], [-3.1562], [-1.9297], [-1.8750], [-2.5781], [-0.8984], [-1.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.3750], [1.0000], [1.0000], [1.0000], [0.2002], [0.4668], [0.3340], [1.0000], [0.8008], [0.6016], [0.0400], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 1.4375 loss: 1.640625 loss: 1.515625 loss: 1.6484375 0%| | 2/492 [01:17<5:05:18, 37.38s/it] {'loss': 5.6328, 'learning_rate': 1.7718382013555792e-06, 'epoch': 0.0} 0%| | 2/492 [01:17<5:05:18, 37.38s/it]predicted value: tensor([[-1.8438], [-2.3438], [-1.6406], [-2.1562], [-1.0625], [-3.1406], [-2.0938], [-1.5547], [-0.8555], [-2.0312], [-0.1235], [-2.6406], [-2.8281], [-2.2344], [-1.4219], [-2.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [1.0000], [0.4668], [0.3340], [0.4668], [0.4668], [0.4668], [0.6016], [0.4668], [0.2500], [0.3340], [0.3340], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 1.4296875 loss: 1.546875loss: 1.53125 loss: 1.4921875 predicted value: tensor([[-1.1797], [ 0.9453], [-2.4375], [-1.2500], [-1.7578], [-1.8359], [-2.4844], [-1.0859], [-3.2188], [-1.0234], [-0.0894], [-2.0781], [-1.8125], [-1.9844], [-2.5000], [-1.7891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.2500], [0.8320], [0.4668], [0.5547], [0.2500], [0.3340], [0.6016], [0.6016], [0.6680], [0.4004], [0.3340], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 1.8515625 loss: 1.28125loss: 1.2890625 loss: 1.5390625 predicted value: tensor([[-0.8945], [-1.8750], [-0.6289], [-2.0781], [-1.4609], [-2.1875], [-1.4375], [-1.8672], [-1.2109], [-1.6719], [-1.8906], [-0.9648], [-2.0469], [-1.2734], [-1.0312], [-2.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.5547], [0.3750], 
[0.4668], [0.3145], [0.5000], [1.0000], [0.6680], [0.5000], [0.8008], [0.6016], [0.3145], [0.2500], [0.2002], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 1.6875 loss: 1.1484375loss: 1.046875 loss: 1.203125 predicted value: tensor([[-3.0781], [-1.4453], [-1.3438], [-2.0156], [-1.8594], [-2.1250], [-1.4141], [-1.2734], [-1.1094], [-1.7812], [-2.2656], [-2.4219], [-1.5703], [-0.5938], [-1.4531], [-2.6875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.3340], [0.3145], [0.3340], [0.5547], [0.2002], [0.8008], [0.2500], [0.8008], [0.6680], [0.7500], [0.2500], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 1.3359375 loss: 1.28125 loss: 1.2734375 loss: 1.3515625 1%| | 3/492 [01:48<4:41:39, 34.56s/it] {'loss': 5.5723, 'learning_rate': 2.808297106493815e-06, 'epoch': 0.01} 1%| | 3/492 [01:48<4:41:39, 34.56s/it]predicted value: tensor([[-1.8047], [-1.6250], [-1.2188], [-1.2188], [-2.3438], [-1.4453], [-1.3281], [-1.4141], [-2.1719], [-1.7969], [-1.3438], [-1.6328], [-1.0469], [-1.6016], [-1.1562], [-2.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [0.5547], [0.3750], [0.4668], [0.5547], [0.4668], [0.4668], [1.0000], [0.3750], [0.0400], [1.0000], [1.0000], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.98046875 loss: 1.2265625loss: 1.0859375 loss: 1.296875 predicted value: tensor([[-0.6953], [-1.2812], [-1.7734], [-1.4219], [-1.3438], [-1.3750], [-2.2031], [-0.9062], [-1.6875], [-2.6719], [-1.6484], [-1.7109], [-1.3125], [-1.8750], [-1.9766], [-1.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [0.3750], [1.0000], [0.3340], [0.4668], [0.5000], [0.6016], [0.6680], [0.6016], [0.6016], [0.4004], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 1.359375 loss: 1.2109375loss: 1.1796875 loss: 1.1171875 predicted value: 
tensor([[-2.3125], [-1.0625], [-0.8828], [-0.9219], [-1.6641], [-1.5000], [-0.9414], [-0.8984], [-1.3594], [-2.3438], [-0.8398], [-0.6289], [-0.1543], [-2.3750], [-1.9688], [-1.5625]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.8008], [0.6680], [0.1670], [0.2500], [0.3750], [0.6680], [0.3340], [0.5000], [0.5000], [0.4004], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 1.109375 loss: 0.91796875loss: 0.9453125 loss: 1.25 predicted value: tensor([[-1.7656], [-1.8516], [-0.8984], [-1.3125], [-2.0156], [-1.1719], [-2.2031], [-2.0781], [-2.3125], [-1.1250], [-2.4062], [-1.9922], [-1.2031], [-1.8359], [-1.4922], [-1.8281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [1.0000], [0.4668], [0.4668], [0.4668], [0.3750], [0.2500], [0.4668], [1.0000], [0.3340], [0.2500], [0.2500], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 1.1640625 loss: 1.1171875 loss: 0.8203125 loss: 1.234375 1%| | 4/492 [02:19<4:30:15, 33.23s/it] {'loss': 4.5039, 'learning_rate': 3.5436764027111585e-06, 'epoch': 0.01} 1%| | 4/492 [02:19<4:30:15, 33.23s/it]predicted value: tensor([[-0.9922], [ 0.2158], [-1.9219], [-1.2266], [-0.9883], [-0.6680], [ 0.1816], [-1.0078], [ 0.0830], [-0.3574], [-0.5742], [-1.1406], [ 0.1118], [-2.2344], [-0.7930], [-1.0156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2002], [0.2002], [0.8008], [0.8008], [0.4668], [0.7500], [0.2500], [1.0000], [0.3340], [0.3340], [0.6016], [0.2002], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.609375 loss: 0.322265625 loss: 0.482421875 loss: 0.703125 predicted value: tensor([[-0.1377], [-0.3281], [-1.0312], [-1.2109], [-1.9609], [-0.3340], [-1.3359], [-2.6562], [-0.8320], [ 0.3770], [-1.1562], [-0.5508], [-1.1953], [-1.6406], [-0.9062], [-1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.8320], [0.2500], [1.0000], [0.7500], [0.4668], [0.3340], [0.5547], [0.4668], [0.5547], [0.6680], [0.4004], [0.6016], [0.3340], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.56640625 loss: 0.7109375loss: 0.330078125 loss: 0.7578125 predicted value: tensor([[-0.4414], [-1.2500], [-0.4414], [ 0.2949], [-0.9570], [-1.3672], [-1.5547], [-1.3828], [-0.3262], [-1.4062], [-1.4375], [-1.2031], [-0.4199], [-0.8164], [-0.1670], [-0.6484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.4668], [0.4668], [0.2500], [0.3750], [0.2715], [1.0000], [0.6016], [1.0000], [0.5000], [0.6016], [0.4004], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.5234375 loss: 0.482421875 loss: 0.5625 loss: 0.55859375 predicted value: tensor([[-1.2969], [-0.4004], [ 0.5156], [-1.0547], [-1.0859], [-1.4688], [-0.0042], [-1.2109], [-1.4219], [-0.9023], [-0.9453], [-1.4453], [-0.9023], [-1.5312], [-0.6836], [-1.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.5703], [0.8008], [1.0000], [0.6680], [1.0000], [0.2500], [0.5000], [0.2500], [0.4668], [0.8008], [0.5000], [0.7500], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.5390625 loss: 0.65234375 loss: 0.78125 loss: 0.6328125 1%| | 5/492 [02:50<4:22:53, 32.39s/it] {'loss': 2.3037, 'learning_rate': 4.114080899322211e-06, 'epoch': 0.01} 1%| | 5/492 [02:50<4:22:53, 32.39s/it]predicted value: tensor([[ 0.5781], [ 1.0391], [ 0.7969], [ 0.9219], [ 1.0234], [ 1.3828], [ 0.7500], [ 1.5547], [ 0.7930], [ 1.1094], [-0.1943], [ 1.0234], [ 0.5078], [ 0.7422], [-1.1406], [ 1.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.2002], [0.8008], [1.0000], [1.0000], [0.7500], [1.0000], [1.0000], [1.0000], [0.2852], [0.8008], [0.5000], [0.3340], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.1826171875 loss: 
0.1484375 loss: 0.0810546875 loss: 0.05908203125 predicted value: tensor([[1.1094], [0.5078], [1.6953], [0.4492], [0.6328], [0.5352], [1.3672], [0.8984], [1.8672], [1.8672], [0.5703], [0.6250], [1.7344], [0.3672], [1.0156], [0.6367]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.5547], [0.4648], [0.4668], [1.0000], [0.4668], [0.4668], [1.0000], [1.0000], [1.0000], [0.3340], [0.2500], [0.5000], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.09912109375 loss: 0.1162109375loss: 0.107421875 loss: 0.0810546875 predicted value: tensor([[ 0.7148], [ 1.3438], [ 0.8594], [ 0.8672], [ 1.3594], [ 1.0000], [ 0.2578], [ 0.5820], [ 1.1250], [ 0.1689], [ 0.5430], [ 0.3887], [-0.1221], [-0.0845], [ 0.8398], [ 0.8398]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.3145], [0.6680], [1.0000], [1.0000], [0.3750], [0.7500], [0.5000], [0.7500], [0.7500], [0.2002], [0.0204], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.06640625 loss: 0.04443359375loss: 0.1259765625 loss: 0.083984375 predicted value: tensor([[0.6094], [1.1328], [0.2256], [0.6445], [0.7969], [1.3047], [1.3672], [0.1196], [1.2500], [0.5469], [1.2734], [1.3281], [0.1611], [0.8555], [0.2969], [1.8672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.3750], [0.8008], [0.4668], [0.8008], [0.7500], [0.5703], [0.6016], [0.4668], [0.6016], [0.3340], [0.2500], [0.6016], [0.4004], [0.6680], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.10107421875 loss: 0.07763671875 loss: 0.053466796875 loss: 0.126953125 1%| | 6/492 [03:21<4:18:53, 31.96s/it] {'loss': 0.3887, 'learning_rate': 4.5801353078493935e-06, 'epoch': 0.01} 1%| | 6/492 [03:21<4:18:53, 31.96s/it]predicted value: tensor([[1.8828], [1.4922], [1.8594], [0.9883], [1.9453], [0.3711], [0.9219], [1.9531], [2.1875], [1.2891], [2.4844], [1.9688], [1.2500], [1.7266], [2.2812], [1.7969]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6680], [1.0000], [0.6016], [0.4668], [0.8008], [0.4668], [0.8008], [0.3340], [1.0000], [1.0000], [0.5000], [0.2500], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.29296875 loss: 0.3125 loss: 0.365234375 loss: 0.232421875 predicted value: tensor([[2.0938], [2.0312], [1.3047], [1.5781], [1.6875], [1.6562], [1.1719], [1.5938], [2.1719], [2.2344], [1.9453], [1.9766], [2.0469], [2.2031], [1.3828], [1.4766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.0400], [0.3340], [0.6680], [1.0000], [0.6680], [1.0000], [0.6016], [0.3340], [0.6016], [1.0000], [0.1670], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.40234375 loss: 0.43359375loss: 0.369140625 loss: 0.3984375 predicted value: tensor([[0.4297], [0.9180], [1.7109], [2.1406], [1.6094], [1.4844], [1.2109], [1.5156], [1.1875], [2.1094], [1.8516], [1.6484], [2.1562], [0.0908], [0.7031], [1.0078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5547], [0.6016], [0.1670], [1.0000], [1.0000], [0.3340], [1.0000], [0.6680], [0.7500], [0.4004], [0.5000], [0.7500], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.287109375 loss: 0.240234375loss: 0.396484375 loss: 0.291015625 predicted value: tensor([[0.9688], [2.2812], [1.2344], [1.3125], [0.9375], [1.4844], [1.5781], [1.3750], [2.3281], [2.2656], [1.5781], [0.7266], [1.0625], [1.1562], [1.6250], [1.6953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.6172], [0.8320], [0.3750], [0.2002], [0.6016], [0.6680], [0.0625], [0.3340], [0.8008], [0.8008], [0.3340], [0.5000], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.52734375 loss: 0.423828125 loss: 0.34765625 loss: 0.3203125 1%|▏ | 7/492 [03:52<4:15:45, 31.64s/it] {'loss': 1.4102, 'learning_rate': 
4.9741786956652775e-06, 'epoch': 0.01} 1%|▏ | 7/492 [03:52<4:15:45, 31.64s/it]predicted value: tensor([[1.2344], [1.2891], [1.2109], [1.0625], [1.1484], [1.7422], [1.0859], [1.7734], [2.0625], [0.7422], [1.4609], [1.1641], [1.3906], [1.3438], [1.7188], [1.0469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [1.0000], [0.8320], [0.4668], [0.4668], [0.8008], [0.6016], [0.0400], [1.0000], [0.6016], [0.3340], [0.4004], [0.4004], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.2734375 loss: 0.244140625 loss: 0.201171875 loss: 0.31640625 predicted value: tensor([[0.4551], [1.5469], [2.0000], [1.3125], [1.3281], [1.4688], [1.3281], [1.2891], [1.6953], [1.0859], [1.3359], [0.9102], [0.3086], [1.5312], [0.4941], [2.6406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8555], [1.0000], [0.2500], [0.3750], [0.8008], [0.4668], [1.0000], [0.4668], [0.2002], [0.8008], [1.0000], [0.5000], [0.5000], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.296875 loss: 0.2275390625loss: 0.251953125 loss: 0.291015625 predicted value: tensor([[0.9883], [1.6250], [0.6953], [1.3906], [0.8008], [0.7539], [0.5234], [2.6094], [1.8594], [1.1875], [1.1016], [1.4844], [1.0391], [2.3281], [0.8281], [1.7266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.5547], [0.3340], [0.8320], [0.1670], [0.3340], [1.0000], [1.0000], [0.4004], [0.3750], [0.6016], [0.5000], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.201171875 loss: 0.2275390625loss: 0.2421875 loss: 0.296875 predicted value: tensor([[1.0469], [1.8750], [0.9805], [1.5078], [2.0781], [1.2578], [0.8555], [1.2578], [2.6250], [0.6133], [0.9375], [2.2656], [1.3672], [1.9531], [2.4375], [0.0635]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [0.4668], [0.3340], [1.0000], [1.0000], [0.7500], [0.5547], [0.7500], 
[0.3750], [0.6016], [0.5000], [0.3340], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.423828125 loss: 0.330078125loss: 0.2734375 loss: 0.412109375 2%|▏ | 8/492 [04:23<4:14:22, 31.53s/it] {'loss': 1.1274, 'learning_rate': 5.315514604066737e-06, 'epoch': 0.02} 2%|▏ | 8/492 [04:23<4:14:22, 31.53s/it]predicted value: tensor([[ 0.5898], [ 0.1592], [ 0.6055], [-0.3984], [ 0.4238], [ 0.3926], [ 1.3516], [ 1.2969], [ 0.0120], [ 0.3418], [ 0.9297], [ 0.3184], [ 0.5625], [ 0.7422], [ 0.7109], [ 0.8633]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.3750], [0.3750], [0.3750], [0.4277], [0.6680], [0.8320], [0.6016], [0.6016], [0.5000], [0.4004], [0.4004], [0.4004], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.08740234375 loss: 0.045654296875loss: 0.10595703125 loss: 0.1328125 predicted value: tensor([[ 0.3652], [ 0.8984], [ 0.6562], [ 1.8984], [ 0.4492], [ 0.9023], [ 1.0000], [ 1.3125], [ 1.1953], [ 0.2021], [-0.0845], [ 1.4297], [ 0.8438], [ 1.3047], [ 0.9648], [ 0.8633]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8008], [0.4668], [0.2500], [0.4668], [0.6016], [0.3750], [1.0000], [0.8008], [0.4004], [0.5000], [0.6016], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.052001953125 loss: 0.11328125loss: 0.10595703125 loss: 0.1328125 predicted value: tensor([[ 1.2422], [-0.0059], [ 0.6172], [ 1.2578], [ 0.4648], [ 0.4453], [ 0.9648], [ 1.3516], [ 0.7422], [ 0.4668], [ 0.6953], [ 0.1226], [ 0.8555], [ 0.9531], [ 0.8242], [ 0.5156]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [0.4668], [0.2002], [0.4668], [0.8008], [0.7500], [1.0000], [0.3750], [0.4668], [0.2500], [0.4004], [0.7500], [0.3340], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0986328125 loss: 0.05078125loss: 0.15234375 loss: 0.1611328125 predicted value: 
tensor([[1.4453], [0.9258], [0.9219], [1.1641], [1.7578], [1.3984], [1.9219], [1.3359], [0.3281], [1.2500], [0.5938], [0.8086], [0.0830], [0.9570], [0.9141], [0.0991]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6680], [0.3750], [1.0000], [1.0000], [0.4668], [1.0000], [0.6016], [1.0000], [0.2500], [0.2500], [0.4004], [0.1670], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.1044921875 loss: 0.126953125 loss: 0.1318359375 loss: 0.1005859375 2%|▏ | 9/492 [04:55<4:13:18, 31.47s/it] {'loss': 0.4257, 'learning_rate': 5.61659421298763e-06, 'epoch': 0.02} 2%|▏ | 9/492 [04:55<4:13:18, 31.47s/it]predicted value: tensor([[ 1.1797], [ 0.5273], [-0.0583], [-0.2480], [ 0.2246], [-0.3008], [ 0.1992], [-0.7578], [-0.0811], [ 0.0571], [-0.9023], [-0.2773], [-0.6445], [-0.2598], [-0.6094], [-1.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.7500], [0.2002], [1.0000], [0.2500], [1.0000], [0.4668], [0.2715], [0.7500], [0.2002], [0.6016], [0.3340], [0.4004], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.27734375 loss: 0.2001953125loss: 0.1669921875 loss: 0.251953125 predicted value: tensor([[ 0.2139], [ 0.0474], [-0.5156], [-0.1040], [ 0.5078], [ 0.1221], [ 0.3047], [-0.2061], [ 0.3809], [-0.3672], [-1.1953], [-0.5195], [-0.9180], [-0.4531], [-0.6133], [-0.7930]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.8320], [0.7500], [0.8008], [0.8008], [0.8008], [1.0000], [0.6680], [0.5000], [0.4668], [0.6016], [0.6016], [0.3340], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.1982421875 loss: 0.22265625loss: 0.189453125 loss: 0.1611328125 predicted value: tensor([[-0.7070], [ 0.3594], [-0.3145], [-0.5703], [-0.6719], [-0.8672], [ 0.1992], [-1.2969], [-0.0013], [ 0.3770], [ 0.0928], [ 0.0134], [-0.9648], [-0.4180], [-0.6172], [-0.3418]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [0.8008], [0.2500], [0.2500], [0.8008], [0.7500], [0.7500], [0.8320], [0.4004], [0.2500], [0.5000], [0.2852], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.26171875 loss: 0.2451171875loss: 0.1669921875 loss: 0.166015625 predicted value: tensor([[-0.2471], [-0.4688], [-0.2754], [-0.3945], [-0.9062], [-0.3730], [-0.2227], [-0.3555], [-0.8555], [ 0.3223], [-0.2090], [-0.4609], [-1.0859], [-1.2031], [-0.1504], [-0.5039]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.2500], [0.6680], [0.4668], [0.8320], [0.8008], [1.0000], [0.3340], [0.3750], [0.6016], [0.3340], [0.5000], [0.4004], [0.2002], [0.7500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.267578125loss: 0.1669921875 loss: 0.263671875 loss: 0.2490234375 2%|▏ | 10/492 [05:26<4:11:54, 31.36s/it] {'loss': 0.8638, 'learning_rate': 5.885919100677791e-06, 'epoch': 0.02} 2%|▏ | 10/492 [05:26<4:11:54, 31.36s/it]predicted value: tensor([[-0.6406], [-0.0811], [-1.1875], [-0.0420], [-0.7109], [-0.6016], [-0.4336], [-0.8086], [-0.4375], [-0.5039], [-0.8633], [ 0.0503], [-1.7891], [-1.2344], [-1.5781], [-1.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8984], [0.8555], [0.4668], [1.0000], [0.4668], [0.4668], [1.0000], [0.6016], [0.6016], [1.0000], [0.7500], [0.3340], [0.1670], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.4921875 loss: 0.6484375 loss: 0.52734375 loss: 0.478515625 predicted value: tensor([[-0.2490], [-0.5586], [-1.3828], [-0.6172], [ 0.0894], [-0.6562], [-1.2500], [-1.2344], [-0.9609], [-1.0625], [-0.6328], [-1.9609], [-0.9336], [-0.8906], [-1.1484], [-1.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.5547], [0.4668], [1.0000], [0.2500], [0.4668], [0.8008], [0.6016], [0.2500], [0.2500], [0.2002], [0.3340], [0.4004], [0.2500]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.515625 loss: 0.5390625 loss: 0.51171875 loss: 0.6484375 predicted value: tensor([[-0.6250], [-0.9531], [-0.5820], [-0.6484], [-0.8008], [-1.1016], [-0.3887], [-0.9961], [-0.4277], [ 0.1689], [-0.7617], [-0.5195], [-1.3906], [-0.4355], [-0.6094], [-1.6328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5703], [1.0000], [1.0000], [0.8008], [0.5703], [0.4668], [0.5000], [0.6016], [1.0000], [0.2500], [0.3340], [0.5000], [0.4004], [0.5000], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.46875 loss: 0.46875loss: 0.3828125 loss: 0.41015625 predicted value: tensor([[-0.3672], [-1.3594], [-1.1484], [-0.5547], [ 0.9531], [-1.1016], [-1.2422], [-1.3828], [-1.3828], [-1.0781], [-0.5586], [-1.0312], [-0.4141], [-0.8242], [-1.9219], [-0.9727]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.4668], [0.4668], [0.8320], [1.0000], [0.4668], [0.6016], [0.7148], [0.6016], [0.5000], [0.3340], [0.4004], [0.4004], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.5234375 loss: 0.3984375 loss: 0.51953125 loss: 0.5703125 2%|▏ | 11/492 [05:58<4:13:07, 31.57s/it] {'loss': 2.0259, 'learning_rate': 6.1295530968789295e-06, 'epoch': 0.02} 2%|▏ | 11/492 [05:58<4:13:07, 31.57s/it]predicted value: tensor([[-0.8945], [-0.3828], [-0.9297], [-0.1660], [-0.7969], [-1.1094], [-0.0101], [-1.3125], [-0.0344], [-0.5898], [ 0.0217], [-0.2539], [-0.2275], [-0.6680], [-0.9453], [-1.5391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.3340], [0.6680], [0.8008], [0.8008], [0.2500], [0.4004], [0.2002], [0.7500], [0.3340], [0.5000], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.34375 loss: 0.36328125loss: 0.212890625 loss: 0.55078125 predicted value: tensor([[ 0.2988], [ 0.1348], [ 0.0530], [-0.6133], [-0.3555], [-0.2617], [-0.1289], [-0.8750], [-1.3516], [-0.9531], [-0.9609], 
[-0.4375], [-0.1836], [-0.1631], [-1.0469], [-0.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.2500], [0.8008], [1.0000], [0.4668], [0.8008], [0.4668], [0.3750], [0.7500], [0.2500], [0.4668], [0.4004], [0.2002], [0.4004], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.38671875 loss: 0.3125loss: 0.484375 loss: 0.345703125 predicted value: tensor([[ 0.6367], [ 0.6484], [-0.6875], [-0.4492], [-0.9180], [ 0.1245], [-1.3047], [-0.0688], [-0.3730], [ 0.1367], [-0.9961], [ 0.0219], [ 0.1021], [-0.6484], [-0.6562], [-1.0078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.4668], [0.8008], [0.8320], [0.4668], [1.0000], [1.0000], [1.0000], [0.6016], [1.0000], [1.0000], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.244140625 loss: 0.318359375 loss: 0.38671875 loss: 0.26171875 predicted value: tensor([[-0.4531], [ 0.0522], [-0.1670], [ 0.5742], [-1.0781], [ 0.0050], [-0.2217], [-0.4746], [-0.4062], [-0.9062], [ 0.5078], [-0.5781], [-0.4336], [-0.6875], [-0.6719], [-1.6953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.4668], [1.0000], [0.5547], [1.0000], [1.0000], [0.3750], [0.5000], [0.2500], [0.6016], [0.4004], [0.7500], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.26171875 loss: 0.357421875 loss: 0.365234375 loss: 0.306640625 2%|▏ | 12/492 [06:29<4:12:00, 31.50s/it] {'loss': 1.3755, 'learning_rate': 6.3519735092049725e-06, 'epoch': 0.02} 2%|▏ | 12/492 [06:29<4:12:00, 31.50s/it]predicted value: tensor([[ 0.6719], [ 1.6094], [-0.0957], [ 0.5273], [ 0.0781], [ 0.1235], [ 0.7617], [ 0.9062], [ 0.8867], [-0.2324], [ 0.0311], [-0.1416], [ 0.5938], [-0.3066], [-0.6875], [-0.0374]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5195], [1.0000], [0.4668], [0.4668], [0.4668], [0.2500], [1.0000], [1.0000], [1.0000], [0.4668], 
[0.4004], [0.4004], [0.6680], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0712890625 loss: 0.05078125 loss: 0.10400390625 loss: 0.08984375 predicted value: tensor([[ 0.9297], [ 0.2793], [ 0.5391], [ 0.3652], [ 0.0762], [ 1.1953], [ 0.8633], [ 0.4492], [ 0.1064], [ 0.1289], [ 0.2334], [ 0.0525], [-0.1494], [ 0.3887], [-0.8320], [-0.9141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.4668], [0.8320], [0.3340], [0.4668], [0.2002], [0.2500], [1.0000], [0.2500], [0.3340], [0.5000], [0.5000], [0.5000], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.048095703125 loss: 0.09423828125loss: 0.072265625 loss: 0.1279296875 predicted value: tensor([[ 0.4336], [ 0.1641], [ 0.2363], [ 0.9141], [-0.0047], [-0.2266], [ 0.3535], [ 0.7891], [-0.0508], [-0.7344], [ 0.6523], [ 0.4746], [-0.0732], [-0.0786], [ 0.1846], [-0.8320]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.3750], [0.3750], [0.8320], [0.4668], [0.4668], [1.0000], [0.5000], [0.6680], [0.5000], [0.2500], [0.2002], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.09326171875 loss: 0.0830078125loss: 0.056884765625 loss: 0.07861328125 predicted value: tensor([[ 0.7930], [ 0.6367], [ 0.9648], [ 0.7266], [ 0.4629], [ 0.6836], [ 0.2852], [ 0.1865], [ 0.9297], [ 0.7148], [-0.5391], [-0.1357], [ 0.4551], [ 0.1387], [-0.1025], [-0.8438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [0.6680], [1.0000], [0.3145], [0.6016], [0.3340], [0.6016], [0.7500], [0.2500], [0.4004], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0908203125 loss: 0.046875 loss: 0.052001953125 loss: 0.06201171875 3%|▎ | 13/492 [07:01<4:11:56, 31.56s/it] {'loss': 0.3055, 'learning_rate': 6.55658045441586e-06, 'epoch': 0.03} 3%|▎ | 13/492 [07:01<4:11:56, 31.56s/it]predicted value: 
tensor([[1.6953], [1.3828], [2.0469], [1.8672], [0.7539], [1.5234], [1.2422], [1.5312], [1.4609], [2.1562], [1.8125], [1.6953], [1.4609], [1.4609], [1.7656], [1.3203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [1.0000], [0.6680], [0.5547], [0.6250], [0.4668], [0.8320], [1.0000], [0.5000], [0.5000], [0.0400], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.251953125 loss: 0.298828125loss: 0.41796875 loss: 0.26953125 predicted value: tensor([[1.8672], [1.4531], [0.9922], [1.4609], [0.8594], [1.4453], [1.5234], [1.6328], [1.8516], [1.1328], [1.2734], [1.7969], [0.7227], [1.1875], [1.2891], [1.0703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [1.0000], [0.3750], [0.3340], [0.6016], [0.2002], [0.5547], [0.5703], [0.2500], [0.6016], [0.4004], [0.2500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.26953125 loss: 0.3046875 loss: 0.2080078125 loss: 0.390625 predicted value: tensor([[1.5781], [1.5625], [2.5312], [1.7422], [1.6641], [2.1094], [1.9297], [0.8359], [1.5859], [1.1953], [1.6875], [1.3984], [2.3125], [0.7305], [0.9297], [1.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5547], [1.0000], [1.0000], [0.5547], [0.6680], [0.3750], [0.7500], [0.6016], [0.8008], [1.0000], [0.4004], [1.0000], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.2333984375 loss: 0.36328125loss: 0.267578125 loss: 0.359375 predicted value: tensor([[1.3125], [1.9219], [1.2578], [1.5156], [1.3203], [1.1484], [1.2578], [0.9883], [1.9922], [1.8047], [1.3750], [1.4766], [1.4766], [1.7344], [1.2812], [1.8750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.2500], [0.2500], [0.4668], [0.8008], [0.3750], [0.6016], [0.8008], [1.0000], [0.7500], [0.4004], [0.4004], [0.5000], [0.4004], [0.1670]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.2890625 loss: 0.26953125 loss: 0.26171875 loss: 0.357421875 3%|▎ | 14/492 [07:33<4:11:39, 31.59s/it] {'loss': 1.2031, 'learning_rate': 6.7460168970208566e-06, 'epoch': 0.03} 3%|▎ | 14/492 [07:33<4:11:39, 31.59s/it]predicted value: tensor([[1.8906], [1.7812], [1.6719], [2.6250], [1.9297], [1.8594], [1.7109], [2.1406], [1.6250], [1.9609], [2.4688], [2.1094], [1.5156], [2.5469], [1.9922], [1.6562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.8008], [1.0000], [1.0000], [0.3340], [0.3340], [0.6016], [0.6016], [0.5000], [0.8008], [0.8008], [0.4004], [0.2852], [0.4004], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.8203125 loss: 0.53515625loss: 0.51171875 loss: 0.6640625 predicted value: tensor([[2.2344], [2.1406], [2.2812], [2.9688], [1.6172], [1.7969], [2.4219], [1.9609], [2.1406], [2.0469], [3.3750], [2.5312], [1.7422], [1.5703], [2.2344], [1.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [1.0000], [0.2500], [0.8008], [0.8008], [0.2500], [1.0000], [0.7500], [1.0000], [0.3340], [0.8008], [0.3340], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.6640625 loss: 0.65234375loss: 0.515625 loss: 0.5859375 predicted value: tensor([[1.8203], [1.8281], [2.2812], [2.5938], [2.6094], [2.6562], [2.0000], [2.3750], [2.6875], [1.7266], [1.8281], [1.3438], [2.2031], [1.7578], [1.8359], [1.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.5547], [0.3340], [1.0000], [0.4668], [1.0000], [0.6016], [0.8008], [0.2500], [0.4004], [0.5000], [0.6016], [0.6016], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.59375 loss: 0.61328125loss: 0.61328125 loss: 0.59375 predicted value: tensor([[2.9844], [2.0625], [2.2812], [1.8750], [1.7266], [2.0781], [2.0312], [3.0469], [1.0781], [2.1250], [2.3594], [1.5625], [2.9219], [2.1875], [1.8438], [2.0781]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [0.3750], [0.8008], [1.0000], [0.6016], [1.0000], [0.6016], [0.6016], [0.8008], [0.3340], [0.3340], [0.3340], [0.1426], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.63671875 loss: 0.4375 loss: 0.6484375loss: 0.62109375 3%|▎ | 15/492 [08:04<4:09:50, 31.43s/it] {'loss': 2.4268, 'learning_rate': 6.922378005816025e-06, 'epoch': 0.03} 3%|▎ | 15/492 [08:04<4:09:50, 31.43s/it]predicted value: tensor([[1.5234], [1.8828], [1.7266], [2.6250], [1.3594], [2.6250], [2.0156], [1.7812], [2.4844], [1.8516], [1.8984], [2.4844], [1.1953], [1.9609], [1.6094], [1.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.4668], [1.0000], [0.6016], [1.0000], [0.6016], [0.7500], [0.2500], [1.0000], [0.5000], [1.0000], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.474609375 loss: 0.45703125 loss: 0.46875 loss: 0.5234375 predicted value: tensor([[1.9219], [1.6172], [2.2344], [2.7812], [2.1094], [1.1562], [2.6250], [2.2031], [2.1562], [1.7188], [2.1719], [1.7969], [2.0625], [1.9141], [2.3906], [1.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.2500], [0.4668], [0.5547], [1.0000], [0.8008], [0.8008], [0.6016], [0.6016], [0.5000], [0.3340], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.42578125 loss: 0.39453125 loss: 0.5703125 loss: 0.46875 predicted value: tensor([[1.5781], [1.6562], [1.3906], [1.9453], [2.2969], [1.9844], [1.7422], [1.4453], [2.0469], [1.6328], [2.1719], [1.5391], [1.8047], [1.1719], [1.9453], [2.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.5547], [0.4668], [0.8008], [0.3145], [0.2500], [1.0000], [0.4668], [1.0000], [0.2500], [0.3340], [0.4004], [0.0625], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.39453125 
loss: 0.494140625loss: 0.4765625 loss: 0.55078125 predicted value: tensor([[1.9375], [2.0469], [1.4609], [1.6797], [1.3594], [1.3906], [2.1719], [1.4766], [1.5469], [2.2500], [1.5547], [1.7734], [1.4922], [1.7500], [1.5156], [2.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.5547], [0.3750], [0.4668], [0.3750], [1.0000], [0.2500], [0.2500], [0.7500], [0.3340], [0.3340], [0.5000], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.50390625 loss: 0.4296875 loss: 0.43359375 loss: 0.453125 3%|▎ | 16/492 [08:35<4:08:30, 31.32s/it] {'loss': 1.8799, 'learning_rate': 7.087352805422317e-06, 'epoch': 0.03} 3%|▎ | 16/492 [08:35<4:08:30, 31.32s/it]predicted value: tensor([[0.5312], [1.0078], [1.0938], [1.6562], [1.4844], [0.6836], [1.3906], [0.5625], [1.4062], [0.6914], [0.8984], [0.7695], [0.8203], [0.7891], [0.9961], [1.1094]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8672], [0.8008], [1.0000], [1.0000], [1.0000], [0.4668], [0.8008], [0.2500], [1.0000], [0.2002], [0.3340], [0.0625], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.1298828125 loss: 0.07080078125 loss: 0.115234375 loss: 0.11376953125 predicted value: tensor([[0.9844], [0.9883], [1.0703], [0.9336], [0.9609], [0.7148], [0.6875], [0.6445], [1.0547], [1.3828], [1.4453], [1.2031], [0.6484], [0.7695], [1.8516], [0.8789]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [1.0000], [0.2500], [1.0000], [0.3340], [0.3750], [0.2500], [1.0000], [0.7500], [0.3145], [0.8008], [0.3340], [0.2002], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.1416015625 loss: 0.1044921875loss: 0.0869140625 loss: 0.103515625 predicted value: tensor([[1.0547], [0.8555], [0.7305], [0.9688], [0.0547], [0.7891], [0.5117], [0.5742], [1.1016], [0.7109], [0.5820], [1.7578], [0.8477], [0.4395], [0.7070], [0.7617]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.4668], [1.0000], [0.2500], [0.3750], [0.2500], [0.6680], [0.7500], [0.4004], [0.3340], [0.4004], [0.4004], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.1279296875 loss: 0.059326171875loss: 0.07177734375 loss: 0.09033203125 predicted value: tensor([[0.5312], [1.4375], [1.0000], [0.6445], [1.0391], [0.7812], [0.3711], [0.8867], [0.9258], [1.4844], [0.8164], [0.6289], [1.0156], [0.4375], [1.5703], [0.8203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.3750], [0.8320], [0.6680], [0.4668], [0.2715], [0.6016], [0.2500], [0.8008], [0.8320], [0.5000], [0.7500], [0.3340], [0.2002], [0.1426], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.06298828125 loss: 0.07763671875 loss: 0.08984375 loss: 0.10107421875 3%|▎ | 17/492 [09:07<4:09:12, 31.48s/it] {'loss': 0.3868, 'learning_rate': 7.242322808748768e-06, 'epoch': 0.03} 3%|▎ | 17/492 [09:07<4:09:12, 31.48s/it]predicted value: tensor([[-0.1875], [-0.1279], [-0.1328], [-0.7930], [-0.1650], [ 0.0383], [ 0.6367], [-0.3672], [-0.4883], [-0.4629], [-0.6758], [-0.7500], [-0.4844], [-0.1992], [-0.3379], [ 0.3535]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], [0.4668], [0.3750], [1.0000], [1.0000], [0.6016], [0.5000], [0.6016], [0.3340], [0.2500], [0.4004], [0.2002], [0.1670], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.1826171875 loss: 0.162109375 loss: 0.13671875 loss: 0.1396484375 predicted value: tensor([[ 0.0126], [ 0.0182], [-0.1670], [-0.3027], [-0.6836], [-0.2041], [ 0.1152], [ 0.2832], [-0.3906], [-0.1328], [-0.4023], [ 0.2275], [-0.3691], [-0.0306], [-0.3828], [ 0.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [0.3750], [0.3750], [0.4668], [0.7500], [0.6016], [1.0000], [0.7500], [0.5000], [0.7500], [1.0000], [0.2500], [0.2500], [0.1670], [0.2500]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.1552734375 loss: 0.138671875 loss: 0.173828125 loss: 0.2314453125 predicted value: tensor([[-0.4629], [ 0.0703], [-0.4277], [-0.3145], [-0.1875], [-0.6172], [ 0.3320], [-0.3340], [-0.2031], [ 0.1021], [-0.3496], [-0.5391], [-0.5977], [-0.0339], [-0.5742], [ 0.9375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.8320], [0.4668], [0.5547], [0.2002], [0.5703], [0.7500], [1.0000], [0.4668], [0.3340], [0.3340], [0.4004], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.1533203125 loss: 0.166015625loss: 0.169921875 loss: 0.13671875 predicted value: tensor([[-0.4805], [-0.2617], [ 0.2637], [ 0.3242], [-0.0957], [-0.0991], [ 0.1377], [-0.3301], [-0.4023], [-0.2656], [-0.0100], [-0.2500], [-0.0588], [-0.0562], [-0.3008], [ 0.0137]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.8008], [1.0000], [1.0000], [0.6680], [1.0000], [0.6680], [0.4668], [0.2500], [0.3340], [0.4668], [0.3340], [0.0400], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.1513671875 loss: 0.158203125 loss: 0.1318359375 loss: 0.123046875 4%|▎ | 18/492 [09:38<4:08:21, 31.44s/it] {'loss': 0.6277, 'learning_rate': 7.388432414343207e-06, 'epoch': 0.04} 4%|▎ | 18/492 [09:38<4:08:21, 31.44s/it]predicted value: tensor([[-0.6094], [-0.7969], [-0.4766], [ 0.1914], [-0.7852], [ 0.0233], [-0.6914], [-0.3555], [-0.7930], [-1.3438], [-0.5312], [-0.5391], [-0.6211], [-0.8594], [-0.6289], [-0.5547]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.4668], [1.0000], [0.3340], [1.0000], [0.2500], [0.4668], [0.7500], [0.6016], [0.7500], [0.2500], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.255859375 loss: 0.32421875 loss: 0.287109375 loss: 0.2431640625 predicted value: tensor([[-0.0703], [-0.5430], [-0.6562], [-0.5352], [-0.6484], [-0.7656], 
[-0.7539], [-0.5508], [-0.6680], [-0.5000], [-0.5273], [-0.8047], [-0.4414], [-0.3574], [-0.4746], [-1.1484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.5547], [0.5000], [0.8008], [0.4668], [0.8008], [0.3340], [0.2500], [0.6016], [0.8008], [0.2500], [0.5000], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.296875 loss: 0.318359375loss: 0.1962890625 loss: 0.2890625 predicted value: tensor([[-0.3887], [-0.7930], [-0.1787], [-0.4746], [-0.5430], [ 0.0151], [-0.6055], [-0.6836], [-0.7500], [-0.6719], [-0.0830], [-0.8047], [-0.4062], [-0.6562], [-0.4746], [-0.7812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.8008], [1.0000], [0.5547], [0.7500], [1.0000], [1.0000], [0.5547], [0.4668], [0.7500], [0.8008], [0.6016], [1.0000], [0.2500], [0.2500], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.271484375 loss: 0.3671875 loss: 0.380859375 loss: 0.3203125 predicted value: tensor([[-0.5391], [-0.4980], [-0.4941], [-0.5039], [-0.7500], [-0.8945], [-0.3965], [-0.5781], [-0.0923], [ 0.0068], [-0.2715], [-0.4688], [-0.7227], [-0.1611], [-0.6289], [-0.9336]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.8008], [0.3750], [0.8555], [0.4668], [0.8008], [0.3750], [1.0000], [1.0000], [0.7500], [0.7500], [0.4004], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.328125 loss: 0.30078125 loss: 0.3046875 loss: 0.2490234375 4%|▍ | 19/492 [10:10<4:08:06, 31.47s/it] {'loss': 1.1833, 'learning_rate': 7.52664024490876e-06, 'epoch': 0.04} 4%|▍ | 19/492 [10:10<4:08:06, 31.47s/it]predicted value: tensor([[-0.4844], [ 0.1245], [-0.2422], [-0.5703], [-0.1377], [ 0.1416], [-0.2139], [ 0.2363], [ 0.5352], [ 0.2090], [ 0.2383], [ 0.0786], [-0.1836], [-0.3926], [-0.1357], [-0.4414]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.6016], [0.6680], [0.2002], 
[0.5000], [0.2500], [0.8008], [1.0000], [0.8008], [0.8008], [0.2002], [0.5000], [0.0400], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0625 loss: 0.10546875 loss: 0.10205078125 loss: 0.1162109375 predicted value: tensor([[-0.0532], [-0.3516], [ 0.0620], [-0.4961], [ 0.3477], [ 0.0840], [ 0.3125], [-0.0957], [-0.0820], [ 0.1172], [ 0.9180], [-0.6445], [ 0.0962], [-0.3867], [-0.3555], [ 0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.5547], [0.2500], [1.0000], [0.8320], [1.0000], [0.6680], [0.0400], [0.3340], [1.0000], [0.6016], [1.0000], [0.3340], [0.1670], [0.0400]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0888671875 loss: 0.10986328125loss: 0.126953125 loss: 0.103515625 predicted value: tensor([[-0.2227], [ 0.2451], [-0.2676], [-0.0493], [ 0.1621], [-0.1079], [ 0.1611], [-0.4531], [-0.1797], [-0.1865], [ 0.5039], [-0.0070], [-0.2930], [ 0.0625], [-0.2236], [ 0.0011]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [1.0000], [0.3750], [0.4277], [1.0000], [0.6016], [0.3340], [0.3340], [1.0000], [0.7500], [0.2002], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.115234375 loss: 0.10009765625loss: 0.125 loss: 0.0712890625 predicted value: tensor([[-0.0352], [-0.0184], [ 0.2354], [ 0.6172], [-0.5508], [ 0.2676], [-0.1299], [ 0.1455], [-0.2139], [ 0.1650], [-0.1758], [ 0.3262], [ 0.1807], [-0.3672], [-0.0811], [-0.5195]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [1.0000], [0.4668], [0.8008], [0.7500], [0.8008], [0.3750], [0.5000], [0.2500], [0.4004], [0.5000], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.125 loss: 0.1103515625 loss: 0.08349609375loss: 0.138671875 4%|▍ | 20/492 [10:40<4:06:23, 31.32s/it] {'loss': 0.4211, 'learning_rate': 7.657757302033369e-06, 'epoch': 0.04} 4%|▍ | 20/492 [10:40<4:06:23, 
31.32s/it]predicted value: tensor([[1.1016], [1.8203], [0.9883], [1.3828], [1.1406], [1.0391], [0.4238], [0.9805], [0.9258], [0.9922], [1.7422], [0.8594], [0.7656], [0.9102], [0.5156], [0.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.2500], [0.3750], [1.0000], [0.5703], [0.4668], [0.2500], [0.8008], [0.4668], [1.0000], [0.4004], [0.5000], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.10986328125 loss: 0.0673828125 loss: 0.07568359375 loss: 0.10791015625 predicted value: tensor([[1.0625], [0.6250], [0.3555], [1.2578], [1.6797], [0.8828], [1.5625], [0.7500], [0.9297], [0.8047], [1.2891], [1.0391], [0.9258], [1.5000], [0.9453], [1.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.4668], [0.2002], [1.0000], [1.0000], [0.4668], [1.0000], [0.5000], [0.2500], [0.3340], [1.0000], [0.4004], [0.6016], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.045166015625 loss: 0.0888671875 loss: 0.08447265625 loss: 0.0712890625 predicted value: tensor([[0.5195], [0.8711], [1.6406], [0.5859], [0.8867], [0.6250], [0.5664], [0.7266], [1.4844], [1.2344], [1.0703], [0.6367], [0.2793], [0.9492], [0.5938], [0.4766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.8008], [1.0000], [0.4668], [0.4668], [0.3340], [0.4668], [0.4668], [1.0000], [0.2500], [0.4668], [0.0625], [0.5000], [0.6016], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.09521484375 loss: 0.049072265625 loss: 0.0712890625 loss: 0.06591796875 predicted value: tensor([[1.9141], [1.0078], [0.7383], [1.1250], [0.4902], [0.7070], [0.8945], [1.2656], [0.7500], [1.1484], [0.8203], [0.9492], [0.8320], [0.6367], [0.6250], [0.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.4668], [0.3750], [0.4668], [0.3340], [0.8008], [0.3750], [0.4668], [0.5000], [0.2500], [0.5000], 
[0.5000], [0.2500], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.050537109375 loss: 0.060791015625 loss: 0.1005859375 loss: 0.09619140625 4%|▍ | 21/492 [11:12<4:06:32, 31.41s/it] {'loss': 0.3101, 'learning_rate': 7.782475802159092e-06, 'epoch': 0.04} 4%|▍ | 21/492 [11:12<4:06:32, 31.41s/it]predicted value: tensor([[1.2500], [1.0391], [1.2344], [0.8164], [1.9062], [1.2812], [1.5547], [0.6992], [0.7383], [1.0781], [0.7422], [1.0469], [1.3750], [1.3125], [0.6016], [0.5039]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.6680], [0.5547], [1.0000], [0.4668], [1.0000], [0.3750], [0.3340], [0.3340], [0.4004], [0.4004], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.060302734375 loss: 0.0966796875 loss: 0.061767578125 loss: 0.0869140625 predicted value: tensor([[1.0078], [1.3516], [1.2656], [1.6250], [1.3750], [1.9453], [1.3203], [1.0859], [0.9141], [1.2656], [0.8281], [0.9453], [0.9766], [0.6484], [0.9336], [0.8828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.6680], [0.4668], [1.0000], [1.0000], [1.0000], [0.8008], [0.5000], [0.7500], [0.6016], [0.6016], [0.4004], [0.6016], [0.4004], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0654296875 loss: 0.08251953125 loss: 0.078125 loss: 0.06640625 predicted value: tensor([[0.9805], [1.6875], [1.4766], [1.2188], [1.4531], [0.7539], [1.2891], [0.6445], [1.2578], [1.0547], [1.1875], [1.2188], [1.3125], [1.1641], [0.9297], [1.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8008], [0.3750], [1.0000], [0.8008], [0.7500], [0.8008], [0.6016], [0.6016], [0.6016], [1.0000], [0.3340], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.06591796875 loss: 0.107421875 loss: 0.0634765625 loss: 0.0732421875 predicted value: tensor([[1.3281], [1.6875], [1.3125], [0.8438], [1.2188], [1.0547], 
[0.9375], [0.9688], [0.5117], [0.9297], [0.7734], [1.0156], [0.4316], [0.4121], [0.2637], [0.5273]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.4668], [1.0000], [0.2500], [0.4668], [0.5547], [0.2500], [0.4668], [0.5000], [0.5547], [0.2852], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.087890625 loss: 0.05859375 loss: 0.08447265625 loss: 0.0654296875 4%|▍ | 22/492 [11:43<4:05:27, 31.34s/it] {'loss': 0.3011, 'learning_rate': 7.90139129823451e-06, 'epoch': 0.04} 4%|▍ | 22/492 [11:43<4:05:27, 31.34s/it]predicted value: tensor([[ 0.5430], [ 0.5938], [ 0.4316], [ 0.6914], [ 0.3867], [ 0.5234], [ 0.2236], [ 0.3867], [ 0.2275], [ 1.1719], [ 0.4961], [ 0.3008], [ 0.0459], [ 0.3555], [-0.0156], [ 0.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.2002], [0.5547], [0.2002], [0.5000], [0.6016], [0.4668], [0.4004], [1.0000], [0.6016], [0.0625], [0.3340], [0.2002], [0.1426], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0142822265625 loss: 0.0189208984375 loss: 0.00885009765625 loss: 0.020751953125 predicted value: tensor([[ 0.6641], [ 0.3301], [ 0.9180], [ 0.9102], [ 0.7812], [ 0.5352], [ 0.6719], [ 0.4258], [ 0.3398], [ 0.0437], [ 0.4160], [ 0.6211], [-0.0615], [ 0.2275], [ 0.6094], [ 0.3574]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6172], [1.0000], [0.8008], [1.0000], [0.2500], [1.0000], [0.6016], [0.3340], [0.6016], [0.2500], [0.3340], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.03076171875 loss: 0.019775390625loss: 0.0169677734375 loss: 0.012939453125 predicted value: tensor([[0.7930], [0.1396], [0.6914], [0.2578], [1.1953], [0.2793], [0.6094], [0.9609], [0.4414], [0.5352], [0.4727], [0.6133], [0.4668], [0.1201], [0.2246], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [0.4668], [0.5547], 
[0.8008], [1.0000], [0.8008], [0.4668], [1.0000], [1.0000], [0.6680], [0.3340], [0.6016], [0.5000], [0.2500], [0.0625], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006439208984375 loss: 0.01806640625loss: 0.01123046875 loss: 0.01708984375 predicted value: tensor([[ 0.8633], [ 0.4922], [ 0.5195], [ 0.7656], [ 0.3223], [ 0.4961], [ 0.5820], [ 0.2217], [ 0.5312], [-0.0972], [ 0.6094], [ 0.5781], [ 0.0554], [ 0.4609], [ 0.2930], [ 0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.2500], [1.0000], [0.8320], [0.8320], [0.2500], [0.2500], [0.6016], [0.4004], [0.8008], [0.5000], [0.4004], [0.3340], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01611328125 loss: 0.0174560546875 loss: 0.016845703125 loss: 0.0196533203125 5%|▍ | 23/492 [12:14<4:04:24, 31.27s/it] {'loss': 0.0665, 'learning_rate': 8.015019879940584e-06, 'epoch': 0.05} 5%|▍ | 23/492 [12:14<4:04:24, 31.27s/it]predicted value: tensor([[0.6602], [0.5625], [1.1172], [1.3594], [0.6211], [0.4609], [0.8555], [0.3477], [0.5664], [0.5000], [0.5625], [0.4961], [0.3047], [0.3965], [0.2109], [0.0688]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3750], [1.0000], [1.0000], [0.4668], [0.8008], [0.7500], [0.7500], [0.6016], [0.7500], [0.2500], [0.5000], [0.4004], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01312255859375 loss: 0.01165771484375 loss: 0.006134033203125 loss: 0.019775390625 predicted value: tensor([[ 0.3223], [ 0.6602], [ 0.5820], [ 0.8789], [ 0.4375], [ 0.6289], [ 1.0234], [ 0.3906], [ 0.4668], [ 0.4258], [ 0.7969], [ 0.5938], [ 0.2490], [ 0.3066], [-0.0486], [-0.0535]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.6680], [0.4668], [1.0000], [0.3750], [0.6016], [1.0000], [0.8008], [0.3340], [0.6016], [1.0000], [0.3340], [0.2002], [0.2002], [0.1426], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.0164794921875 loss: 0.007232666015625 loss: 0.01129150390625 loss: 0.01385498046875 predicted value: tensor([[0.6016], [0.5859], [1.1641], [0.6680], [0.5234], [0.7695], [0.6094], [0.5898], [0.4707], [0.1826], [0.2490], [0.7930], [0.7422], [0.2295], [0.1592], [0.3105]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [1.0000], [0.8008], [0.8008], [0.8008], [1.0000], [0.3750], [0.2500], [0.6016], [0.8008], [1.0000], [0.2500], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01190185546875 loss: 0.018798828125 loss: 0.01312255859375 loss: 0.01409912109375 predicted value: tensor([[ 0.4590], [ 0.7578], [ 0.3828], [ 1.2344], [ 0.3672], [ 0.3359], [ 0.4590], [ 0.7500], [ 0.3516], [ 0.2812], [ 0.5469], [ 0.3125], [ 0.1787], [ 0.5430], [ 0.2930], [-0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.2500], [1.0000], [0.4668], [0.4668], [0.3750], [0.7500], [0.8008], [0.4004], [0.6016], [0.1670], [0.4004], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006744384765625 loss: 0.0128173828125 loss: 0.006195068359375 loss: 0.0146484375 5%|▍ | 24/492 [12:46<4:03:31, 31.22s/it] {'loss': 0.0495, 'learning_rate': 8.123811710560552e-06, 'epoch': 0.05} 5%|▍ | 24/492 [12:46<4:03:31, 31.22s/it]predicted value: tensor([[1.2500], [0.8867], [0.8750], [0.9609], [1.3516], [0.9570], [0.7109], [1.4922], [1.0547], [1.2656], [1.4219], [0.8750], [1.0078], [0.9727], [0.5742], [0.6680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.4668], [0.8008], [1.0000], [0.3750], [0.7500], [1.0000], [0.7500], [1.0000], [1.0000], [0.5000], [0.4004], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0595703125 loss: 0.0439453125 loss: 0.06494140625 loss: 0.06787109375 predicted value: tensor([[0.7539], [1.2578], [0.7383], [0.8438], [1.2500], [1.1016], [1.1953], [1.2812], [0.8516], 
[1.1719], [0.8555], [0.9375], [0.8516], [0.9297], [0.7070], [0.4844]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.6680], [0.2500], [0.8008], [0.2500], [0.4668], [0.8008], [0.5000], [0.5000], [0.2500], [0.3340], [0.6016], [0.5000], [0.1426], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.051025390625 loss: 0.0810546875 loss: 0.036865234375 loss: 0.07275390625 predicted value: tensor([[1.3047], [1.4375], [1.0078], [1.1484], [1.1172], [1.1172], [1.0156], [0.9727], [1.3359], [0.8516], [0.7656], [1.2109], [1.0234], [0.8125], [0.7305], [1.0078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.3145], [0.2002], [0.5547], [0.6016], [0.6680], [0.8008], [0.8008], [0.5000], [0.5000], [0.4004], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.047119140625 loss: 0.08056640625 loss: 0.045166015625 loss: 0.05859375 predicted value: tensor([[1.1953], [1.3359], [1.3281], [0.8555], [1.0859], [0.7617], [1.0469], [1.1797], [1.0469], [0.7344], [0.6641], [0.9102], [1.0547], [1.2344], [0.9375], [0.3887]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [0.6680], [0.4668], [0.6680], [0.7500], [0.3750], [0.6016], [0.2500], [0.4004], [0.4004], [0.7500], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.057373046875 loss: 0.06689453125 loss: 0.07177734375 loss: 0.060546875 5%|▌ | 25/492 [13:16<4:02:27, 31.15s/it] {'loss': 0.2415, 'learning_rate': 8.228161798644422e-06, 'epoch': 0.05} 5%|▌ | 25/492 [13:16<4:02:27, 31.15s/it]predicted value: tensor([[1.3750], [1.0234], [0.7344], [0.9023], [0.8438], [0.8281], [1.1172], [0.5586], [0.8711], [0.8047], [0.8633], [1.2188], [1.1797], [0.6562], [1.0078], [0.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3340], [0.2500], [0.8008], [0.3750], [0.8320], [0.2500], [0.6016], 
[0.7500], [0.5000], [1.0000], [0.7500], [0.1670], [0.6016], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0537109375 loss: 0.037353515625loss: 0.03466796875 loss: 0.06591796875 predicted value: tensor([[1.1797], [1.2188], [0.9375], [1.3359], [1.1094], [0.9141], [1.1328], [0.7734], [1.0547], [0.8906], [1.0156], [0.7305], [0.8086], [0.9219], [0.8516], [0.7695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.2500], [1.0000], [0.5000], [0.5547], [0.8008], [0.7500], [0.6016], [0.5000], [0.5000], [0.5000], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0458984375 loss: 0.052978515625 loss: 0.07080078125 loss: 0.0458984375 predicted value: tensor([[1.1719], [0.7930], [0.8711], [1.3672], [0.8359], [1.3125], [1.2656], [0.8203], [0.7070], [0.9414], [0.7227], [0.8750], [0.8477], [0.9609], [0.6797], [0.7188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [1.0000], [0.2500], [1.0000], [1.0000], [0.6016], [0.2500], [0.4668], [0.0625], [0.3340], [0.5000], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0400390625 loss: 0.050048828125loss: 0.055419921875 loss: 0.07080078125 predicted value: tensor([[1.1875], [1.3516], [1.1094], [0.8125], [0.7891], [1.2188], [0.8633], [1.0156], [1.1406], [1.1875], [1.0781], [0.7227], [1.0078], [0.7852], [0.9375], [0.8789]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.6680], [0.5547], [0.5703], [1.0000], [0.4668], [0.4668], [1.0000], [0.7500], [0.7500], [0.4004], [0.7500], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.042236328125 loss: 0.048828125 loss: 0.0771484375 loss: 0.04443359375 5%|▌ | 26/492 [13:48<4:01:52, 31.14s/it] {'loss': 0.209, 'learning_rate': 8.328418655771438e-06, 'epoch': 0.05} 5%|▌ | 26/492 [13:48<4:01:52, 31.14s/it]predicted value: tensor([[0.8906], [0.3809], 
[0.6562], [0.9258], [0.3945], [0.5312], [0.4648], [0.4766], [0.4375], [0.3770], [0.4492], [0.1865], [0.3672], [0.3066], [0.1279], [0.3574]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3750], [1.0000], [0.4648], [0.4668], [0.7500], [0.3340], [0.8008], [0.4004], [0.2500], [0.5000], [0.4004], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006256103515625 loss: 0.008544921875loss: 0.009765625 loss: 0.015625 predicted value: tensor([[0.7617], [0.6211], [0.3867], [0.3633], [0.4023], [0.7695], [0.5039], [0.4277], [0.2334], [0.3711], [0.5586], [0.4492], [0.8242], [0.1172], [0.1982], [0.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.2500], [0.4668], [0.7500], [1.0000], [0.4668], [0.2002], [0.7500], [0.5000], [0.6680], [0.4668], [0.6016], [0.0278], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01446533203125 loss: 0.011474609375loss: 0.007568359375 loss: 0.0108642578125 predicted value: tensor([[0.5156], [0.6055], [0.6328], [0.4043], [0.3340], [0.8047], [0.2891], [0.8242], [0.8555], [0.4453], [0.6055], [0.6992], [0.6250], [0.2871], [0.3047], [0.1484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.5547], [0.8008], [0.4668], [1.0000], [0.6016], [1.0000], [1.0000], [0.7500], [0.5000], [0.4668], [0.6016], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005615234375 loss: 0.0107421875loss: 0.0133056640625 loss: 0.007080078125 predicted value: tensor([[0.4258], [0.5391], [0.8750], [0.3926], [0.4980], [0.4980], [0.5391], [0.8789], [0.5195], [0.4395], [0.5938], [0.5312], [0.1270], [0.5391], [0.1270], [0.5312]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.7148], [1.0000], [0.4668], [0.8008], [0.4668], [0.7500], [1.0000], [0.7148], [0.3340], [1.0000], [0.4004], [0.4004], [0.2002], [0.2500], [0.2500]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.01171875 loss: 0.01171875 loss: 0.0093994140625 loss: 0.01361083984375 5%|▌ | 27/492 [14:19<4:01:05, 31.11s/it] {'loss': 0.0419, 'learning_rate': 8.424891319481442e-06, 'epoch': 0.05} 5%|▌ | 27/492 [14:19<4:01:05, 31.11s/it]predicted value: tensor([[0.4844], [0.5703], [0.3809], [0.4297], [0.3105], [0.5078], [0.4922], [0.5156], [0.2988], [0.5898], [0.3203], [0.3223], [0.4629], [0.0388], [0.2197], [0.2598]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [0.6016], [0.4668], [0.0625], [0.6680], [0.6016], [0.6016], [0.2500], [0.7500], [0.5000], [0.4004], [0.4004], [0.0278], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0086669921875 loss: 0.00634765625loss: 0.0086669921875 loss: 0.01263427734375 predicted value: tensor([[0.9766], [0.4902], [0.5781], [0.3965], [0.4746], [0.9102], [0.5625], [0.5117], [0.5195], [0.6250], [0.6016], [0.2754], [0.3359], [0.1611], [0.1069], [0.4160]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.3750], [0.8008], [1.0000], [0.2500], [0.4668], [0.3750], [0.4668], [0.6016], [0.5000], [0.2002], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00482177734375 loss: 0.0087890625loss: 0.005401611328125 loss: 0.00830078125 predicted value: tensor([[0.6211], [0.4414], [0.5312], [0.5859], [1.2031], [0.6016], [0.9453], [0.3750], [0.4238], [0.3945], [0.6328], [0.4883], [0.3438], [0.0938], [0.0374], [0.1855]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [0.8008], [0.8320], [1.0000], [0.8008], [1.0000], [0.4668], [0.7500], [0.3340], [0.5000], [0.2500], [0.3340], [0.0400], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005523681640625 loss: 0.0081787109375 loss: 0.00616455078125 loss: 0.01141357421875 predicted value: tensor([[0.2832], [0.4688], [0.4727], [0.6484], [0.3672], [0.6094], [0.9180], 
[0.6602], [0.2910], [1.0078], [0.8242], [0.4980], [0.1406], [0.0977], [0.1494], [0.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.3750], [0.6680], [0.3340], [1.0000], [1.0000], [0.4668], [0.2500], [1.0000], [1.0000], [0.2852], [0.0625], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005126953125 loss: 0.01251220703125loss: 0.0030059814453125 loss: 0.00909423828125 6%|▌ | 28/492 [14:50<4:00:40, 31.12s/it] {'loss': 0.0312, 'learning_rate': 8.517855098376436e-06, 'epoch': 0.06} 6%|▌ | 28/492 [14:50<4:00:40, 31.12s/it]predicted value: tensor([[0.4707], [1.0625], [0.6094], [0.5508], [0.8828], [1.0547], [1.0000], [0.7578], [1.0078], [0.8047], [0.6914], [0.5938], [0.3008], [0.5312], [0.8750], [0.3457]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.1670], [0.3750], [0.2002], [0.2500], [0.2500], [0.4668], [0.8008], [0.8008], [0.3145], [0.2002], [0.2500], [0.5000], [0.1670], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.032958984375 loss: 0.053466796875loss: 0.046142578125 loss: 0.0224609375 predicted value: tensor([[0.9180], [0.7500], [1.2500], [0.8203], [0.9648], [0.8594], [0.8477], [0.8398], [0.7148], [0.9336], [1.1797], [0.6719], [0.7500], [0.5117], [0.4297], [0.4766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.8320], [0.3340], [0.8008], [0.8008], [0.5000], [0.8008], [0.7500], [0.7500], [1.0000], [0.2500], [0.4004], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.05224609375 loss: 0.0198974609375loss: 0.04541015625 loss: 0.05126953125 predicted value: tensor([[0.6602], [0.7773], [1.2109], [0.8047], [1.4609], [1.0000], [0.7422], [0.8438], [0.9805], [0.9492], [0.8828], [0.9336], [0.6133], [0.6406], [0.6367], [0.3965]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.4668], [1.0000], [0.6016], [1.0000], [0.3340], 
[0.2500], [0.4668], [0.5000], [0.6016], [0.6016], [0.7500], [0.4004], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.038330078125 loss: 0.03564453125loss: 0.042724609375 loss: 0.038330078125 predicted value: tensor([[0.9609], [0.6641], [0.7773], [0.8789], [0.8984], [0.9492], [0.8164], [0.9258], [0.9062], [1.4297], [0.8438], [1.0781], [0.3770], [0.5625], [0.6914], [0.4434]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [0.5547], [0.4668], [0.4668], [0.3340], [0.8008], [0.6016], [0.8320], [0.5000], [1.0000], [0.3340], [0.6016], [0.2500], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.03076171875 loss: 0.032470703125 loss: 0.06640625 loss: 0.0244140625 6%|▌ | 29/492 [15:21<3:59:31, 31.04s/it] {'loss': 0.1582, 'learning_rate': 8.607556308626424e-06, 'epoch': 0.06} 6%|▌ | 29/492 [15:21<3:59:31, 31.04s/it]predicted value: tensor([[0.5195], [0.9727], [0.6406], [0.9531], [1.1016], [0.5898], [1.3516], [0.9180], [1.4531], [0.9336], [0.7148], [0.6758], [0.7305], [0.6562], [0.5195], [0.5000]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.3750], [0.8008], [0.6680], [0.7500], [1.0000], [0.2715], [1.0000], [0.4668], [0.2500], [0.7500], [0.4004], [0.6016], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.044921875 loss: 0.031005859375loss: 0.025634765625 loss: 0.033447265625 predicted value: tensor([[0.7617], [0.9648], [0.7344], [1.0469], [1.0391], [0.9727], [0.6602], [0.8164], [0.9336], [0.7188], [0.9688], [0.9492], [0.6172], [0.5391], [0.4688], [0.5664]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.4668], [0.3145], [0.8008], [0.4668], [0.3340], [0.4668], [0.8008], [0.8008], [0.6680], [0.4668], [0.2500], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0289306640625 loss: 0.036865234375loss: 0.033203125 loss: 0.0252685546875 predicted value: 
tensor([[1.2031], [0.5352], [0.8242], [0.7305], [0.8438], [0.9336], [1.3125], [0.7930], [1.0938], [0.8086], [0.7617], [0.7656], [0.8750], [1.1250], [0.5039], [0.4414]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.3750], [0.3750], [0.2002], [0.6016], [1.0000], [0.2500], [0.8008], [0.7500], [0.3340], [0.5000], [0.5000], [1.0000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.032470703125 loss: 0.030517578125loss: 0.0277099609375 loss: 0.0322265625 predicted value: tensor([[0.6836], [0.7266], [0.9688], [0.8516], [0.9297], [0.8516], [0.6562], [1.0312], [0.7969], [0.6172], [0.8008], [0.8867], [0.7578], [0.6562], [0.5273], [0.4648]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [0.4668], [0.4668], [0.8320], [0.3340], [1.0000], [0.8008], [0.3340], [0.3750], [0.5000], [0.5000], [0.5000], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0198974609375 loss: 0.03369140625 loss: 0.033203125 loss: 0.025390625 6%|▌ | 30/492 [15:52<3:59:20, 31.08s/it] {'loss': 0.1236, 'learning_rate': 8.694216207171605e-06, 'epoch': 0.06} 6%|▌ | 30/492 [15:52<3:59:20, 31.08s/it]predicted value: tensor([[ 0.3965], [ 0.2773], [ 0.7461], [ 0.4238], [ 0.3027], [ 0.9922], [ 0.1309], [ 0.5000], [ 0.2910], [ 0.3594], [ 0.4121], [ 0.2734], [ 0.1387], [ 0.2344], [-0.0272], [-0.1914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4004], [0.2002], [1.0000], [0.6016], [0.2500], [1.0000], [0.2500], [0.6016], [0.6016], [0.3750], [0.3340], [0.0400], [0.2002], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00787353515625 loss: 0.0087890625loss: 0.0093994140625 loss: 0.006439208984375 predicted value: tensor([[ 0.5352], [ 0.7422], [ 0.4473], [ 0.6406], [ 0.2441], [ 1.0625], [ 0.8164], [ 0.8750], [ 0.4844], [ 0.6172], [ 0.3906], [ 0.2129], [ 0.4258], [-0.0305], [-0.0046], [ 0.0820]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.6680], [0.8008], [0.5000], [1.0000], [0.8008], [1.0000], [0.3750], [0.6016], [0.4004], [0.2002], [0.7500], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01104736328125 loss: 0.007080078125loss: 0.0068359375 loss: 0.01153564453125 predicted value: tensor([[ 1.0000], [ 0.2393], [ 0.9609], [ 0.5508], [ 0.1689], [ 1.0938], [ 0.3867], [ 0.3848], [ 0.4375], [ 0.3320], [ 0.4863], [ 0.3477], [-0.0114], [ 0.3105], [ 0.0415], [ 0.0654]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [1.0000], [0.8008], [0.3340], [1.0000], [0.3340], [0.1670], [0.3750], [0.7500], [0.6016], [0.2500], [0.0278], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0098876953125 loss: 0.00689697265625 loss: 0.007293701171875 loss: 0.0113525390625 predicted value: tensor([[ 0.4121], [ 0.5547], [ 0.6758], [ 0.5078], [ 0.3906], [ 1.1094], [ 1.0625], [ 0.2891], [ 0.4395], [ 0.4824], [ 0.4531], [ 0.3320], [ 0.0209], [ 0.0593], [ 0.1846], [-0.0044]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2002], [0.5547], [0.4668], [0.3750], [1.0000], [1.0000], [0.5000], [0.6016], [0.2002], [0.5000], [0.3340], [0.0278], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0142822265625 loss: 0.00628662109375 loss: 0.0093994140625 loss: 0.005126953125 6%|▋ | 31/492 [16:23<3:59:18, 31.15s/it] {'loss': 0.0349, 'learning_rate': 8.778034279758329e-06, 'epoch': 0.06} 6%|▋ | 31/492 [16:23<3:59:18, 31.15s/it]predicted value: tensor([[ 4.3945e-01], [ 1.0547e+00], [ 5.8203e-01], [ 3.8477e-01], [ 3.9453e-01], [ 1.1016e+00], [ 2.5781e-01], [ 5.7031e-01], [ 1.0625e+00], [ 2.8516e-01], [ 4.6484e-01], [ 4.4336e-01], [ 5.1562e-01], [ 5.0000e-01], [-4.1504e-02], [ 6.1989e-05]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.2002], [0.5547], 
[1.0000], [0.7500], [0.5000], [1.0000], [0.2500], [0.6016], [0.6680], [0.6016], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00921630859375 loss: 0.0089111328125 loss: 0.00860595703125 loss: 0.00787353515625 predicted value: tensor([[ 0.3203], [ 0.5078], [ 0.3906], [ 0.5586], [ 0.5586], [ 0.9961], [ 0.8867], [ 0.5547], [ 0.3379], [ 0.3789], [ 0.6094], [ 1.1250], [ 0.2578], [ 0.1299], [ 0.3711], [-0.0869]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.2500], [0.6680], [0.5547], [1.0000], [1.0000], [0.7500], [0.2500], [0.5000], [0.4668], [1.0000], [0.3340], [0.2002], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00836181640625 loss: 0.00555419921875loss: 0.006134033203125 loss: 0.01019287109375 predicted value: tensor([[0.4727], [1.0078], [0.9258], [0.4688], [0.4590], [0.5625], [0.6797], [0.7500], [0.5625], [0.4238], [0.6680], [0.2852], [0.2637], [0.3672], [0.2012], [0.0325]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [0.8008], [0.8008], [0.8008], [0.6016], [1.0000], [0.6016], [0.6016], [0.6016], [0.5000], [0.2500], [0.6016], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00579833984375 loss: 0.00958251953125 loss: 0.0064697265625 loss: 0.0072021484375 predicted value: tensor([[0.5938], [0.5039], [0.4648], [0.7305], [0.9375], [0.4199], [0.4941], [0.5234], [0.2793], [0.4434], [0.2969], [0.2637], [0.5508], [0.0669], [0.1865], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.4668], [0.8008], [1.0000], [0.6016], [0.6016], [0.7500], [0.2500], [0.3340], [0.5000], [0.2500], [0.4668], [0.2002], [0.2002], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00885009765625 loss: 0.0034332275390625 loss: 0.0057373046875loss: 0.00750732421875 7%|▋ | 32/492 [16:54<3:58:30, 31.11s/it] {'loss': 0.0299, 'learning_rate': 8.859191006777897e-06, 
'epoch': 0.07} 7%|▋ | 32/492 [16:54<3:58:30, 31.11s/it]predicted value: tensor([[0.4922], [1.2109], [1.2812], [0.7617], [1.2734], [0.6836], [1.2734], [0.8633], [0.9492], [0.7188], [0.7422], [0.7266], [0.6914], [0.5352], [0.5742], [0.6992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [0.4668], [1.0000], [0.3340], [1.0000], [0.8008], [0.6016], [0.2500], [0.4004], [0.7500], [0.4004], [0.3340], [0.0278], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0164794921875 loss: 0.037109375 loss: 0.0234375 loss: 0.015869140625 predicted value: tensor([[0.6836], [0.9570], [0.7695], [1.2031], [0.6719], [0.6562], [0.9570], [1.2500], [0.8438], [0.9531], [0.6641], [0.7773], [0.4551], [0.4121], [0.3027], [0.5234]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.8008], [1.0000], [0.6016], [0.2500], [0.8008], [1.0000], [0.2002], [0.7500], [0.4668], [0.3340], [0.2500], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.02099609375 loss: 0.0223388671875loss: 0.024169921875 loss: 0.025634765625 predicted value: tensor([[1.5469], [0.7188], [1.1562], [1.2500], [0.7344], [0.9414], [1.3047], [1.2891], [0.8984], [0.7578], [1.2734], [0.6367], [1.3125], [0.4160], [0.4746], [0.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [1.0000], [0.4668], [0.5547], [1.0000], [1.0000], [0.6016], [0.4277], [1.0000], [0.6016], [1.0000], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.02392578125 loss: 0.022705078125 loss: 0.026123046875 loss: 0.02294921875 predicted value: tensor([[0.7734], [0.6094], [0.8125], [0.8398], [1.3984], [1.3438], [0.7031], [1.3438], [1.0469], [0.6914], [0.8594], [0.9883], [0.5430], [0.6016], [0.5312], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.8320], [1.0000], [1.0000], [0.2500], 
[1.0000], [0.8320], [0.4004], [0.6016], [0.4668], [0.3340], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.02294921875 loss: 0.0274658203125 loss: 0.0263671875 loss: 0.01544189453125 7%|▋ | 33/492 [17:25<3:58:01, 31.11s/it] {'loss': 0.0935, 'learning_rate': 8.937850203372744e-06, 'epoch': 0.07} 7%|▋ | 33/492 [17:25<3:58:01, 31.11s/it]predicted value: tensor([[0.6289], [1.3828], [0.9336], [0.8633], [0.8633], [0.8555], [0.7305], [0.6133], [0.6367], [0.7695], [0.6719], [0.6758], [0.5117], [0.3164], [0.4062], [0.4238]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6094], [1.0000], [0.5547], [0.8008], [0.8008], [0.6680], [0.8008], [0.5000], [0.5000], [0.5000], [0.7500], [0.2500], [0.5000], [0.1670], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.023681640625 loss: 0.0211181640625 loss: 0.01190185546875 loss: 0.01385498046875 predicted value: tensor([[1.2969], [0.8750], [0.7852], [0.7695], [1.2109], [0.6641], [0.7227], [0.8320], [0.7773], [0.6016], [0.7891], [0.6992], [0.5977], [0.5664], [0.2793], [0.5703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.5547], [0.8008], [1.0000], [0.6016], [0.2002], [0.3750], [0.6016], [0.0400], [0.6016], [0.6016], [0.4004], [0.4004], [0.2002], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0225830078125 loss: 0.0225830078125 loss: 0.0213623046875 loss: 0.0203857421875 predicted value: tensor([[1.4141], [0.7188], [1.4844], [0.7773], [0.6875], [0.6250], [0.6172], [1.2734], [0.7305], [0.9570], [0.6016], [0.5273], [0.5977], [0.6758], [0.4922], [0.4180]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [1.0000], [0.4668], [0.2500], [0.2002], [0.2002], [1.0000], [0.4004], [0.3750], [0.4004], [0.4004], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.03125 loss: 0.02978515625loss: 0.0152587890625 loss: 0.02197265625 predicted 
value: tensor([[0.6172], [0.6602], [1.2188], [0.8438], [0.8242], [0.7695], [0.6641], [0.6914], [0.7930], [0.5273], [0.8945], [0.6289], [0.7109], [0.5430], [0.3379], [0.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [1.0000], [0.8008], [0.6016], [0.3340], [0.7500], [0.6680], [0.6680], [0.4004], [0.4668], [0.6016], [0.7500], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0093994140625 loss: 0.01513671875 loss: 0.0198974609375 loss: 0.0174560546875 7%|▋ | 34/492 [17:57<3:58:59, 31.31s/it] {'loss': 0.0794, 'learning_rate': 9.014161010104347e-06, 'epoch': 0.07} 7%|▋ | 34/492 [17:57<3:58:59, 31.31s/it]predicted value: tensor([[0.4531], [0.5195], [0.3262], [0.4707], [0.8398], [0.4004], [0.3320], [0.4258], [1.0078], [0.9102], [0.2930], [0.2754], [0.4570], [0.0500], [0.1777], [0.0972]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.8008], [0.8008], [0.4668], [1.0000], [0.2002], [0.4668], [0.6016], [1.0000], [1.0000], [0.1426], [0.4004], [0.2500], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00384521484375 loss: 0.01251220703125loss: 0.01092529296875 loss: 0.00799560546875 predicted value: tensor([[0.2305], [0.5117], [0.8750], [0.2988], [0.4062], [0.4043], [0.3867], [0.5156], [0.5703], [0.4863], [0.6914], [0.6797], [0.4883], [0.3145], [0.3906], [0.2021]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [1.0000], [0.4668], [0.3750], [0.4668], [0.6016], [0.6016], [0.6016], [0.6016], [1.0000], [0.8008], [0.5000], [0.0400], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0108642578125 loss: 0.006195068359375 loss: 0.006317138671875 loss: 0.0050048828125 predicted value: tensor([[0.4160], [0.5039], [0.4492], [0.6406], [0.5156], [0.3145], [0.5352], [0.4551], [0.5859], [0.5898], [0.5625], [0.2100], [0.5039], [0.4570], [0.6680], [0.1699]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [0.5547], [0.8008], [0.2500], [0.8008], [0.8008], [0.6016], [0.7500], [0.6016], [0.4004], [0.5000], [0.5000], [0.4668], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004638671875 loss: 0.006134033203125 loss: 0.007568359375 loss: 0.00653076171875 predicted value: tensor([[0.6992], [0.8984], [0.4375], [0.1885], [0.3945], [0.2734], [0.5195], [0.3633], [0.5430], [0.3555], [0.5586], [0.3965], [0.3984], [0.1680], [0.3457], [0.2061]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.8320], [0.2500], [0.3340], [0.4668], [0.3750], [0.6016], [0.4668], [0.7500], [0.6016], [0.3340], [0.4004], [0.2500], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007568359375loss: 0.0034027099609375 loss: 0.009521484375 loss: 0.002685546875 7%|▋ | 35/492 [18:29<3:58:53, 31.36s/it] {'loss': 0.0279, 'learning_rate': 9.08825959498749e-06, 'epoch': 0.07} 7%|▋ | 35/492 [18:29<3:58:53, 31.36s/it]predicted value: tensor([[0.7148], [1.0547], [0.4746], [0.8242], [0.6406], [0.8516], [0.8945], [0.6211], [0.4062], [0.8594], [0.1797], [0.1206], [0.4941], [0.2217], [0.1953], [0.0149]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6992], [1.0000], [0.8008], [1.0000], [0.6680], [1.0000], [1.0000], [0.6680], [0.2500], [1.0000], [0.0400], [0.2002], [0.5000], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001922607421875 loss: 0.004241943359375 loss: 0.0130615234375 loss: 0.006866455078125 predicted value: tensor([[0.5586], [0.2246], [0.2949], [0.9062], [0.4629], [0.3203], [0.4199], [0.5000], [0.6523], [0.3945], [0.4355], [0.3633], [0.3281], [0.1816], [0.3125], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3340], [0.2002], [1.0000], [0.8008], [0.2500], [0.4668], [0.4668], [0.7500], [0.2500], [0.7500], [0.2002], [0.2500], [0.2002], [0.2002], 
[0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00872802734375 loss: 0.006622314453125loss: 0.00836181640625 loss: 0.002899169921875 predicted value: tensor([[0.3379], [0.3594], [0.1699], [0.4453], [0.7656], [0.8477], [0.5195], [0.4512], [0.7070], [0.8398], [0.5234], [0.3477], [0.5312], [0.0615], [0.1992], [0.1865]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.4668], [1.0000], [1.0000], [0.4668], [0.4668], [0.7148], [1.0000], [0.7500], [0.5000], [0.6016], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0050048828125 loss: 0.00555419921875loss: 0.00311279296875 loss: 0.007476806640625 predicted value: tensor([[0.8086], [0.5117], [0.2246], [0.3379], [0.5000], [0.4883], [0.4023], [0.5508], [0.4883], [0.4902], [0.4219], [0.4551], [0.4082], [0.0820], [0.3477], [0.2412]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.2500], [0.3340], [0.4668], [0.6680], [0.2500], [0.8008], [0.5000], [0.8008], [0.4668], [0.7500], [0.3340], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007354736328125 loss: 0.004486083984375 loss: 0.006072998046875loss: 0.00677490234375 7%|▋ | 36/492 [19:00<3:59:12, 31.47s/it] {'loss': 0.0246, 'learning_rate': 9.160270615698787e-06, 'epoch': 0.07} 7%|▋ | 36/492 [19:00<3:59:12, 31.47s/it]predicted value: tensor([[0.7539], [0.6367], [0.9375], [0.6016], [0.6836], [0.5352], [1.2109], [0.6250], [0.7344], [1.1328], [0.6914], [0.8164], [0.6367], [0.3867], [0.7109], [0.4902]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8203], [0.4668], [0.8008], [0.4668], [0.5000], [0.2500], [1.0000], [0.3340], [0.2002], [1.0000], [0.5000], [0.5000], [0.3340], [0.1670], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.025390625 loss: 0.01556396484375loss: 0.020263671875 loss: 0.01043701171875 predicted value: tensor([[0.8086], [0.7852], [0.7617], [0.7227], 
[0.8125], [0.6602], [0.6562], [0.7305], [0.7500], [0.6719], [0.6719], [0.6406], [1.1094], [0.4199], [0.7383], [0.4414]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3750], [0.7148], [0.6680], [0.5000], [0.3750], [0.3340], [0.2500], [0.6016], [0.2500], [0.4004], [0.4004], [1.0000], [0.2002], [0.7500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.016845703125 loss: 0.0169677734375 loss: 0.017822265625 loss: 0.0172119140625 predicted value: tensor([[0.8438], [0.7812], [0.7188], [0.5469], [0.8164], [0.5977], [0.7422], [0.8203], [0.7031], [0.6875], [0.6875], [0.6406], [0.6758], [0.3418], [0.3887], [0.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.3750], [0.2500], [0.8008], [0.2500], [0.3750], [0.6680], [0.7500], [0.5000], [0.7500], [0.6016], [0.5000], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01483154296875 loss: 0.01019287109375loss: 0.018798828125 loss: 0.0172119140625 predicted value: tensor([[0.6797], [0.7695], [1.2812], [0.6016], [0.7227], [1.1172], [0.7656], [0.8516], [0.6680], [1.1016], [0.8086], [0.6719], [0.6484], [0.7500], [0.4902], [0.4453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [1.0000], [0.2500], [0.4668], [1.0000], [0.6016], [0.5547], [0.6016], [1.0000], [0.6016], [0.5000], [0.2002], [0.6016], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0211181640625 loss: 0.0147705078125 loss: 0.0203857421875 loss: 0.01495361328125 8%|▊ | 37/492 [19:32<3:58:58, 31.51s/it] {'loss': 0.0682, 'learning_rate': 9.230308481401767e-06, 'epoch': 0.08} 8%|▊ | 37/492 [19:32<3:58:58, 31.51s/it]predicted value: tensor([[0.6758], [1.1328], [0.6641], [0.8867], [0.8008], [0.4688], [0.7070], [0.5547], [0.6328], [0.6016], [0.4805], [0.6211], [0.4141], [0.3477], [0.4492], [0.4785]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], 
[0.3750], [0.7500], [0.7500], [0.2002], [0.8008], [0.4668], [0.5000], [0.4004], [0.2002], [0.4004], [0.2002], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0126953125 loss: 0.00982666015625loss: 0.0108642578125 loss: 0.024169921875 predicted value: tensor([[1.0391], [0.8281], [0.4219], [1.0078], [0.6367], [0.5312], [0.8164], [1.1172], [1.1641], [1.0469], [0.6016], [0.5078], [0.3945], [0.5859], [0.4707], [0.4258]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.2500], [1.0000], [0.3750], [0.2500], [0.8008], [1.0000], [1.0000], [1.0000], [0.4004], [0.5000], [0.2002], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0093994140625 loss: 0.01214599609375 loss: 0.00982666015625 loss: 0.01470947265625 predicted value: tensor([[0.8086], [0.7969], [0.6562], [1.1562], [0.6523], [0.7344], [1.1250], [1.1562], [0.7266], [0.5977], [0.7578], [0.5508], [0.4961], [0.4141], [0.4648], [0.3770]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.5547], [0.5547], [1.0000], [0.4668], [0.3340], [1.0000], [1.0000], [0.2002], [0.6016], [0.4668], [0.3340], [0.2500], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.021484375 loss: 0.0147705078125loss: 0.018310546875 loss: 0.013916015625 predicted value: tensor([[0.7266], [0.7109], [0.7188], [0.6328], [0.7969], [0.7734], [1.0859], [0.5820], [0.4980], [0.5898], [0.6719], [0.7969], [0.4199], [0.5273], [0.4102], [0.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.5547], [0.4668], [0.6016], [0.4668], [1.0000], [0.5000], [0.4004], [0.4004], [0.3340], [0.8008], [0.2002], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0084228515625 loss: 0.009765625loss: 0.0087890625 loss: 0.01434326171875 8%|▊ | 38/492 [20:03<3:57:27, 31.38s/it] {'loss': 0.0534, 'learning_rate': 9.29847844626434e-06, 'epoch': 0.08} 
8%|▊ | 38/492 [20:03<3:57:27, 31.38s/it]predicted value: tensor([[0.4688], [0.3633], [0.9688], [0.4492], [0.3047], [0.6367], [0.8516], [0.4824], [0.4121], [0.3809], [0.3320], [0.3320], [0.3945], [0.4707], [0.1982], [0.1143]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.7500], [1.0000], [0.6680], [0.2500], [0.5547], [1.0000], [0.4668], [0.8008], [0.4004], [0.3340], [0.5000], [0.6016], [0.3750], [0.1250], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00408935546875 loss: 0.007537841796875 loss: 0.004852294921875 loss: 0.0106201171875 predicted value: tensor([[1.0312], [0.3086], [0.5742], [0.6016], [0.9492], [0.4590], [1.0625], [0.7148], [0.4785], [0.3516], [0.2490], [0.1875], [0.2451], [0.1299], [0.1152], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.4668], [0.8555], [1.0000], [0.4668], [1.0000], [0.8008], [0.5000], [0.2500], [0.4004], [0.4004], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0111083984375 loss: 0.003875732421875 loss: 0.005523681640625 loss: 0.01031494140625 predicted value: tensor([[0.9062], [0.6172], [0.5273], [0.5273], [0.8906], [0.4707], [0.7969], [0.5352], [0.4766], [0.5625], [0.5664], [0.3770], [0.8398], [0.3633], [0.1045], [0.0820]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.4668], [0.4668], [1.0000], [0.8008], [1.0000], [0.8008], [0.6016], [0.4668], [0.8008], [0.5000], [1.0000], [0.7500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006683349609375 loss: 0.0072021484375 loss: 0.0091552734375 loss: 0.00225830078125 predicted value: tensor([[0.5781], [0.5703], [0.8242], [0.3945], [0.3340], [0.4590], [0.4336], [0.5234], [0.4551], [0.2256], [0.4785], [0.3906], [0.2246], [0.1758], [0.3125], [0.0496]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6680], [1.0000], [0.8008], [0.5000], 
[0.7500], [0.6016], [0.6016], [0.6016], [0.0625], [0.5547], [0.4004], [0.5000], [0.2500], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00592041015625 loss: 0.00848388671875loss: 0.00799560546875 loss: 0.006103515625 8%|▊ | 39/492 [20:34<3:55:33, 31.20s/it] {'loss': 0.0279, 'learning_rate': 9.364877560909674e-06, 'epoch': 0.08} 8%|▊ | 39/492 [20:34<3:55:33, 31.20s/it]predicted value: tensor([[1.0000], [0.4395], [0.9375], [0.5703], [0.9844], [0.5430], [0.4590], [0.9609], [0.8047], [0.2812], [0.3750], [0.3906], [0.3887], [0.3809], [0.1475], [0.2090]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [1.0000], [0.8008], [1.0000], [0.8320], [0.3340], [1.0000], [1.0000], [0.4004], [0.2500], [0.4004], [0.7500], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01043701171875 loss: 0.005828857421875 loss: 0.00482177734375 loss: 0.007171630859375 predicted value: tensor([[0.4785], [0.5898], [0.3594], [0.4258], [0.4785], [0.4004], [0.5430], [0.3926], [0.3672], [0.5977], [0.7969], [0.3164], [0.2852], [0.4688], [0.1826], [0.2373]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5312], [0.6680], [0.4668], [0.4668], [0.2500], [0.3750], [0.4668], [0.2500], [0.2002], [0.5547], [1.0000], [0.4004], [0.3340], [0.6016], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005523681640625 loss: 0.00311279296875loss: 0.005462646484375 loss: 0.0062255859375 predicted value: tensor([[0.9414], [0.4863], [0.5156], [0.5117], [0.5312], [1.0312], [0.4414], [0.5664], [0.9844], [0.3359], [0.9922], [0.3555], [0.2754], [0.3047], [0.1465], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.8008], [0.8008], [0.4668], [1.0000], [0.4668], [0.8320], [1.0000], [0.3340], [1.0000], [0.5000], [0.3340], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007659912109375 loss: 0.004638671875 loss: 
0.002197265625 loss: 0.009521484375 predicted value: tensor([[1.0312], [0.3965], [0.6406], [0.4492], [0.2852], [0.4668], [0.6094], [0.5273], [0.5586], [0.3906], [0.2891], [0.2832], [0.4180], [0.2207], [0.0204], [0.1836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.5547], [0.6016], [0.2002], [0.6680], [0.8008], [0.7500], [0.5000], [0.5000], [0.4004], [0.4004], [0.4004], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003875732421875 loss: 0.006011962890625loss: 0.005615234375 loss: 0.005126953125 8%|▊ | 40/492 [21:05<3:54:40, 31.15s/it] {'loss': 0.0233, 'learning_rate': 9.429595503388948e-06, 'epoch': 0.08} 8%|▊ | 40/492 [21:05<3:54:40, 31.15s/it]predicted value: tensor([[0.8633], [0.8203], [0.8359], [0.5664], [0.7617], [0.7266], [0.6680], [0.7578], [0.6445], [0.6250], [0.5586], [0.5195], [1.1484], [0.6172], [0.3477], [0.3555]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.5547], [0.2500], [0.2500], [0.5547], [0.7500], [0.4668], [0.6016], [0.6016], [0.2500], [0.2500], [1.0000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01495361328125 loss: 0.011474609375 loss: 0.01470947265625 loss: 0.015869140625 predicted value: tensor([[0.5977], [0.6992], [0.5234], [0.7539], [0.5664], [0.5977], [0.8047], [1.1719], [0.6328], [0.6367], [0.7461], [0.5820], [0.5312], [0.5820], [0.2676], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.8008], [0.3340], [0.4668], [0.2002], [0.4668], [0.8320], [1.0000], [0.6016], [0.6016], [0.3340], [0.5000], [0.4004], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00689697265625 loss: 0.0111083984375loss: 0.0091552734375 loss: 0.01318359375 predicted value: tensor([[0.5781], [1.2734], [0.7891], [0.8984], [1.1562], [0.8594], [0.9297], [1.1875], [0.6523], [0.6367], [1.1875], [0.6914], [0.5938], [0.2910], [0.5625], 
[0.3398]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.5547], [0.8555], [1.0000], [0.8008], [0.6680], [1.0000], [0.6016], [0.8008], [1.0000], [0.7500], [0.4004], [0.2002], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0150146484375 loss: 0.007049560546875loss: 0.01300048828125 loss: 0.015869140625 predicted value: tensor([[0.7773], [0.9688], [0.7227], [0.6055], [0.6836], [0.7852], [1.2656], [1.1562], [0.6523], [1.0234], [0.6953], [0.4219], [0.7578], [0.8086], [0.1953], [0.4082]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.5547], [0.2500], [0.4668], [0.8008], [1.0000], [1.0000], [0.7500], [1.0000], [0.7500], [0.4004], [0.7500], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01239013671875 loss: 0.01214599609375 loss: 0.01019287109375 loss: 0.0130615234375 8%|▊ | 41/492 [21:36<3:54:21, 31.18s/it] {'loss': 0.049, 'learning_rate': 9.492715307531484e-06, 'epoch': 0.08} 8%|▊ | 41/492 [21:36<3:54:21, 31.18s/it]predicted value: tensor([[0.5977], [1.0938], [0.7461], [0.8242], [0.7539], [0.5820], [0.6914], [0.5938], [1.2109], [1.1484], [0.5859], [0.6250], [0.4551], [0.3809], [0.3574], [0.3848]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.8008], [0.8008], [0.2500], [0.4668], [0.2002], [1.0000], [1.0000], [0.5000], [0.2002], [0.2002], [0.1670], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01220703125 loss: 0.01324462890625 loss: 0.01422119140625 loss: 0.0107421875 predicted value: tensor([[0.8750], [0.8008], [0.8750], [0.5859], [0.6406], [1.2422], [0.6758], [0.7969], [0.6562], [0.7656], [0.5312], [0.5273], [0.5898], [0.3340], [0.2314], [0.4512]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.8008], [0.4668], [0.7500], [1.0000], [0.2500], [0.6680], [0.5000], [0.8008], [0.5000], [0.3340], [0.5000], 
[0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01116943359375 loss: 0.00946044921875loss: 0.006683349609375 loss: 0.0159912109375 predicted value: tensor([[1.3125], [0.6328], [0.6914], [0.8750], [0.7461], [0.5000], [1.1328], [0.4727], [0.6484], [0.7500], [0.7891], [0.7070], [0.4980], [0.3535], [0.3301], [0.4707]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.4668], [0.8008], [0.7500], [0.2500], [1.0000], [0.4004], [0.6016], [0.8008], [0.4668], [0.4004], [0.5000], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01251220703125 loss: 0.0106201171875loss: 0.006439208984375 loss: 0.01507568359375 predicted value: tensor([[0.5977], [0.6875], [0.6992], [0.8203], [0.7148], [1.1094], [0.9062], [0.5781], [0.7461], [0.5352], [0.6211], [0.6641], [0.6172], [0.3145], [0.4531], [0.4668]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.8008], [0.6016], [1.0000], [0.8008], [0.2500], [0.4668], [0.5000], [0.5000], [0.5000], [0.4004], [0.2002], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007720947265625 loss: 0.01141357421875 loss: 0.01202392578125loss: 0.009765625 9%|▊ | 42/492 [22:07<3:53:03, 31.08s/it] {'loss': 0.0448, 'learning_rate': 9.554314003514673e-06, 'epoch': 0.09} 9%|▊ | 42/492 [22:07<3:53:03, 31.08s/it]predicted value: tensor([[0.9844], [0.4199], [0.4883], [0.9453], [0.6484], [0.5742], [0.8945], [0.3750], [0.2480], [0.4922], [0.2109], [0.3652], [0.1963], [0.1553], [0.1182], [0.0967]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.6680], [1.0000], [0.4668], [0.6680], [1.0000], [0.2500], [0.2500], [0.6680], [0.0400], [0.2500], [0.5000], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006988525390625 loss: 0.0076904296875 loss: 0.00567626953125 loss: 0.005828857421875 predicted value: tensor([[0.4258], [0.9141], 
[0.4180], [0.5742], [0.4590], [0.5977], [0.4980], [0.7969], [0.6406], [0.4941], [0.3242], [0.4551], [0.2773], [0.2324], [0.2158], [0.0396]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.3340], [0.4668], [0.4668], [0.8008], [0.4668], [1.0000], [0.7500], [0.2500], [0.5000], [0.4004], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005340576171875 loss: 0.0057373046875loss: 0.00872802734375 loss: 0.006744384765625 predicted value: tensor([[0.3008], [0.9453], [0.5273], [0.2266], [0.4473], [0.9648], [0.4922], [0.3438], [0.3008], [0.6211], [0.4062], [0.3301], [0.3379], [0.1504], [0.0845], [0.1187]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [1.0000], [0.4668], [0.2500], [0.3750], [1.0000], [0.3145], [0.4668], [0.2500], [0.6016], [0.4004], [0.2002], [0.5000], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00634765625 loss: 0.00244140625loss: 0.013427734375 loss: 0.003997802734375 predicted value: tensor([[0.6055], [0.9258], [0.5156], [0.6016], [0.2832], [0.4355], [0.5781], [0.5547], [0.4668], [0.2539], [0.2578], [0.3945], [0.3730], [0.3691], [0.1123], [0.0874]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.8320], [0.2500], [0.4668], [0.6016], [0.6016], [0.7500], [0.4004], [0.4277], [0.5000], [0.5000], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005035400390625 loss: 0.00762939453125 loss: 0.004058837890625 loss: 0.008544921875 9%|▊ | 43/492 [22:38<3:53:09, 31.16s/it] {'loss': 0.0261, 'learning_rate': 9.614463183050538e-06, 'epoch': 0.09} 9%|▊ | 43/492 [22:38<3:53:09, 31.16s/it]predicted value: tensor([[0.5547], [0.4043], [0.3926], [0.4629], [0.6250], [0.4727], [0.7773], [0.6172], [0.3164], [0.4219], [0.3359], [0.3457], [0.2871], [0.2891], [0.1318], [0.1230]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.4668], [0.4668], [0.3750], [0.6680], [0.5547], [0.5000], [1.0000], [0.6016], [0.6016], [0.6016], [0.5000], [0.4004], [0.3340], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033416748046875 loss: 0.004302978515625loss: 0.002655029296875 loss: 0.0036163330078125 predicted value: tensor([[1.0391], [0.9570], [0.9727], [0.5586], [0.8359], [0.8633], [0.6289], [0.5117], [0.2266], [0.3809], [0.3066], [0.4238], [0.3086], [0.2793], [0.1069], [0.1504]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [1.0000], [0.8008], [1.0000], [1.0000], [0.6680], [0.2500], [0.5000], [0.4004], [0.4004], [0.4004], [0.4004], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00830078125 loss: 0.007781982421875 loss: 0.004638671875 loss: 0.00543212890625 predicted value: tensor([[0.4043], [0.5859], [0.5000], [0.4453], [1.0000], [0.4023], [0.4980], [0.4844], [0.3477], [0.3828], [0.3848], [0.3281], [0.2617], [0.0013], [0.1777], [0.0703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3145], [0.3340], [0.3750], [0.3750], [1.0000], [0.4668], [0.5000], [0.6016], [0.5000], [0.6016], [0.5000], [0.4004], [0.2500], [0.2002], [0.1426], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004547119140625 loss: 0.0042724609375loss: 0.005584716796875 loss: 0.00518798828125 predicted value: tensor([[0.4492], [0.5664], [0.4746], [0.5000], [0.7500], [0.8750], [0.3633], [1.0078], [0.4082], [0.2539], [0.5039], [0.3008], [0.3770], [0.3281], [0.0840], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.4668], [0.4668], [0.8320], [1.0000], [0.1670], [1.0000], [0.6016], [0.4004], [0.6680], [0.4004], [0.5000], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023345947265625 loss: 0.00433349609375 loss: 0.0047607421875 loss: 0.007232666015625 9%|▉ | 44/492 [23:09<3:52:54, 31.19s/it] {'loss': 
0.0196, 'learning_rate': 9.673229499590088e-06, 'epoch': 0.09} 9%|▉ | 44/492 [23:09<3:52:54, 31.19s/it]predicted value: tensor([[0.6875], [1.2891], [0.6055], [0.6797], [0.8359], [0.7500], [0.5312], [1.2031], [0.6367], [0.7266], [0.6719], [0.6406], [0.6523], [0.6992], [0.4824], [0.4160]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [1.0000], [0.2500], [0.5547], [0.6680], [0.4668], [0.2500], [1.0000], [0.3750], [0.8008], [0.3340], [0.2500], [0.4004], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0113525390625 loss: 0.01483154296875 loss: 0.0186767578125 loss: 0.01275634765625 predicted value: tensor([[0.6133], [0.7617], [0.8047], [0.7812], [1.2500], [0.5820], [0.7031], [0.8164], [0.7461], [0.7070], [0.5859], [0.6914], [0.5781], [0.6484], [0.4004], [0.3867]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.5547], [0.5547], [1.0000], [0.3340], [0.6680], [0.4668], [0.6016], [0.5000], [0.4004], [0.5000], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.015869140625 loss: 0.0145263671875loss: 0.01409912109375 loss: 0.01708984375 predicted value: tensor([[0.8203], [0.5742], [0.5273], [0.7852], [0.6719], [1.1875], [0.7266], [0.5117], [0.8633], [0.6641], [0.7539], [0.6367], [0.5898], [0.4727], [0.3223], [0.3633]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.3750], [0.2500], [0.3750], [0.4668], [1.0000], [0.7500], [0.2002], [0.6680], [0.7500], [0.3340], [0.3340], [0.5000], [0.0625], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0093994140625 loss: 0.0157470703125 loss: 0.01458740234375 loss: 0.0177001953125 predicted value: tensor([[0.8125], [0.5430], [0.7617], [0.6680], [0.8320], [0.8789], [0.7539], [1.1094], [0.6680], [0.9219], [0.7188], [0.7461], [0.5430], [0.5781], [0.3574], [0.3809]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.8555], [0.2500], [0.6680], [0.5547], [0.6680], [0.8008], [0.4668], [1.0000], [0.2002], [0.8008], [0.5000], [0.7500], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01025390625 loss: 0.01019287109375 loss: 0.01513671875 loss: 0.0091552734375 9%|▉ | 45/492 [23:41<3:53:14, 31.31s/it] {'loss': 0.0553, 'learning_rate': 9.73067511230984e-06, 'epoch': 0.09} 9%|▉ | 45/492 [23:41<3:53:14, 31.31s/it]predicted value: tensor([[0.5977], [0.7617], [0.7344], [0.6953], [0.7266], [0.7188], [0.6016], [1.2734], [0.5430], [0.5625], [0.5273], [1.1484], [0.6445], [0.5664], [0.4199], [0.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.2002], [0.5000], [0.6680], [0.3750], [0.2002], [1.0000], [0.4668], [0.5000], [0.4004], [1.0000], [0.6016], [0.0204], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0186767578125loss: 0.013671875 loss: 0.01544189453125 loss: 0.007568359375 predicted value: tensor([[0.7852], [0.8711], [0.8164], [0.5273], [1.0547], [0.5273], [0.4395], [0.6250], [0.5508], [1.1406], [1.1797], [0.5508], [0.4980], [0.5664], [0.4766], [0.4492]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [0.8008], [0.2500], [1.0000], [0.4668], [0.3340], [0.3750], [0.2002], [1.0000], [1.0000], [0.5000], [0.5000], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00909423828125 loss: 0.0093994140625 loss: 0.0133056640625 loss: 0.01287841796875 predicted value: tensor([[0.6328], [0.6914], [1.2344], [0.6523], [0.6211], [1.2734], [0.5742], [0.8164], [0.8359], [0.5625], [1.2266], [0.7969], [1.1250], [0.6719], [0.4082], [0.3770]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6016], [1.0000], [0.3145], [0.4668], [1.0000], [0.3340], [0.8008], [0.8008], [0.2500], [1.0000], [0.5000], [1.0000], [0.4004], [0.1670], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.01287841796875 loss: 0.01251220703125loss: 0.007568359375 loss: 0.0107421875 predicted value: tensor([[0.7031], [0.7695], [0.6094], [0.8438], [0.7461], [0.7461], [0.6367], [0.7773], [1.2578], [0.7148], [0.8359], [0.4902], [0.7227], [0.5156], [0.5234], [0.3184]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.6680], [0.4668], [0.6680], [0.3750], [0.4668], [0.4668], [0.6016], [1.0000], [0.2500], [0.8008], [0.3340], [0.3340], [0.3340], [0.0400], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01104736328125 loss: 0.011474609375 loss: 0.0169677734375loss: 0.0118408203125 9%|▉ | 46/492 [24:12<3:52:39, 31.30s/it] {'loss': 0.0488, 'learning_rate': 9.786858081296164e-06, 'epoch': 0.09} 9%|▉ | 46/492 [24:12<3:52:39, 31.30s/it]predicted value: tensor([[0.3223], [0.3594], [0.5664], [0.3906], [0.4883], [0.4199], [0.5391], [0.4863], [0.3555], [0.4414], [0.3848], [0.3750], [0.3340], [0.2002], [0.2363], [0.2041]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.8008], [0.3340], [0.4668], [0.4668], [0.4668], [0.7500], [0.3340], [0.3340], [0.5000], [0.3340], [0.4004], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00927734375 loss: 0.003173828125loss: 0.00506591796875 loss: 0.00421142578125 predicted value: tensor([[0.4199], [0.9609], [0.4043], [0.4844], [0.4746], [0.8906], [0.2051], [0.4473], [0.3184], [0.4395], [0.3711], [0.2441], [0.4473], [0.2236], [0.1738], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3750], [0.8008], [0.8008], [1.0000], [0.2500], [0.6680], [0.2500], [0.7500], [0.3340], [0.4004], [0.6016], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002716064453125 loss: 0.0069580078125loss: 0.004150390625 loss: 0.00531005859375 predicted value: tensor([[0.5977], [0.5312], [0.3750], [0.4121], [0.8164], [0.4863], [0.9492], [0.5273], [0.5664], [0.9141], [0.5547], 
[0.3066], [0.3770], [0.3281], [0.6836], [0.2441]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.4668], [0.5547], [1.0000], [0.3145], [1.0000], [0.5000], [0.4668], [1.0000], [0.8008], [0.5000], [0.2500], [0.4004], [0.3750], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00579833984375 loss: 0.0059814453125loss: 0.00543212890625 loss: 0.006103515625 predicted value: tensor([[0.2812], [0.3809], [0.3145], [0.5195], [0.9375], [0.5430], [0.1738], [0.9688], [0.8867], [0.6406], [0.2871], [0.3516], [0.3789], [0.3672], [0.1592], [0.1523]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.3750], [0.3750], [0.5547], [1.0000], [0.3750], [0.2002], [1.0000], [1.0000], [0.5547], [0.4004], [0.4004], [0.6016], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005523681640625 loss: 0.004608154296875loss: 0.00482177734375 loss: 0.0021820068359375 10%|▉ | 47/492 [24:44<3:51:59, 31.28s/it] {'loss': 0.0203, 'learning_rate': 9.841832720226257e-06, 'epoch': 0.1} 10%|▉ | 47/492 [24:44<3:51:59, 31.28s/it]predicted value: tensor([[0.4043], [0.6602], [0.4863], [0.2715], [0.6328], [0.6211], [0.5469], [0.3828], [0.9492], [0.9609], [0.4414], [0.4531], [0.3711], [0.3691], [0.2070], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.5547], [0.3340], [0.8008], [0.8008], [0.6016], [0.7500], [1.0000], [1.0000], [0.7500], [0.4004], [0.1670], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00640869140625loss: 0.0028076171875 loss: 0.0030670166015625 loss: 0.00830078125 predicted value: tensor([[0.2539], [0.7461], [0.9062], [0.7891], [0.4805], [0.6289], [0.4707], [0.5039], [0.5156], [0.5391], [0.8984], [0.5781], [0.3789], [0.3379], [0.2207], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [1.0000], [0.4668], [0.6016], [0.4277], [0.7500], 
[0.6016], [0.7500], [1.0000], [0.6016], [0.2500], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0047607421875 loss: 0.004913330078125loss: 0.005859375 loss: 0.00286865234375 predicted value: tensor([[0.5703], [0.9219], [0.3730], [0.5391], [0.4258], [0.5391], [0.3945], [0.5781], [0.5664], [0.4336], [0.3457], [0.3398], [0.4902], [0.3633], [0.1396], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3750], [0.4668], [0.4668], [0.8008], [0.5000], [0.4668], [0.8008], [0.7500], [0.5000], [0.0625], [0.4004], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00579833984375 loss: 0.005828857421875loss: 0.0032501220703125 loss: 0.0029144287109375 predicted value: tensor([[0.5039], [0.6484], [0.2168], [0.9180], [0.6484], [0.8242], [0.6367], [0.4355], [0.9375], [0.5000], [0.4395], [0.3906], [0.4551], [0.2930], [0.1816], [0.0918]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.3340], [1.0000], [0.8008], [1.0000], [0.8320], [0.8008], [1.0000], [0.7500], [0.5000], [0.7500], [0.6016], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037384033203125 loss: 0.0025482177734375 loss: 0.009033203125loss: 0.004669189453125 10%|▉ | 48/492 [25:15<3:52:12, 31.38s/it] {'loss': 0.0192, 'learning_rate': 9.895649911916131e-06, 'epoch': 0.1} 10%|▉ | 48/492 [25:15<3:52:12, 31.38s/it]predicted value: tensor([[0.7461], [0.6992], [1.2578], [1.2812], [1.0703], [1.1562], [0.6992], [0.8242], [1.2422], [0.5781], [1.2109], [0.7344], [1.1172], [0.4961], [0.4160], [0.4961]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [1.0000], [1.0000], [1.0000], [1.0000], [0.6016], [0.8008], [1.0000], [0.3340], [1.0000], [0.2500], [1.0000], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006805419921875 loss: 0.020751953125 loss: 0.01324462890625 loss: 
0.00872802734375 predicted value: tensor([[0.8516], [1.2188], [0.8555], [0.8203], [0.8398], [0.7539], [0.7617], [0.6680], [0.6836], [0.7891], [0.7852], [0.6758], [0.5938], [0.6523], [0.4727], [0.4570]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.8008], [0.6680], [0.8008], [0.6016], [0.4668], [0.5000], [0.3340], [0.6016], [0.3750], [0.6016], [0.5000], [0.6016], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.014404296875 loss: 0.01055908203125loss: 0.01226806640625 loss: 0.017578125 predicted value: tensor([[1.2734], [0.6016], [0.7734], [0.6562], [0.7109], [1.1484], [0.6836], [0.8203], [1.2031], [0.8594], [0.9336], [0.6172], [0.7578], [0.6055], [0.6719], [0.4551]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.6016], [0.4668], [0.7500], [1.0000], [0.2500], [0.6680], [1.0000], [0.8008], [0.8008], [0.5000], [0.6016], [0.3340], [0.6016], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0107421875loss: 0.01275634765625 loss: 0.00921630859375 loss: 0.0115966796875 predicted value: tensor([[0.6250], [0.8516], [0.5312], [0.9414], [0.3984], [0.7188], [0.7812], [0.5859], [0.7383], [0.7383], [0.6211], [0.7227], [0.5977], [0.6172], [0.5938], [0.4727]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.2500], [0.8320], [0.2002], [0.7500], [0.4668], [0.3340], [0.5000], [0.4668], [0.5000], [0.8008], [0.5000], [0.3340], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.016357421875 loss: 0.015380859375 loss: 0.01336669921875 loss: 0.0111083984375 10%|▉ | 49/492 [25:46<3:50:44, 31.25s/it] {'loss': 0.0512, 'learning_rate': 9.948357391330555e-06, 'epoch': 0.1} 10%|▉ | 49/492 [25:46<3:50:44, 31.25s/it]predicted value: tensor([[0.6367], [0.7383], [1.1875], [0.7539], [0.7656], [0.6562], [0.6602], [0.8594], [0.5938], [1.1562], [0.7109], [0.6562], [0.7422], [0.5039], [0.4395], [0.3730]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6016], [1.0000], [0.6016], [0.6016], [0.2500], [0.6016], [0.8008], [0.4004], [1.0000], [0.6016], [0.3340], [0.7500], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00921630859375loss: 0.0115966796875 loss: 0.01055908203125 loss: 0.0107421875 predicted value: tensor([[0.8359], [0.6719], [0.5703], [0.8555], [0.7148], [0.7227], [0.4414], [0.7383], [0.6914], [0.8516], [1.0547], [0.5078], [0.6406], [0.4062], [0.4316], [0.3887]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [0.6172], [0.6680], [0.8008], [0.0400], [0.4668], [0.5000], [0.8008], [1.0000], [0.3340], [0.4004], [0.0278], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0098876953125 loss: 0.009521484375 loss: 0.01220703125 loss: 0.00848388671875 predicted value: tensor([[0.6055], [0.4531], [0.8125], [1.1641], [0.7344], [0.6445], [0.7812], [0.6953], [0.7891], [0.7266], [1.1484], [0.6523], [0.6133], [0.4961], [0.4238], [0.3691]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2002], [0.3750], [1.0000], [0.6016], [0.3750], [0.7500], [0.3750], [0.8008], [0.7500], [1.0000], [0.2500], [0.3340], [0.2852], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0181884765625 loss: 0.0137939453125loss: 0.0184326171875 loss: 0.01055908203125 predicted value: tensor([[0.7852], [0.8516], [0.5312], [0.7539], [0.8359], [1.1641], [0.5625], [0.7461], [1.3125], [0.6445], [0.6914], [0.6484], [0.6484], [0.5039], [0.4453], [0.4609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.7148], [0.2500], [0.4668], [0.5547], [1.0000], [0.2500], [0.8008], [1.0000], [0.6016], [0.7500], [0.4277], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01007080078125 loss: 0.01104736328125 loss: 0.01068115234375 loss: 0.0189208984375 10%|█ | 50/492 
[26:17<3:50:30, 31.29s/it] {'loss': 0.0485, 'learning_rate': 1e-05, 'epoch': 0.1} 10%|█ | 50/492 [26:17<3:50:30, 31.29s/it]predicted value: tensor([[0.7227], [0.4609], [0.4492], [0.5000], [0.5859], [0.5625], [0.9531], [0.6133], [0.5078], [0.5234], [0.5508], [0.2734], [0.4043], [0.3516], [0.0864], [0.1445]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6680], [0.5547], [0.5547], [0.8008], [0.5703], [1.0000], [0.8008], [0.6016], [0.5000], [0.7500], [0.0400], [0.1670], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0078125 loss: 0.005401611328125loss: 0.002960205078125 loss: 0.0034027099609375 predicted value: tensor([[0.3887], [0.6836], [0.4922], [0.4492], [0.6719], [0.9922], [0.6172], [0.4492], [0.4062], [0.4336], [0.6406], [0.4668], [0.5820], [0.4766], [0.2520], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.4668], [0.3750], [0.8008], [1.0000], [0.8320], [0.6016], [0.3340], [0.5000], [0.6680], [0.4668], [0.5703], [0.7500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003692626953125 loss: 0.003997802734375 loss: 0.004852294921875 loss: 0.003448486328125 predicted value: tensor([[0.4727], [0.4590], [0.6016], [0.5547], [0.4336], [0.3457], [0.5000], [0.8633], [0.2773], [0.5273], [0.4141], [0.4121], [0.4414], [0.3750], [0.2227], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.5547], [0.6680], [0.5547], [0.7500], [0.5000], [1.0000], [0.0400], [0.5000], [0.5000], [0.3340], [0.6016], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0059814453125 loss: 0.00531005859375 loss: 0.00518798828125 loss: 0.003082275390625 predicted value: tensor([[0.4980], [0.5938], [0.3516], [0.8320], [0.2754], [1.0469], [0.5391], [0.9570], [0.4922], [0.5078], [0.6680], [0.4980], [0.5703], [0.3066], [0.1885], [0.2441]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) value label: tensor([[0.4668], [0.8555], [0.3340], [1.0000], [0.2002], [1.0000], [0.6016], [1.0000], [0.3750], [0.6016], [0.8008], [0.4668], [0.7500], [0.2852], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00665283203125 loss: 0.002899169921875loss: 0.0037994384765625 loss: 0.004669189453125 10%|█ | 51/492 [26:49<3:49:24, 31.21s/it] {'loss': 0.0183, 'learning_rate': 1e-05, 'epoch': 0.1} 10%|█ | 51/492 [26:49<3:49:24, 31.21s/it]predicted value: tensor([[0.4512], [0.3906], [0.2715], [0.5898], [0.9102], [0.5898], [0.9648], [0.5938], [0.6992], [0.9922], [0.6055], [0.3711], [0.1533], [0.0918], [0.1631], [0.2148]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], [0.4668], [1.0000], [0.6680], [1.0000], [0.6016], [0.7500], [1.0000], [0.3750], [0.4004], [0.2500], [0.1670], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00153350830078125 loss: 0.0023651123046875loss: 0.0052490234375 loss: 0.0030670166015625 predicted value: tensor([[0.3867], [1.0781], [0.5820], [0.5430], [0.3867], [0.4062], [0.5039], [0.4707], [0.9922], [0.5938], [0.5117], [0.3711], [0.3262], [0.1582], [0.2295], [0.1592]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8008], [0.4668], [0.4668], [0.3340], [0.3750], [0.7500], [1.0000], [0.8008], [0.5000], [0.4004], [0.3340], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003448486328125loss: 0.0037841796875 loss: 0.0031585693359375 loss: 0.00323486328125 predicted value: tensor([[0.4727], [0.6445], [0.6211], [0.4805], [0.4805], [0.6133], [0.9531], [0.3398], [0.7227], [0.5312], [0.4863], [0.4883], [0.3730], [0.3926], [0.1660], [0.1592]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.8008], [0.4668], [0.4668], [0.8008], [1.0000], [0.3340], [0.8008], [0.3340], [0.4668], [0.5000], [0.6016], [0.2500], [0.2002], [0.2500]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.00555419921875 loss: 0.003631591796875 loss: 0.0029144287109375 loss: 0.003875732421875 predicted value: tensor([[0.5195], [0.6484], [0.5586], [0.4492], [0.5547], [0.4688], [0.8672], [0.9648], [0.5117], [0.6016], [0.3574], [0.3594], [0.2656], [0.3398], [0.0845], [0.0542]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.4668], [0.3340], [0.3750], [0.7500], [1.0000], [1.0000], [0.3340], [0.7500], [0.5000], [0.3340], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004364013671875 loss: 0.0029144287109375 loss: 0.003997802734375 loss: 0.00537109375 11%|█ | 52/492 [27:19<3:48:15, 31.13s/it] {'loss': 0.0146, 'learning_rate': 1e-05, 'epoch': 0.11} 11%|█ | 52/492 [27:19<3:48:15, 31.13s/it]predicted value: tensor([[1.2188], [0.5781], [1.1641], [0.6680], [0.4844], [0.7461], [0.5781], [0.7383], [0.7109], [0.6836], [0.5664], [0.6914], [0.6836], [0.4590], [0.5430], [0.3945]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [1.0000], [0.4668], [0.2002], [0.4668], [0.2002], [0.7500], [0.7500], [0.6016], [0.3340], [0.3340], [0.5000], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0167236328125 loss: 0.0113525390625loss: 0.00531005859375 loss: 0.01397705078125 predicted value: tensor([[0.9219], [1.1172], [0.6211], [0.8672], [0.9258], [0.6953], [0.6484], [0.5664], [0.8359], [0.5742], [0.6055], [0.7031], [0.6523], [0.5820], [0.4180], [0.4102]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [1.0000], [0.4668], [0.8008], [0.6680], [0.3750], [0.2002], [0.1670], [0.6016], [0.2002], [0.3340], [0.5000], [0.5000], [0.0400], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0186767578125 loss: 0.0211181640625loss: 0.0159912109375 loss: 0.01104736328125 predicted value: tensor([[1.2422], [1.2344], [1.2578], [0.9258], [0.8359], [0.9062], [0.8242], [0.8359], 
[0.6289], [0.8750], [0.6484], [0.5977], [0.7539], [0.5742], [0.3945], [0.3984]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [1.0000], [0.8008], [0.8008], [0.8008], [0.8320], [0.7500], [0.2002], [0.8008], [0.5000], [0.4004], [0.6016], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01513671875 loss: 0.00933837890625loss: 0.013916015625 loss: 0.013671875 predicted value: tensor([[1.3203], [0.6719], [1.1406], [0.7734], [1.1406], [0.5312], [0.4062], [0.8672], [1.2188], [1.1797], [0.8086], [0.5547], [0.6094], [0.6172], [0.6445], [0.4199]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [1.0000], [0.4668], [1.0000], [0.2500], [0.2002], [0.5547], [1.0000], [1.0000], [0.6016], [0.3340], [0.3340], [0.5000], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01519775390625 loss: 0.01397705078125 loss: 0.0167236328125 loss: 0.01507568359375 11%|█ | 53/492 [27:51<3:48:03, 31.17s/it] {'loss': 0.0568, 'learning_rate': 1e-05, 'epoch': 0.11} 11%|█ | 53/492 [27:51<3:48:03, 31.17s/it]predicted value: tensor([[1.2188], [0.6836], [0.5977], [1.1016], [0.8281], [1.1641], [0.6484], [0.6875], [0.7422], [0.7070], [0.7656], [0.5742], [0.5234], [0.5039], [0.3340], [0.4785]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.2500], [1.0000], [0.8008], [1.0000], [0.6016], [0.2500], [0.6016], [0.6016], [0.3750], [0.5000], [0.4004], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.010498046875 loss: 0.0125732421875 loss: 0.01171875 loss: 0.0140380859375 predicted value: tensor([[0.5977], [0.8984], [0.9023], [0.6367], [0.7891], [0.6289], [0.7539], [0.6250], [0.9570], [0.6836], [0.6680], [0.6367], [0.5156], [0.3535], [0.4004], [0.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4648], [0.8320], [0.4668], [0.8008], [0.2002], [0.8008], [0.4668], 
[0.8320], [0.4277], [0.6016], [0.6016], [0.4004], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.010009765625 loss: 0.010986328125 loss: 0.0108642578125 loss: 0.01507568359375 predicted value: tensor([[0.8438], [0.7812], [0.8320], [0.5625], [1.2500], [0.8359], [0.7422], [0.4121], [1.2734], [0.6172], [0.7344], [1.1484], [0.5547], [0.3203], [0.3301], [0.3086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.6016], [0.8008], [0.4668], [1.0000], [0.8008], [0.7500], [0.2002], [1.0000], [0.5000], [0.6016], [1.0000], [0.4004], [0.3340], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007720947265625loss: 0.0111083984375 loss: 0.015869140625 loss: 0.013671875 predicted value: tensor([[0.9258], [1.3672], [1.1484], [0.5469], [0.7305], [0.5586], [1.1562], [0.6719], [0.7578], [0.7461], [1.2500], [0.6250], [0.6094], [0.5547], [0.3789], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [1.0000], [1.0000], [0.4004], [0.3145], [0.3340], [1.0000], [0.6016], [0.7500], [0.4668], [1.0000], [0.5000], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01287841796875 loss: 0.007232666015625 loss: 0.01300048828125 loss: 0.01239013671875 11%|█ | 54/492 [28:22<3:47:17, 31.14s/it] {'loss': 0.0474, 'learning_rate': 1e-05, 'epoch': 0.11} 11%|█ | 54/492 [28:22<3:47:17, 31.14s/it]predicted value: tensor([[0.3301], [0.5625], [0.6797], [0.4336], [0.9883], [0.5859], [0.3223], [0.9062], [0.5352], [0.5234], [0.4609], [0.5117], [0.4375], [0.1172], [0.2812], [0.1465]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [0.6680], [0.2500], [1.0000], [0.6680], [0.5000], [1.0000], [0.4668], [0.4668], [0.6016], [0.4668], [0.1670], [0.2002], [0.0400], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026397705078125 loss: 0.004364013671875 loss: 0.00201416015625 loss: 0.004364013671875 predicted 
value: tensor([[0.4043], [0.6562], [0.3633], [0.6562], [0.3730], [0.5664], [0.2793], [0.4883], [0.5273], [0.4902], [0.9414], [0.5117], [0.2852], [0.3789], [0.2754], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.4648], [0.6680], [0.3750], [0.6680], [0.3340], [0.2500], [0.7500], [0.2500], [1.0000], [0.7500], [0.4004], [0.6016], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0052490234375 loss: 0.00299072265625 loss: 0.005279541015625 loss: 0.0026092529296875 predicted value: tensor([[0.7812], [0.4297], [0.4043], [0.4258], [0.5156], [0.4902], [0.8945], [0.6211], [0.3809], [0.4883], [1.0000], [0.4746], [0.2930], [0.1377], [0.1572], [0.1504]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.3750], [0.3340], [0.7500], [0.4668], [0.4668], [1.0000], [0.6016], [0.3750], [0.6016], [1.0000], [0.3340], [0.3340], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032196044921875 loss: 0.0027618408203125loss: 0.00518798828125 loss: 0.002349853515625 predicted value: tensor([[0.5703], [0.9883], [0.7031], [0.5703], [0.6094], [0.9766], [0.5547], [0.4492], [0.4824], [0.5391], [0.3887], [0.6484], [0.4023], [0.1582], [0.1982], [0.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [0.8008], [0.4668], [0.4668], [1.0000], [0.7500], [0.4668], [0.6016], [0.7500], [0.5000], [0.8008], [0.2500], [0.2002], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004913330078125 loss: 0.00439453125 loss: 0.003021240234375 loss: 0.00396728515625 11%|█ | 55/492 [28:53<3:46:34, 31.11s/it] {'loss': 0.0148, 'learning_rate': 1e-05, 'epoch': 0.11} 11%|█ | 55/492 [28:53<3:46:34, 31.11s/it]predicted value: tensor([[0.4746], [0.6680], [1.0000], [0.6445], [0.7500], [0.5586], [0.6289], [0.3711], [0.6719], [0.4199], [0.3750], [0.3281], [0.3438], [0.1953], [0.1787], [0.1309]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [1.0000], [0.8008], [0.8320], [0.6680], [0.5703], [0.3340], [0.7500], [0.6016], [0.4004], [0.4004], [0.4004], [0.1670], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034637451171875 loss: 0.00201416015625 loss: 0.004913330078125 loss: 0.00154876708984375 predicted value: tensor([[0.5039], [0.5859], [0.3477], [0.9375], [0.6289], [0.5820], [0.5742], [0.4199], [0.5977], [0.6875], [0.3613], [0.4297], [0.2891], [0.4648], [0.1865], [0.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.4668], [1.0000], [0.6680], [0.8320], [0.7500], [0.7500], [0.8008], [0.8008], [0.5000], [0.1670], [0.4004], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023345947265625 loss: 0.0062255859375loss: 0.0031280517578125 loss: 0.005279541015625 predicted value: tensor([[0.4922], [0.9531], [0.5859], [0.5938], [0.3633], [0.6055], [0.4473], [0.5234], [0.6172], [0.4941], [0.4648], [0.5625], [0.3691], [0.2910], [0.1631], [0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3145], [0.3750], [0.3750], [0.8008], [0.4277], [0.3750], [0.8008], [0.6016], [0.2500], [0.5000], [0.5000], [0.3340], [0.2002], [0.0625]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00445556640625 loss: 0.00555419921875 loss: 0.00567626953125 loss: 0.0045166015625 predicted value: tensor([[0.5664], [1.0234], [0.4570], [1.0547], [0.5625], [0.2031], [0.3320], [0.8828], [0.5117], [0.9102], [0.3301], [0.3105], [0.2812], [0.2559], [0.3770], [0.3828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [1.0000], [0.6680], [0.0400], [0.3340], [1.0000], [0.6016], [1.0000], [0.3340], [0.3340], [0.2500], [0.4004], [0.0625], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00506591796875 loss: 0.0035858154296875 loss: 0.003204345703125 loss: 0.00494384765625 
11%|█▏ | 56/492 [29:24<3:45:42, 31.06s/it] {'loss': 0.0165, 'learning_rate': 1e-05, 'epoch': 0.11} 11%|█▏ | 56/492 [29:24<3:45:42, 31.06s/it]predicted value: tensor([[0.8984], [0.4199], [1.2500], [0.7656], [0.6562], [0.4473], [0.7773], [0.7109], [0.5586], [0.6992], [0.7070], [1.0156], [0.6523], [0.5000], [0.3809], [0.3633]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [1.0000], [0.5000], [0.3750], [0.2500], [0.3750], [0.5000], [0.4004], [0.5000], [0.7500], [1.0000], [0.6016], [0.5000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01202392578125 loss: 0.0096435546875loss: 0.011962890625 loss: 0.0146484375 predicted value: tensor([[0.8008], [0.7383], [0.7227], [0.7578], [0.7812], [0.7891], [0.8203], [1.2500], [0.6836], [0.8203], [0.5820], [0.6172], [0.5430], [0.6367], [0.3633], [0.4082]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [0.4668], [0.5547], [0.3750], [0.8008], [1.0000], [0.4668], [0.4668], [0.5000], [0.5000], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00970458984375 loss: 0.0126953125loss: 0.01202392578125 loss: 0.004852294921875 predicted value: tensor([[0.7461], [1.1875], [0.8477], [1.2969], [0.6836], [0.8516], [0.7188], [0.6836], [0.7891], [0.6523], [1.2188], [0.6055], [0.5078], [0.6055], [0.3809], [0.3926]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8008], [1.0000], [0.3145], [0.8320], [0.3750], [0.7500], [0.4668], [0.4004], [1.0000], [0.3340], [0.4004], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0106201171875 loss: 0.01275634765625 loss: 0.006683349609375 loss: 0.006927490234375 predicted value: tensor([[0.9922], [0.9102], [0.6484], [0.7773], [1.2734], [0.7227], [1.0781], [1.0859], [0.7734], [0.8555], [0.6406], [0.6523], [0.6484], [0.4551], [0.4043], [0.4336]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) value label: tensor([[0.8320], [0.8555], [0.3750], [0.4668], [1.0000], [0.7500], [1.0000], [1.0000], [0.6016], [0.6016], [0.2500], [0.7500], [0.2002], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0162353515625 loss: 0.0135498046875 loss: 0.004791259765625 loss: 0.010986328125 12%|█▏ | 57/492 [29:55<3:45:21, 31.08s/it] {'loss': 0.0425, 'learning_rate': 1e-05, 'epoch': 0.12} 12%|█▏ | 57/492 [29:55<3:45:21, 31.08s/it]predicted value: tensor([[0.8594], [0.8945], [1.1797], [0.8164], [1.1797], [0.9219], [0.6055], [0.5117], [0.6328], [0.6328], [0.6953], [0.5820], [0.5508], [0.4160], [0.3496], [0.3086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [1.0000], [0.8008], [1.0000], [0.8320], [0.2500], [0.2500], [0.5000], [0.2500], [0.6016], [0.5000], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.010498046875 loss: 0.0098876953125 loss: 0.00732421875 loss: 0.00872802734375 predicted value: tensor([[0.5156], [1.0859], [0.8281], [1.2500], [1.2422], [1.1172], [1.1875], [0.5859], [0.6211], [1.2266], [1.2109], [0.5273], [0.5820], [0.3496], [0.3477], [0.4238]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.5547], [1.0000], [1.0000], [1.0000], [1.0000], [0.2500], [0.6016], [1.0000], [1.0000], [0.5000], [0.5000], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01416015625 loss: 0.00848388671875loss: 0.01165771484375 loss: 0.0098876953125 predicted value: tensor([[0.9570], [0.8125], [0.6367], [0.8086], [1.2422], [0.7344], [0.6562], [0.6953], [0.6484], [0.6445], [0.6914], [0.6406], [1.1172], [0.4199], [0.5430], [0.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.4668], [0.8320], [1.0000], [0.4648], [0.3750], [0.6016], [0.7500], [0.5000], [0.4668], [0.5000], [1.0000], [0.2002], [0.4004], [0.2002]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.007110595703125 loss: 0.0084228515625loss: 0.01153564453125 loss: 0.0108642578125 predicted value: tensor([[0.6875], [1.1953], [0.6719], [0.6641], [0.5898], [0.7969], [0.8984], [0.8359], [0.6680], [0.6953], [0.8750], [0.5547], [0.5195], [0.3398], [0.3496], [0.3340]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.4668], [0.3750], [0.4668], [0.5547], [0.3750], [0.4668], [0.6016], [0.8008], [0.3340], [0.4004], [0.1670], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.010986328125 loss: 0.009521484375 loss: 0.0133056640625 loss: 0.01397705078125 12%|█▏ | 58/492 [30:27<3:46:00, 31.25s/it] {'loss': 0.0416, 'learning_rate': 1e-05, 'epoch': 0.12} 12%|█▏ | 58/492 [30:27<3:46:00, 31.25s/it]predicted value: tensor([[0.3105], [0.6992], [0.9805], [0.9414], [0.9648], [0.2832], [0.8398], [0.4805], [0.5508], [0.5117], [0.3301], [0.3574], [0.2520], [0.3828], [0.1836], [0.1357]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [1.0000], [1.0000], [1.0000], [0.2002], [1.0000], [0.7500], [0.4668], [0.8008], [0.4004], [0.6016], [0.4004], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030517578125 loss: 0.00537109375loss: 0.004638671875 loss: 0.00189208984375 predicted value: tensor([[0.4082], [0.6055], [0.6562], [0.5859], [0.5859], [0.3945], [0.5977], [0.8555], [0.3340], [0.5508], [0.4395], [0.2832], [0.3691], [0.0674], [0.2578], [0.1973]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5547], [0.8008], [0.8008], [0.8008], [0.4668], [0.6016], [1.0000], [0.3340], [0.6016], [0.2500], [0.5000], [0.4004], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004547119140625 loss: 0.004180908203125 loss: 0.00274658203125 loss: 0.0030975341796875 predicted value: tensor([[0.3379], [0.4941], [0.6836], [0.4121], [0.5039], [0.4707], [1.0156], [0.4766], 
[0.6094], [0.5938], [0.4785], [0.3730], [0.3418], [0.1650], [0.2871], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.7500], [0.8008], [0.3750], [0.4668], [0.3750], [1.0000], [0.3750], [0.6680], [0.7500], [0.7500], [0.5000], [0.3340], [0.1670], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024566650390625 loss: 0.0036468505859375loss: 0.002655029296875 loss: 0.0050048828125 predicted value: tensor([[0.3711], [0.5469], [0.9375], [0.9570], [0.5391], [0.3301], [0.1118], [0.9023], [0.4785], [0.6133], [0.3887], [0.4902], [0.4551], [0.3867], [0.2178], [0.1729]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [1.0000], [1.0000], [0.6016], [0.2500], [0.0400], [1.0000], [0.3145], [0.8008], [0.3340], [0.2500], [0.5000], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032501220703125 loss: 0.002777099609375 loss: 0.0027008056640625 loss: 0.00183868408203125 12%|█▏ | 59/492 [30:58<3:45:02, 31.18s/it] {'loss': 0.0135, 'learning_rate': 1e-05, 'epoch': 0.12} 12%|█▏ | 59/492 [30:58<3:45:02, 31.18s/it]predicted value: tensor([[1.1328], [0.2129], [0.4805], [0.5469], [0.3281], [0.9609], [0.5938], [1.0234], [0.4629], [0.2754], [0.3125], [0.3672], [0.3223], [0.2402], [0.1914], [0.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [0.3750], [0.6680], [0.3145], [1.0000], [0.4668], [1.0000], [0.7500], [0.4004], [0.4004], [0.2500], [0.5000], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033721923828125 loss: 0.004119873046875 loss: 0.0019378662109375 loss: 0.002532958984375 predicted value: tensor([[1.0234], [0.6641], [0.3984], [0.2432], [0.6055], [0.4277], [0.3594], [0.5703], [0.5234], [0.9062], [0.3984], [0.2812], [0.2695], [0.4336], [0.2441], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.4668], [0.2500], 
[0.8008], [0.2500], [0.5000], [0.4668], [0.5000], [1.0000], [0.4004], [0.3340], [0.2002], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00274658203125 loss: 0.0023651123046875 loss: 0.002349853515625 loss: 0.003814697265625 predicted value: tensor([[1.0781], [0.2812], [0.4902], [0.6602], [0.3398], [0.6250], [0.9727], [0.9648], [0.5195], [0.2559], [0.2949], [0.3457], [0.3340], [0.3730], [0.1758], [0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.4668], [0.4668], [0.6016], [0.8008], [1.0000], [1.0000], [0.7500], [0.2500], [0.5000], [0.3340], [0.5000], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003143310546875 loss: 0.00238037109375 loss: 0.004364013671875 loss: 0.00323486328125 predicted value: tensor([[0.2432], [0.2949], [0.9453], [0.9609], [0.5938], [0.4277], [1.0000], [0.8789], [0.3203], [0.3242], [0.4746], [0.2441], [0.5039], [0.2891], [0.2158], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.3750], [1.0000], [1.0000], [0.8008], [0.2500], [1.0000], [1.0000], [0.3340], [0.2500], [0.6016], [0.2002], [0.4668], [0.4004], [0.1426], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004241943359375 loss: 0.002288818359375loss: 0.0033721923828125 loss: 0.004302978515625 12%|█▏ | 60/492 [31:29<3:44:28, 31.18s/it] {'loss': 0.0126, 'learning_rate': 1e-05, 'epoch': 0.12} 12%|█▏ | 60/492 [31:29<3:44:28, 31.18s/it]predicted value: tensor([[0.8789], [0.6250], [0.7109], [0.6836], [0.5820], [0.6680], [0.8398], [0.7617], [0.6523], [0.7344], [0.7500], [0.4531], [0.6914], [0.6172], [0.5312], [0.3418]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.5547], [0.4668], [0.3340], [0.4668], [0.7500], [0.7500], [0.7500], [0.7500], [0.3750], [0.2500], [0.3340], [0.7500], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01226806640625 loss: 0.0089111328125 
loss: 0.0128173828125 loss: 0.0118408203125 predicted value: tensor([[0.8711], [0.7148], [1.1641], [1.2344], [0.5938], [0.6133], [0.4492], [0.7969], [0.8828], [0.6758], [0.8516], [0.6562], [0.2832], [0.6055], [0.3848], [0.3535]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [1.0000], [1.0000], [0.4668], [0.6016], [0.3340], [0.7500], [0.6016], [0.7500], [0.6680], [0.5000], [0.0400], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01177978515625 loss: 0.009033203125loss: 0.00909423828125 loss: 0.01348876953125 predicted value: tensor([[0.7148], [1.1719], [1.2578], [0.7383], [0.6484], [0.9961], [0.8203], [0.6758], [1.1797], [0.8047], [0.5820], [0.6797], [0.7148], [0.7930], [0.4805], [0.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [1.0000], [1.0000], [0.4668], [0.4668], [0.8008], [0.8008], [0.5000], [1.0000], [0.3750], [0.4004], [0.6016], [0.2002], [0.8008], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00897216796875 loss: 0.01263427734375 loss: 0.0140380859375 loss: 0.01025390625 predicted value: tensor([[0.8047], [0.8750], [0.4316], [0.8047], [0.8359], [1.2031], [1.1016], [0.6602], [0.7656], [0.3945], [0.4941], [0.5078], [0.5820], [0.3125], [0.4258], [0.3340]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.4668], [0.5547], [0.4668], [1.0000], [1.0000], [0.7500], [0.6016], [0.2500], [0.2500], [0.3340], [0.4004], [0.0625], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01092529296875 loss: 0.01080322265625loss: 0.00750732421875 loss: 0.007781982421875 12%|█▏ | 61/492 [32:00<3:43:50, 31.16s/it] {'loss': 0.043, 'learning_rate': 1e-05, 'epoch': 0.12} 12%|█▏ | 61/492 [32:00<3:43:50, 31.16s/it]predicted value: tensor([[0.7383], [0.9844], [0.4512], [0.4316], [0.5000], [0.8320], [1.1719], [0.5742], [0.6797], [0.7461], [1.0703], [0.5352], [0.7070], [0.5625], [0.3750], 
[0.3418]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.2500], [0.2500], [0.3750], [0.7500], [1.0000], [0.3340], [0.5000], [0.6016], [1.0000], [0.7500], [0.6016], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0059814453125 loss: 0.009765625 loss: 0.009033203125 loss: 0.006500244140625 predicted value: tensor([[0.4941], [1.1094], [0.8711], [0.5820], [0.7539], [0.4004], [0.8047], [1.1328], [0.7070], [1.2344], [0.5781], [0.6133], [0.5586], [0.5781], [0.5000], [0.4199]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.8320], [0.2500], [0.3750], [0.2500], [0.6016], [1.0000], [0.6016], [1.0000], [0.5000], [0.3340], [0.5000], [0.4004], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01055908203125 loss: 0.0107421875loss: 0.0068359375 loss: 0.005340576171875 predicted value: tensor([[1.1562], [1.1094], [0.8125], [0.6914], [0.5195], [0.4004], [0.5039], [1.1953], [0.5273], [0.6875], [0.6875], [0.5508], [0.8281], [0.5742], [0.5156], [0.3828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.8008], [0.6016], [0.3340], [0.2002], [0.2002], [1.0000], [0.4277], [0.3340], [0.4668], [0.3340], [0.6016], [0.4004], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0093994140625 loss: 0.00982666015625loss: 0.009521484375 loss: 0.0078125 predicted value: tensor([[0.6719], [0.6211], [0.9844], [1.0703], [0.7148], [0.9102], [0.8633], [1.0469], [0.5469], [0.4570], [0.4316], [0.5312], [0.5898], [0.5508], [0.4902], [0.3711]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [0.4668], [0.8320], [1.0000], [0.4668], [0.8008], [0.8008], [1.0000], [0.6016], [0.3340], [0.0400], [0.4004], [0.4004], [0.1670], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0050048828125 loss: 0.004547119140625 loss: 0.00848388671875 loss: 0.0084228515625 13%|█▎ | 
62/492 [32:31<3:43:00, 31.12s/it] {'loss': 0.0319, 'learning_rate': 1e-05, 'epoch': 0.13} 13%|█▎ | 62/492 [32:31<3:43:00, 31.12s/it]predicted value: tensor([[0.4688], [0.9258], [0.9570], [0.5156], [0.3477], [0.9297], [0.9141], [0.6172], [0.5039], [0.5117], [0.3477], [0.4062], [0.4453], [0.2676], [0.2080], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8555], [1.0000], [0.5547], [0.2500], [1.0000], [1.0000], [0.8008], [0.6016], [0.7500], [0.4004], [0.6016], [0.3340], [0.4004], [0.0400], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018157958984375 loss: 0.0037689208984375 loss: 0.00299072265625 loss: 0.00262451171875 predicted value: tensor([[0.4414], [0.4141], [0.3711], [0.9180], [0.5000], [0.9219], [0.4238], [0.3418], [0.3672], [0.2949], [0.8945], [0.3789], [0.3984], [0.4668], [0.2578], [0.2236]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [1.0000], [0.3750], [1.0000], [0.2002], [0.2500], [0.5000], [0.2500], [1.0000], [0.7500], [0.4004], [0.7500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003631591796875 loss: 0.005767822265625 loss: 0.006378173828125 loss: 0.006683349609375 predicted value: tensor([[0.7539], [0.4668], [0.3984], [0.6328], [0.7266], [0.5430], [0.9375], [0.5977], [0.3926], [0.5938], [0.9180], [0.3730], [0.3828], [0.4941], [0.1768], [0.1611]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6016], [0.4668], [0.5547], [0.8008], [0.6172], [1.0000], [0.6680], [0.4668], [0.6680], [1.0000], [0.6016], [0.5000], [0.7500], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004730224609375 loss: 0.0031890869140625loss: 0.004547119140625 loss: 0.00439453125 predicted value: tensor([[1.0234], [0.4941], [0.7305], [0.3809], [0.3711], [0.6484], [0.4355], [0.2451], [0.3848], [0.4570], [0.4590], [0.7930], [0.2910], [0.2734], [0.4199], [0.1484]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8320], [0.4668], [0.4668], [0.8008], [0.5547], [0.2500], [0.4668], [0.2002], [0.6016], [1.0000], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00171661376953125 loss: 0.0042724609375 loss: 0.0034027099609375 loss: 0.0037384033203125 13%|█▎ | 63/492 [33:02<3:43:15, 31.23s/it] {'loss': 0.0159, 'learning_rate': 1e-05, 'epoch': 0.13} 13%|█▎ | 63/492 [33:02<3:43:15, 31.23s/it]predicted value: tensor([[0.4531], [0.1582], [0.4590], [0.9141], [0.8984], [0.3105], [0.5547], [0.9648], [0.4238], [0.9570], [0.3125], [0.3398], [0.4023], [0.1689], [0.1797], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.4668], [1.0000], [1.0000], [0.4004], [0.7500], [1.0000], [0.6016], [1.0000], [0.3340], [0.2852], [0.3340], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024566650390625 loss: 0.001983642578125loss: 0.003173828125 loss: 0.004180908203125 predicted value: tensor([[0.3867], [0.5195], [0.9258], [0.7188], [0.5430], [0.9141], [0.3535], [0.2393], [0.3945], [0.4785], [0.5312], [0.4082], [0.3887], [0.1875], [0.2793], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [1.0000], [0.8320], [0.5547], [1.0000], [0.2002], [0.2500], [0.6016], [0.5000], [0.6016], [0.2500], [0.4004], [0.2002], [0.2500], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020599365234375 loss: 0.004547119140625 loss: 0.00262451171875 loss: 0.0031280517578125 predicted value: tensor([[0.3594], [0.2930], [0.9219], [0.9531], [0.7383], [0.9297], [0.3125], [0.4707], [0.3516], [0.8555], [0.4980], [0.4258], [0.4062], [0.3613], [0.2324], [0.1855]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2715], [0.3340], [1.0000], [1.0000], [0.6680], [1.0000], [0.4668], [0.6016], [0.3340], [1.0000], [0.7500], [0.3340], [0.4004], [0.3340], [0.2002], 
[0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002655029296875 loss: 0.0025634765625loss: 0.003265380859375 loss: 0.004180908203125 predicted value: tensor([[0.4883], [1.0312], [0.1982], [0.7891], [0.9102], [0.4434], [0.5000], [0.9141], [0.8438], [0.4336], [0.6680], [0.4434], [0.3340], [0.3340], [0.2070], [0.2217]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.2500], [0.8008], [1.0000], [0.7500], [0.7500], [1.0000], [1.0000], [0.4668], [0.8008], [0.6016], [0.4004], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003997802734375loss: 0.00482177734375 loss: 0.003631591796875 loss: 0.0026702880859375 13%|█▎ | 64/492 [33:34<3:42:44, 31.23s/it] {'loss': 0.013, 'learning_rate': 1e-05, 'epoch': 0.13} 13%|█▎ | 64/492 [33:34<3:42:44, 31.23s/it]predicted value: tensor([[0.6953], [0.6172], [0.5742], [0.7070], [0.8750], [0.6328], [0.8555], [0.7227], [1.1953], [1.2734], [0.5664], [1.2188], [0.5000], [0.5352], [0.4160], [0.4414]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.3750], [0.3340], [0.4668], [0.7500], [0.8008], [0.6016], [1.0000], [1.0000], [0.4004], [1.0000], [0.5000], [0.2852], [0.1426], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00567626953125 loss: 0.00750732421875 loss: 0.01165771484375 loss: 0.01068115234375 predicted value: tensor([[1.0391], [0.5781], [0.9805], [0.8242], [1.1406], [1.2656], [1.0391], [0.4531], [1.0859], [0.7539], [0.8477], [0.7578], [0.6445], [0.3809], [0.3027], [0.4160]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.8320], [0.8008], [1.0000], [1.0000], [1.0000], [0.2500], [1.0000], [0.8008], [0.6016], [0.6016], [0.6016], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.008544921875 loss: 0.0118408203125 loss: 0.005523681640625 loss: 0.007568359375 predicted value: tensor([[1.2109], [1.1484], [0.8359], [0.3906], 
[0.9414], [0.6445], [0.4199], [1.2422], [0.9297], [1.2500], [0.4414], [0.4531], [1.1328], [0.5977], [0.3809], [0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.8008], [0.3340], [0.8008], [0.7500], [0.2500], [1.0000], [0.8320], [1.0000], [0.2500], [0.4004], [1.0000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007171630859375 loss: 0.0064697265625loss: 0.00927734375 loss: 0.009033203125 predicted value: tensor([[0.4844], [0.6484], [0.4824], [0.3789], [0.6914], [0.6094], [1.0625], [0.6250], [0.7070], [0.6133], [0.6680], [1.0938], [0.5586], [0.5273], [0.3320], [0.3398]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.2500], [0.2002], [0.4668], [0.3750], [1.0000], [0.4668], [0.3145], [0.3340], [0.8008], [1.0000], [0.4004], [0.3340], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.009765625 loss: 0.009521484375 loss: 0.008056640625 loss: 0.0052490234375 13%|█▎ | 65/492 [34:05<3:42:05, 31.21s/it] {'loss': 0.0334, 'learning_rate': 1e-05, 'epoch': 0.13} 13%|█▎ | 65/492 [34:05<3:42:05, 31.21s/it]predicted value: tensor([[0.5938], [0.9766], [0.8633], [0.5430], [0.4707], [0.5039], [1.0859], [0.4785], [0.5469], [0.7617], [0.6602], [0.4902], [0.4688], [0.3359], [0.3633], [0.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.3750], [0.4668], [0.2500], [0.3340], [1.0000], [0.2500], [0.2002], [0.7500], [0.3750], [0.3340], [0.3340], [0.1670], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.009033203125 loss: 0.01177978515625loss: 0.011962890625 loss: 0.0079345703125 predicted value: tensor([[0.7266], [1.1875], [1.1875], [0.5352], [0.4609], [0.4629], [0.5547], [0.6367], [0.7305], [0.7148], [0.4668], [0.4297], [0.4785], [0.4492], [0.3828], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.4668], 
[0.3340], [0.2500], [0.3750], [0.6016], [0.4668], [0.5000], [0.0400], [0.0400], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.011474609375 loss: 0.0108642578125loss: 0.01092529296875 loss: 0.007080078125 predicted value: tensor([[0.6562], [0.6094], [0.3711], [1.2109], [1.1484], [0.9023], [1.2109], [0.6133], [0.6680], [1.1875], [0.7148], [1.1562], [0.5469], [0.3145], [0.3281], [0.3691]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.3340], [1.0000], [1.0000], [0.8008], [1.0000], [0.3340], [0.4668], [1.0000], [0.6016], [1.0000], [0.4004], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005401611328125 loss: 0.00689697265625 loss: 0.00970458984375 loss: 0.00689697265625 predicted value: tensor([[1.2109], [1.1719], [0.6406], [0.6797], [0.6602], [0.7305], [1.1250], [0.5352], [0.5156], [0.5273], [0.6328], [0.4609], [0.4414], [0.4316], [0.3223], [0.4004]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.7500], [0.4668], [0.6016], [1.0000], [0.3340], [0.2002], [0.2500], [0.5000], [0.4004], [0.2002], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0076904296875 loss: 0.0081787109375loss: 0.005767822265625 loss: 0.01092529296875 13%|█▎ | 66/492 [34:36<3:40:55, 31.12s/it] {'loss': 0.0356, 'learning_rate': 1e-05, 'epoch': 0.13} 13%|█▎ | 66/492 [34:36<3:40:55, 31.12s/it]predicted value: tensor([[0.8320], [0.5625], [0.1553], [0.8711], [0.4082], [0.4805], [0.6445], [0.8516], [0.3203], [0.5781], [0.3926], [0.9023], [0.3359], [0.3633], [0.3105], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.8320], [0.3340], [1.0000], [0.3750], [0.8008], [0.6680], [1.0000], [0.5000], [0.6016], [0.2500], [1.0000], [0.5000], [0.4004], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034637451171875 loss: 0.0059814453125loss: 
0.0037384033203125 loss: 0.004730224609375 predicted value: tensor([[0.4727], [0.3086], [0.3242], [0.6562], [0.6094], [0.5859], [0.9531], [0.4863], [0.3398], [0.3691], [0.5039], [0.6445], [0.3887], [0.5781], [0.1768], [0.1475]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.6680], [0.8008], [0.6016], [1.0000], [0.6016], [0.2500], [0.4004], [0.4668], [0.8008], [0.3340], [0.6016], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00238037109375 loss: 0.0023956298828125 loss: 0.0030364990234375 loss: 0.0032196044921875 predicted value: tensor([[0.2363], [0.3125], [0.4629], [0.7617], [0.4512], [0.5156], [0.3770], [0.5156], [0.4844], [0.9375], [0.2715], [0.3203], [0.2334], [0.2617], [0.1777], [0.2100]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.2500], [0.4668], [0.8008], [0.5547], [0.5000], [0.2002], [0.8008], [0.5000], [1.0000], [0.2002], [0.4004], [0.2500], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030517578125loss: 0.00250244140625 loss: 0.004241943359375 loss: 0.0050048828125 predicted value: tensor([[0.7422], [0.2734], [0.4844], [0.7812], [0.6836], [0.4707], [0.4473], [0.5742], [0.5859], [0.6758], [0.2412], [0.1748], [0.2207], [0.5234], [0.1924], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.4668], [0.8320], [0.6680], [0.7500], [0.6016], [0.6016], [0.8008], [0.8320], [0.2002], [0.2500], [0.5000], [0.7500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004913330078125 loss: 0.0052490234375 loss: 0.005706787109375 loss: 0.005279541015625 14%|█▎ | 67/492 [35:07<3:40:09, 31.08s/it] {'loss': 0.0162, 'learning_rate': 1e-05, 'epoch': 0.14} 14%|█▎ | 67/492 [35:07<3:40:09, 31.08s/it]predicted value: tensor([[0.3281], [0.6523], [0.3945], [0.7578], [0.3203], [0.6602], [0.3594], [0.6250], [0.6133], [0.4473], [0.5156], [0.4219], [0.6367], [0.1934], 
[0.1865], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.4668], [0.8008], [0.2500], [0.8320], [0.3340], [0.4668], [0.4668], [0.6016], [0.6680], [0.4004], [0.8008], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00136566162109375 loss: 0.001373291015625 loss: 0.003082275390625 loss: 0.0023956298828125 predicted value: tensor([[0.2422], [0.6523], [0.6445], [0.7148], [0.5156], [0.6211], [0.7148], [0.3086], [0.6484], [0.4102], [0.5898], [0.3906], [0.1895], [0.1338], [0.1953], [0.1445]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3145], [0.4668], [0.7148], [0.6680], [0.4668], [0.8008], [0.8008], [0.4668], [0.8008], [0.6016], [0.6016], [0.3340], [0.2002], [0.2500], [0.1250], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001953125 loss: 0.0030670166015625loss: 0.00225830078125 loss: 0.0040283203125 predicted value: tensor([[0.4785], [0.7266], [0.7227], [0.9141], [0.1738], [0.8867], [0.4844], [0.5664], [0.9453], [0.2930], [0.3789], [0.3105], [0.1816], [0.2451], [0.1924], [0.1611]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.5547], [1.0000], [0.3340], [1.0000], [0.7500], [0.5000], [1.0000], [0.2500], [0.5000], [0.4004], [0.0278], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00164031982421875 loss: 0.0035858154296875loss: 0.0012969970703125 loss: 0.0033416748046875 predicted value: tensor([[0.4453], [0.4531], [0.5000], [0.5625], [0.9180], [0.4004], [0.9375], [0.9648], [0.6172], [0.5703], [0.1787], [0.4336], [0.5664], [0.1250], [0.3398], [0.1357]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.6680], [0.6016], [1.0000], [0.5000], [1.0000], [1.0000], [0.7500], [0.6016], [0.4004], [0.4004], [0.6016], [0.1670], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001373291015625 loss: 0.0029754638671875 loss: 
0.002685546875 loss: 0.0028076171875 14%|█▍ | 68/492 [35:38<3:39:19, 31.04s/it] {'loss': 0.0098, 'learning_rate': 1e-05, 'epoch': 0.14} 14%|█▍ | 68/492 [35:38<3:39:19, 31.04s/it]predicted value: tensor([[0.6367], [1.1406], [1.0938], [0.7227], [1.1641], [1.1719], [0.6133], [0.6680], [0.6484], [0.6328], [0.5508], [0.6289], [0.4492], [0.3477], [0.3711], [0.4043]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [1.0000], [1.0000], [0.4668], [1.0000], [1.0000], [0.5000], [0.6016], [0.7500], [0.4004], [0.5000], [0.4004], [0.3340], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005218505859375 loss: 0.006072998046875 loss: 0.007720947265625 loss: 0.008544921875 predicted value: tensor([[0.6133], [0.7656], [0.4766], [0.8164], [0.6680], [0.3848], [0.7422], [0.6797], [0.7109], [0.8945], [1.0938], [0.5195], [1.1562], [0.3711], [0.3105], [0.3477]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.7148], [0.3340], [0.8008], [0.3750], [0.3340], [0.7500], [0.6016], [0.6016], [0.7500], [1.0000], [0.4004], [1.0000], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007568359375 loss: 0.0089111328125 loss: 0.0042724609375 loss: 0.00439453125 predicted value: tensor([[0.6797], [0.7500], [0.9609], [0.7617], [1.1094], [0.8164], [0.6016], [1.1719], [0.6953], [0.6875], [0.6211], [0.7422], [0.4609], [0.2676], [0.3496], [0.3086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.6680], [0.8008], [0.3750], [1.0000], [0.5547], [0.3340], [1.0000], [0.6016], [0.5000], [0.3340], [0.8008], [0.2500], [0.2500], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00885009765625 loss: 0.00994873046875 loss: 0.00787353515625 loss: 0.0087890625 predicted value: tensor([[0.3477], [1.0938], [0.4531], [1.0859], [0.7383], [0.6836], [0.5430], [0.7461], [0.5078], [0.8281], [0.6133], [0.4902], [0.2949], [0.5664], [0.3906], [0.3516]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [1.0000], [0.4668], [1.0000], [0.6016], [0.6016], [0.6016], [0.6016], [0.1670], [0.6016], [0.4004], [0.5000], [0.2500], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00970458984375 loss: 0.005126953125 loss: 0.005645751953125 loss: 0.00665283203125 14%|█▍ | 69/492 [36:09<3:39:10, 31.09s/it] {'loss': 0.0288, 'learning_rate': 1e-05, 'epoch': 0.14} 14%|█▍ | 69/492 [36:09<3:39:10, 31.09s/it]predicted value: tensor([[0.6562], [0.4375], [0.9023], [1.0625], [0.6836], [0.7383], [0.6602], [0.6992], [0.9766], [0.7461], [0.7383], [0.4980], [1.1641], [0.5625], [0.3457], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.3340], [0.7148], [1.0000], [0.3340], [0.6680], [0.5547], [0.8008], [1.0000], [0.5000], [0.7500], [0.2500], [1.0000], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007171630859375 loss: 0.0079345703125 loss: 0.006317138671875 loss: 0.01348876953125 predicted value: tensor([[1.1250], [0.8984], [0.8086], [0.7969], [0.8320], [0.4883], [0.8359], [0.6523], [0.8438], [1.1016], [0.6562], [0.4414], [0.4727], [0.5898], [0.5156], [0.3652]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.4668], [0.8008], [0.8008], [0.3750], [0.6016], [0.4668], [0.7500], [1.0000], [0.6016], [0.4004], [0.3340], [0.4004], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00531005859375 loss: 0.00537109375 loss: 0.010986328125 loss: 0.005584716796875 predicted value: tensor([[1.1406], [0.5547], [0.6562], [0.6797], [0.6172], [0.8203], [1.1641], [0.8086], [1.1250], [0.4961], [0.6172], [0.6172], [0.5586], [0.5078], [0.5352], [0.3730]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.3750], [0.4668], [0.4668], [0.8008], [1.0000], [0.5547], [1.0000], [0.3340], [0.4004], [0.6016], [0.5000], [0.4004], 
[0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006134033203125 loss: 0.0081787109375 loss: 0.006683349609375 loss: 0.01226806640625 predicted value: tensor([[0.7422], [0.8945], [0.8867], [0.9258], [0.8398], [0.9609], [0.6172], [0.7266], [0.7227], [1.1016], [0.6562], [0.6523], [0.6719], [0.4941], [0.5508], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5703], [0.8320], [0.7500], [0.5547], [0.6680], [0.8320], [0.3750], [0.7500], [0.8008], [1.0000], [0.2500], [0.5000], [0.6016], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0087890625 loss: 0.0054931640625 loss: 0.00836181640625 loss: 0.01165771484375 14%|█▍ | 70/492 [36:40<3:39:05, 31.15s/it] {'loss': 0.0324, 'learning_rate': 1e-05, 'epoch': 0.14} 14%|█▍ | 70/492 [36:40<3:39:05, 31.15s/it]predicted value: tensor([[0.5703], [0.2139], [0.3301], [0.9805], [0.9609], [0.8789], [0.9453], [0.6445], [0.1318], [0.4727], [0.4062], [0.8828], [0.8633], [0.1875], [0.3945], [0.1514]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.2500], [0.3750], [1.0000], [1.0000], [1.0000], [1.0000], [0.8320], [0.2500], [0.6016], [0.4004], [1.0000], [1.0000], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003326416015625 loss: 0.0028228759765625 loss: 0.0020751953125 loss: 0.001861572265625 predicted value: tensor([[0.6562], [0.3477], [0.4297], [0.3945], [0.5859], [0.6602], [0.5156], [0.2168], [0.9727], [0.4102], [0.4922], [0.4297], [0.3008], [0.1709], [0.1167], [0.0703]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [0.4668], [0.4668], [0.8008], [0.8008], [0.5000], [0.2500], [1.0000], [0.2500], [0.7500], [0.6016], [0.5000], [0.0400], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030364990234375 loss: 0.0048828125loss: 0.0029449462890625 loss: 0.002349853515625 predicted value: tensor([[0.9492], [0.8555], [0.7969], 
[0.6133], [0.5938], [1.0000], [0.6367], [0.2188], [0.5430], [0.6172], [0.5000], [0.4473], [0.3711], [0.4219], [0.1387], [0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.8320], [0.5547], [0.7500], [1.0000], [0.8008], [0.2500], [0.4668], [0.4668], [0.5547], [0.3340], [0.4004], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0031280517578125 loss: 0.00180816650390625loss: 0.0034942626953125 loss: 0.0038299560546875 predicted value: tensor([[0.4141], [0.3047], [0.3945], [0.3301], [0.3262], [0.9023], [1.0312], [0.6484], [0.4609], [0.9844], [0.2451], [0.6328], [0.4004], [0.4160], [0.1001], [0.1631]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.4668], [0.3750], [0.4668], [1.0000], [1.0000], [0.8008], [0.3750], [1.0000], [0.3340], [0.6016], [0.4668], [0.7500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00408935546875 loss: 0.0036773681640625 loss: 0.0032806396484375loss: 0.0030059814453125 14%|█▍ | 71/492 [37:11<3:39:08, 31.23s/it] {'loss': 0.0124, 'learning_rate': 1e-05, 'epoch': 0.14} 14%|█▍ | 71/492 [37:11<3:39:08, 31.23s/it]predicted value: tensor([[0.6016], [0.2930], [0.4941], [0.5625], [0.2754], [0.1699], [0.5742], [0.2500], [0.5156], [0.2988], [0.4902], [0.2637], [0.3613], [0.4141], [0.1768], [0.1631]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.5547], [0.3340], [0.2500], [0.5000], [0.2500], [0.6016], [0.3750], [0.6016], [0.4004], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000820159912109375 loss: 0.0016021728515625 loss: 0.0020904541015625 loss: 0.003173828125 predicted value: tensor([[0.4141], [0.7695], [0.5195], [0.9258], [0.4941], [0.3086], [0.4062], [0.4023], [0.3887], [0.2441], [0.2314], [0.9375], [0.4199], [0.4238], [0.1221], [0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.4668], [0.8008], [0.5547], [1.0000], [0.6016], [0.3340], [0.4668], [0.2500], [0.5000], [0.2500], [0.2500], [1.0000], [0.6016], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0031585693359375 loss: 0.0016021728515625loss: 0.00238037109375 loss: 0.004852294921875 predicted value: tensor([[0.6289], [0.2500], [0.9258], [0.6328], [0.4023], [0.6172], [0.6406], [0.3887], [0.9297], [0.4375], [0.4961], [0.3008], [0.2734], [0.3848], [0.1562], [0.1201]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [1.0000], [0.8320], [0.3340], [0.8008], [0.7500], [0.2500], [1.0000], [0.5000], [0.5000], [0.4004], [0.3340], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002777099609375 loss: 0.002685546875loss: 0.0027618408203125 loss: 0.00518798828125 predicted value: tensor([[0.6914], [0.4883], [0.9609], [0.3828], [0.4707], [0.9609], [0.5117], [0.9609], [0.3574], [0.5000], [0.2891], [0.5664], [0.1748], [0.1201], [0.1484], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.5547], [1.0000], [0.3750], [0.4668], [1.0000], [0.8008], [1.0000], [0.7500], [0.6016], [0.3340], [0.6016], [0.2002], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002349853515625 loss: 0.004058837890625 loss: 0.004150390625 loss: 0.00469970703125 15%|█▍ | 72/492 [37:43<3:38:28, 31.21s/it] {'loss': 0.0121, 'learning_rate': 1e-05, 'epoch': 0.15} 15%|█▍ | 72/492 [37:43<3:38:28, 31.21s/it]predicted value: tensor([[0.6875], [0.5078], [0.3516], [1.1719], [0.6641], [0.7852], [0.7461], [0.6367], [0.6719], [0.6602], [0.5352], [0.6289], [0.4199], [0.2344], [0.2852], [0.3301]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.2500], [0.3340], [1.0000], [0.3750], [0.6016], [0.7500], [0.3750], [0.4668], [0.6016], [0.5000], [0.3340], [0.4004], [0.2002], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) 
loss: 0.004638671875 loss: 0.007171630859375 loss: 0.0084228515625 loss: 0.003936767578125 predicted value: tensor([[0.7930], [1.1562], [0.7578], [0.8867], [1.1406], [0.7383], [1.2031], [0.5742], [0.6602], [0.6367], [0.7422], [0.6836], [0.5977], [0.4922], [0.3945], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [0.5547], [0.8008], [1.0000], [0.4668], [1.0000], [0.4668], [0.6016], [0.6016], [0.6680], [0.6016], [0.3340], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007415771484375 loss: 0.0054931640625loss: 0.00933837890625 loss: 0.004638671875 predicted value: tensor([[0.5859], [0.7305], [1.0938], [0.7500], [0.7578], [0.5586], [0.3477], [0.7109], [1.1875], [0.6172], [0.9805], [0.6719], [0.4395], [0.2812], [0.3379], [0.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [1.0000], [0.3750], [0.6680], [0.2500], [0.2002], [0.5547], [1.0000], [0.5000], [0.8008], [0.3340], [0.2002], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00933837890625 loss: 0.00970458984375loss: 0.00848388671875 loss: 0.007080078125 predicted value: tensor([[0.3652], [0.6172], [0.9219], [0.6836], [0.7539], [1.1719], [1.2500], [0.6836], [0.8164], [0.8164], [0.5820], [0.5000], [0.6797], [0.7695], [0.3027], [0.3477]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.4668], [0.8008], [0.4668], [0.8008], [1.0000], [1.0000], [0.7500], [0.8008], [0.8008], [0.5000], [0.5000], [0.5000], [0.7500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.008544921875 loss: 0.0084228515625 loss: 0.00421142578125 loss: 0.007781982421875 15%|█▍ | 73/492 [38:14<3:37:47, 31.19s/it] {'loss': 0.0287, 'learning_rate': 1e-05, 'epoch': 0.15} 15%|█▍ | 73/492 [38:14<3:37:47, 31.19s/it]predicted value: tensor([[0.8008], [0.8047], [1.1250], [0.6445], [0.7891], [1.2109], [0.7227], [1.1953], [0.5234], [0.3555], 
[0.3926], [0.5664], [0.5664], [0.2695], [0.5547], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [1.0000], [0.4668], [0.8008], [1.0000], [0.4668], [1.0000], [0.6016], [0.2500], [0.2500], [0.6016], [0.5000], [0.2002], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005218505859375 loss: 0.004669189453125loss: 0.004638671875 loss: 0.005279541015625 predicted value: tensor([[0.6523], [1.0547], [0.6914], [0.8203], [0.9062], [0.5352], [0.6484], [0.4902], [0.6133], [1.2109], [0.6875], [0.6484], [0.5938], [0.2910], [0.3555], [0.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.5547], [0.6680], [0.8008], [0.3750], [0.4668], [0.2500], [0.5000], [1.0000], [0.6016], [0.5000], [0.5000], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006988525390625 loss: 0.005279541015625loss: 0.006591796875 loss: 0.004302978515625 predicted value: tensor([[0.5820], [0.7422], [0.6133], [0.8633], [0.4707], [0.7461], [0.8008], [0.5273], [0.5977], [0.5625], [0.4102], [0.5195], [0.5234], [0.4922], [0.2285], [0.3184]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.5547], [0.6680], [0.3340], [0.7148], [0.6016], [0.6016], [0.4668], [0.4004], [0.2002], [0.5000], [0.4004], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006103515625 loss: 0.005126953125 loss: 0.006103515625 loss: 0.004119873046875 predicted value: tensor([[0.4922], [1.1484], [0.4980], [1.2109], [0.7188], [0.5117], [0.6289], [0.6719], [0.5234], [0.7031], [0.6172], [0.6133], [0.4062], [0.2910], [0.3613], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [1.0000], [0.3750], [1.0000], [0.6680], [0.2500], [0.6016], [0.6016], [0.3750], [0.4668], [0.6016], [0.3340], [0.3340], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006011962890625 loss: 
0.006500244140625 loss: 0.006622314453125loss: 0.0052490234375 15%|█▌ | 74/492 [38:45<3:36:53, 31.13s/it] {'loss': 0.0222, 'learning_rate': 1e-05, 'epoch': 0.15} 15%|█▌ | 74/492 [38:45<3:36:53, 31.13s/it]predicted value: tensor([[0.9805], [0.5078], [0.6484], [0.5664], [0.3301], [0.3789], [0.4668], [0.5820], [0.5781], [0.4238], [0.3574], [0.3594], [0.8828], [0.1562], [0.1172], [0.1631]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.7148], [0.8008], [0.3340], [0.3750], [0.6016], [0.7500], [0.4668], [0.5000], [0.4668], [0.3340], [1.0000], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004180908203125 loss: 0.0024566650390625loss: 0.004913330078125 loss: 0.0024871826171875 predicted value: tensor([[1.0625], [0.4258], [0.4844], [0.9648], [0.2188], [0.5156], [0.6875], [0.9062], [0.3887], [0.3379], [0.3633], [0.2266], [0.1011], [0.2285], [0.1250], [0.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3145], [0.3750], [1.0000], [0.2500], [0.3750], [0.8008], [1.0000], [0.4668], [0.3340], [0.5000], [0.3340], [0.2500], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028076171875 loss: 0.0028228759765625 loss: 0.004791259765625 loss: 0.00122833251953125 predicted value: tensor([[0.7070], [0.5508], [0.3223], [0.5430], [0.1846], [0.4434], [0.4727], [0.3652], [0.9336], [0.9727], [0.3223], [1.0078], [0.2461], [0.2871], [0.0996], [0.1611]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.4668], [0.4668], [0.2002], [0.2002], [0.4668], [0.3750], [1.0000], [1.0000], [0.2002], [1.0000], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003143310546875 loss: 0.002655029296875loss: 0.0033416748046875 loss: 0.004302978515625 predicted value: tensor([[0.9688], [0.5156], [0.2158], [0.2070], [0.5234], [0.2891], [0.6562], [0.5664], [0.6289], [0.4648], [0.3477], 
[0.4395], [0.1943], [0.1240], [0.0786], [0.1348]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.2500], [0.4668], [0.5547], [0.4668], [0.7148], [0.6680], [0.8008], [0.6016], [0.4004], [0.5000], [0.3340], [0.0278], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003021240234375 loss: 0.005584716796875 loss: 0.0037078857421875 loss: 0.0037994384765625 15%|█▌ | 75/492 [39:16<3:36:01, 31.08s/it] {'loss': 0.0138, 'learning_rate': 1e-05, 'epoch': 0.15} 15%|█▌ | 75/492 [39:16<3:36:01, 31.08s/it]predicted value: tensor([[0.5156], [0.5117], [0.2197], [0.2832], [0.5625], [0.2676], [0.4648], [0.5469], [0.6211], [0.5000], [0.3184], [0.3809], [0.3984], [0.2617], [0.1406], [0.1348]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.5547], [0.4668], [0.3750], [0.6680], [0.3340], [0.7500], [0.5000], [0.6680], [0.3750], [0.2500], [0.4004], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005950927734375 loss: 0.003814697265625loss: 0.0037384033203125 loss: 0.003021240234375 predicted value: tensor([[0.4141], [0.3945], [0.5234], [0.4395], [0.5664], [0.5938], [0.6016], [0.5742], [0.3984], [0.5469], [0.4570], [0.3945], [0.3379], [0.3887], [0.1582], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.6016], [0.4668], [0.5703], [0.7500], [0.8320], [0.4668], [0.4668], [0.3750], [0.6016], [0.4004], [0.4004], [0.3340], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024871826171875 loss: 0.0030364990234375 loss: 0.004150390625 loss: 0.0042724609375 predicted value: tensor([[0.4258], [0.8906], [0.6055], [0.4785], [0.4453], [0.9844], [0.4238], [0.4453], [0.9570], [0.2148], [0.3926], [0.3027], [0.3203], [0.3477], [0.1758], [0.1553]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.3750], [0.6016], [1.0000], [0.4668], [0.6016], 
[1.0000], [0.2002], [0.4004], [0.3340], [0.0400], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037384033203125 loss: 0.00347900390625 loss: 0.00750732421875 loss: 0.0026397705078125 predicted value: tensor([[0.3262], [0.4512], [1.0078], [0.9297], [0.4688], [0.2305], [0.6133], [0.3906], [0.6211], [0.8945], [0.4453], [0.5391], [0.8984], [0.3438], [0.3574], [0.1279]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [1.0000], [1.0000], [0.4668], [0.2002], [0.6680], [0.3750], [0.4668], [1.0000], [0.5000], [0.6016], [1.0000], [0.3340], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004547119140625 loss: 0.00165557861328125 loss: 0.00183868408203125 loss: 0.00323486328125 15%|█▌ | 76/492 [39:48<3:37:22, 31.35s/it] {'loss': 0.0148, 'learning_rate': 1e-05, 'epoch': 0.15} 15%|█▌ | 76/492 [39:48<3:37:22, 31.35s/it]predicted value: tensor([[0.9297], [0.8789], [0.7969], [0.8164], [0.6523], [0.7070], [0.6094], [0.9102], [0.6367], [0.5195], [0.5234], [0.5625], [0.6680], [0.7344], [0.3184], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.8320], [0.6680], [0.8008], [0.3750], [0.5000], [0.4668], [0.8008], [0.6016], [0.2500], [0.4004], [0.4004], [0.6016], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005401611328125 loss: 0.00518798828125loss: 0.005218505859375 loss: 0.009521484375 predicted value: tensor([[0.6289], [0.5117], [0.8828], [0.7422], [1.1016], [0.5977], [0.7539], [0.6172], [0.7422], [0.6484], [0.5117], [0.5000], [0.5664], [0.5234], [0.3418], [0.3535]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8008], [0.6016], [1.0000], [0.3750], [0.7500], [0.6016], [0.6680], [0.8008], [0.2002], [0.2500], [0.6016], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003265380859375 loss: 0.005523681640625loss: 0.01007080078125 loss: 
0.00567626953125 predicted value: tensor([[0.4336], [0.8477], [0.4473], [0.6406], [0.7344], [1.1562], [0.8477], [0.5703], [0.7188], [0.4961], [0.7109], [0.6602], [0.4492], [0.4844], [0.5547], [0.3789]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.4668], [0.4668], [0.8008], [1.0000], [0.8008], [0.7500], [0.5703], [0.3340], [0.5000], [0.4668], [0.2500], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006195068359375 loss: 0.005157470703125loss: 0.005584716796875 loss: 0.004913330078125 predicted value: tensor([[0.8203], [0.5586], [0.5469], [0.6055], [1.0938], [0.9648], [0.4727], [0.4785], [0.6211], [0.8945], [0.7734], [0.5586], [0.5703], [0.3340], [0.4297], [0.3438]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.3750], [0.4668], [1.0000], [0.8008], [0.2500], [0.2500], [0.4668], [0.8008], [0.7500], [0.5000], [0.3340], [0.2500], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00335693359375 loss: 0.003326416015625 loss: 0.0064697265625 loss: 0.0050048828125 16%|█▌ | 77/492 [40:19<3:36:49, 31.35s/it] {'loss': 0.0225, 'learning_rate': 1e-05, 'epoch': 0.16} 16%|█▌ | 77/492 [40:19<3:36:49, 31.35s/it]predicted value: tensor([[0.6367], [0.3398], [0.5859], [0.6211], [0.5352], [0.7383], [0.6953], [1.0312], [0.9961], [0.4980], [1.1016], [0.3965], [0.5391], [0.4863], [0.3262], [0.3477]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6094], [0.2002], [0.4668], [0.5547], [0.3750], [0.7500], [0.6016], [1.0000], [1.0000], [0.5000], [1.0000], [0.4004], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0059814453125 loss: 0.005523681640625 loss: 0.0034942626953125 loss: 0.0069580078125 predicted value: tensor([[0.5039], [0.5977], [1.1484], [0.4727], [0.6484], [0.6953], [0.5078], [0.6445], [0.6484], [0.5117], [0.5703], [0.6562], [0.5859], [0.2832], [0.3145], [0.2734]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.3750], [1.0000], [0.4668], [0.3750], [0.6680], [0.4668], [0.6680], [0.6016], [0.4004], [0.6016], [0.6016], [0.6016], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0068359375 loss: 0.004058837890625loss: 0.0062255859375 loss: 0.002532958984375 predicted value: tensor([[0.5742], [0.4727], [0.4922], [1.0859], [0.7188], [0.5586], [0.8555], [1.0625], [0.6289], [1.0938], [0.5586], [0.5156], [0.6250], [0.4121], [0.3438], [0.3301]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [1.0000], [0.8008], [0.4668], [0.5547], [1.0000], [0.4004], [1.0000], [0.2852], [0.3340], [0.7500], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003997802734375 loss: 0.005828857421875loss: 0.004486083984375 loss: 0.003692626953125 predicted value: tensor([[0.5273], [0.5703], [0.5586], [0.5195], [0.5898], [0.7461], [0.5430], [1.0625], [0.6992], [0.3809], [0.3828], [0.5781], [0.3340], [0.2812], [0.3066], [0.3770]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3145], [0.4668], [0.3750], [0.5000], [0.5547], [0.3750], [1.0000], [0.6016], [0.2002], [0.2002], [0.4004], [0.0400], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00396728515625 loss: 0.0067138671875 loss: 0.007049560546875 loss: 0.0036163330078125 16%|█▌ | 78/492 [40:51<3:36:37, 31.40s/it] {'loss': 0.0202, 'learning_rate': 1e-05, 'epoch': 0.16} 16%|█▌ | 78/492 [40:51<3:36:37, 31.40s/it]predicted value: tensor([[0.3457], [0.3516], [0.4102], [0.8789], [0.5586], [0.6172], [0.3965], [0.5312], [0.3457], [0.4238], [0.3867], [0.3457], [0.2949], [0.1475], [0.4043], [0.1650]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.5547], [1.0000], [0.8008], [0.8008], [0.4668], [0.6680], [0.2500], [0.6016], [0.3340], [0.6016], [0.4004], [0.1670], 
[0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030059814453125 loss: 0.004852294921875 loss: 0.00311279296875 loss: 0.0023040771484375 predicted value: tensor([[0.3574], [0.6758], [0.6289], [0.4082], [0.3145], [0.2637], [0.5820], [0.4902], [0.5430], [0.8750], [0.9102], [0.4160], [0.4395], [0.8633], [0.1426], [0.1025]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.8320], [0.3750], [0.4668], [0.2500], [0.6680], [0.4668], [0.6016], [1.0000], [1.0000], [0.5000], [0.5000], [1.0000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003204345703125 loss: 0.0027923583984375loss: 0.00518798828125 loss: 0.004180908203125 predicted value: tensor([[0.8711], [0.4570], [0.1436], [0.5078], [0.7305], [0.9570], [0.3594], [0.8516], [0.9219], [0.3750], [0.3066], [0.4434], [0.8516], [0.2617], [0.1050], [0.1348]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [0.2500], [0.8320], [0.8008], [1.0000], [0.6016], [1.0000], [1.0000], [0.3340], [0.4004], [0.6016], [1.0000], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023345947265625 loss: 0.005706787109375loss: 0.004119873046875 loss: 0.004791259765625 predicted value: tensor([[0.7422], [0.4160], [0.9062], [0.9805], [0.6289], [0.6055], [0.5430], [0.4180], [0.1680], [0.6289], [0.4590], [0.2871], [0.1572], [0.1270], [0.1836], [0.1523]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [1.0000], [1.0000], [0.8008], [0.8008], [0.5000], [0.3750], [0.2002], [0.6680], [0.5000], [0.5000], [0.0400], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029144287109375 loss: 0.003387451171875 loss: 0.0064697265625 loss: 0.0024871826171875 16%|█▌ | 79/492 [41:22<3:35:13, 31.27s/it] {'loss': 0.0152, 'learning_rate': 1e-05, 'epoch': 0.16} 16%|█▌ | 79/492 [41:22<3:35:13, 31.27s/it]predicted value: tensor([[0.7188], [0.7695], 
[0.8438], [0.4688], [0.6250], [0.4238], [0.5078], [0.3984], [0.3242], [0.4336], [0.4004], [0.4434], [0.4219], [0.1387], [0.1611], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6680], [1.0000], [0.6016], [0.8008], [0.4668], [0.6016], [0.3750], [0.6016], [0.6016], [0.4004], [0.4004], [0.2852], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0038909912109375 loss: 0.006927490234375loss: 0.00372314453125 loss: 0.0027923583984375 predicted value: tensor([[0.6094], [0.5039], [0.5859], [0.5156], [0.4395], [0.5547], [0.5273], [0.3125], [0.3301], [0.5469], [0.3125], [0.3574], [0.3359], [0.3613], [0.1216], [0.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.4668], [0.8008], [0.7500], [0.4668], [0.6680], [0.7500], [0.3340], [0.4004], [0.4668], [0.2002], [0.4004], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00341796875 loss: 0.00457763671875loss: 0.002227783203125 loss: 0.00250244140625 predicted value: tensor([[0.6719], [0.8242], [0.2432], [0.6172], [0.4844], [0.9336], [0.4180], [0.9648], [0.4922], [0.4922], [0.8789], [0.3457], [0.4238], [0.1689], [0.3086], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.3340], [0.8008], [0.5547], [1.0000], [0.2500], [1.0000], [0.3750], [0.6016], [1.0000], [0.4004], [0.4004], [0.2002], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00421142578125 loss: 0.0028533935546875 loss: 0.0031585693359375 loss: 0.0035400390625 predicted value: tensor([[0.6719], [0.7070], [0.8750], [0.9336], [0.4824], [0.2275], [0.9219], [0.4238], [0.4766], [0.4336], [0.0332], [0.5547], [0.4258], [0.1768], [0.1748], [0.1660]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [1.0000], [1.0000], [0.3340], [0.2500], [1.0000], [0.6016], [0.6016], [0.5547], [0.0400], [0.5000], [0.6016], [0.1670], 
[0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00604248046875 loss: 0.0029144287109375 loss: 0.0038909912109375 loss: 0.0064697265625 16%|█▋ | 80/492 [41:53<3:34:47, 31.28s/it] {'loss': 0.0158, 'learning_rate': 1e-05, 'epoch': 0.16} 16%|█▋ | 80/492 [41:53<3:34:47, 31.28s/it]predicted value: tensor([[0.8398], [0.7656], [0.5625], [0.9141], [0.4746], [1.0781], [0.6133], [0.7188], [1.1641], [0.6406], [0.6445], [0.5195], [0.5156], [0.4648], [0.3340], [0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.3750], [0.8008], [0.4668], [1.0000], [0.6016], [0.6680], [1.0000], [0.6016], [0.5000], [0.4004], [0.3750], [0.5000], [0.1426], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005767822265625 loss: 0.0035247802734375 loss: 0.00616455078125 loss: 0.01055908203125 predicted value: tensor([[0.5859], [0.5859], [0.5391], [0.8516], [0.4570], [0.7539], [0.6055], [0.4668], [0.8242], [0.4355], [0.4473], [1.0859], [0.7266], [0.5625], [0.3359], [0.3457]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.7500], [0.8320], [0.3340], [0.8008], [0.3750], [0.2500], [0.7500], [0.2002], [0.3340], [1.0000], [0.8008], [0.7500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029449462890625 loss: 0.0054931640625 loss: 0.004791259765625 loss: 0.006256103515625 predicted value: tensor([[0.7383], [0.5898], [0.4668], [1.0703], [0.6094], [0.6133], [1.1094], [1.1016], [0.6211], [0.4375], [0.5898], [0.4355], [0.5859], [0.5039], [0.3594], [0.3027]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6250], [0.4668], [0.3340], [1.0000], [0.4668], [0.3750], [1.0000], [1.0000], [0.6016], [0.2002], [0.5000], [0.2500], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004638671875 loss: 0.005279541015625loss: 0.003875732421875 loss: 0.007171630859375 predicted value: tensor([[0.5547], [0.5898], 
[0.7539], [1.0234], [1.1641], [0.4258], [0.5547], [0.7383], [0.7188], [0.6016], [0.5469], [0.3652], [0.6094], [0.5117], [0.3164], [0.5039]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.2500], [0.8008], [1.0000], [1.0000], [0.2500], [0.2500], [0.8008], [0.7500], [0.6016], [0.3340], [0.2500], [0.2500], [0.4004], [0.1670], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00689697265625 loss: 0.0084228515625loss: 0.0052490234375 loss: 0.004669189453125 16%|█▋ | 81/492 [42:24<3:34:33, 31.32s/it] {'loss': 0.0229, 'learning_rate': 1e-05, 'epoch': 0.16} 16%|█▋ | 81/492 [42:24<3:34:33, 31.32s/it]predicted value: tensor([[0.5898], [0.5586], [0.4648], [0.7969], [0.7070], [0.3398], [0.6836], [0.5234], [0.5195], [0.6523], [0.5078], [0.5742], [0.3047], [0.5508], [0.3594], [0.3242]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.8320], [0.8008], [0.2002], [0.4668], [0.4668], [0.7500], [0.6680], [0.6016], [0.5000], [0.2500], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033721923828125 loss: 0.0035552978515625 loss: 0.00885009765625 loss: 0.00225830078125 predicted value: tensor([[0.5273], [0.6445], [0.4883], [0.4609], [0.7734], [0.3027], [0.5781], [0.5977], [0.5742], [0.4375], [0.4609], [0.5312], [0.4609], [0.5117], [0.2949], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.3750], [0.5547], [0.8320], [0.2500], [0.6016], [0.6680], [0.7500], [0.5000], [0.6016], [0.6016], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024871826171875 loss: 0.003997802734375 loss: 0.00360107421875 loss: 0.004150390625 predicted value: tensor([[0.7617], [0.5391], [1.0781], [0.5352], [1.0938], [0.5391], [0.4434], [0.3770], [0.6836], [0.5703], [0.3730], [0.4277], [0.4766], [0.5000], [0.2520], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.8008], [0.3750], [1.0000], [0.3750], [1.0000], [0.2500], [0.2500], [0.2002], [0.4668], [0.4668], [0.0400], [0.3340], [0.3340], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030364990234375 loss: 0.006744384765625loss: 0.0038604736328125 loss: 0.004180908203125 predicted value: tensor([[1.1172], [0.7188], [0.7031], [1.1172], [0.6250], [1.1250], [0.5430], [1.1719], [0.6836], [0.5469], [1.0625], [0.5000], [0.4961], [0.4238], [0.2812], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.5547], [1.0000], [0.3750], [1.0000], [0.2002], [1.0000], [0.7500], [0.2500], [1.0000], [0.3340], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0035858154296875 loss: 0.007415771484375 loss: 0.00384521484375 loss: 0.0050048828125 17%|█▋ | 82/492 [42:55<3:33:42, 31.27s/it] {'loss': 0.0175, 'learning_rate': 1e-05, 'epoch': 0.17} 17%|█▋ | 82/492 [42:55<3:33:42, 31.27s/it]predicted value: tensor([[0.6602], [0.2520], [0.3242], [0.1787], [0.9297], [0.4004], [0.4922], [0.6602], [0.9453], [0.8242], [0.3125], [0.4316], [0.1211], [0.1182], [0.1377], [0.0830]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.2500], [0.4668], [0.3340], [1.0000], [0.5000], [0.2002], [0.8320], [1.0000], [1.0000], [0.4004], [0.7500], [0.1670], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037994384765625 loss: 0.00592041015625 loss: 0.003814697265625 loss: 0.004852294921875 predicted value: tensor([[0.4609], [0.9336], [0.7539], [0.5000], [0.4961], [0.6758], [0.2773], [0.3418], [0.8906], [0.5273], [0.3008], [0.3535], [0.1416], [0.1035], [0.2539], [0.0859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8320], [0.6680], [0.4668], [0.6680], [0.2500], [0.3145], [1.0000], [0.6680], [0.3340], [0.3340], [0.2500], [0.2002], [0.4004], [0.1670]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.0012969970703125 loss: 0.0020599365234375loss: 0.003997802734375 loss: 0.00390625 predicted value: tensor([[0.1680], [0.3809], [0.2314], [0.2676], [0.5664], [0.2578], [0.5469], [0.6094], [0.4004], [0.9570], [0.3672], [0.3164], [0.3086], [0.1738], [0.1201], [0.1191]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.4668], [0.3340], [0.3340], [0.8008], [0.3340], [0.7500], [0.8320], [0.6016], [1.0000], [0.5000], [0.3340], [0.3340], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005279541015625 loss: 0.00439453125 loss: 0.00439453125 loss: 0.0030670166015625 predicted value: tensor([[0.5938], [0.5000], [0.2119], [0.3027], [0.2793], [1.0547], [0.3340], [0.3438], [0.3574], [0.4375], [0.2461], [0.2539], [0.4004], [0.3184], [0.1074], [0.1768]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.2500], [0.4668], [0.3750], [1.0000], [0.2500], [0.3340], [0.3750], [0.5000], [0.2500], [0.5000], [0.6016], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0036163330078125 loss: 0.002532958984375 loss: 0.0020751953125 loss: 0.003570556640625 17%|█▋ | 83/492 [43:27<3:33:04, 31.26s/it] {'loss': 0.0146, 'learning_rate': 1e-05, 'epoch': 0.17} 17%|█▋ | 83/492 [43:27<3:33:04, 31.26s/it]predicted value: tensor([[0.4824], [0.4160], [0.1738], [0.9141], [0.3828], [0.5195], [0.4277], [0.4863], [0.4336], [0.4551], [0.5508], [0.4434], [0.3574], [0.2754], [0.0972], [0.0981]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.4668], [0.2002], [1.0000], [0.2500], [0.7500], [0.6016], [0.6016], [0.6016], [0.5000], [0.7500], [0.5000], [0.3340], [0.5000], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004425048828125 loss: 0.0048828125 loss: 0.004180908203125 loss: 0.0030517578125 predicted value: tensor([[0.4277], [0.6133], [0.7617], [0.4180], [0.2021], [0.4668], [0.5391], [0.4062], 
[0.3730], [0.5703], [0.3145], [0.3262], [0.3555], [0.7930], [0.1162], [0.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.8008], [0.3340], [0.2500], [0.4668], [0.5703], [0.3750], [0.5000], [0.8008], [0.7500], [0.4004], [0.5000], [1.0000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029754638671875 loss: 0.006072998046875 loss: 0.003662109375 loss: 0.0038299560546875 predicted value: tensor([[0.8008], [0.4297], [0.7344], [0.6953], [0.2109], [0.3203], [0.6289], [0.6328], [0.2949], [0.9961], [0.3281], [0.4746], [0.5742], [0.1787], [0.1377], [0.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.8320], [0.8008], [0.2500], [0.2500], [0.6016], [0.6016], [0.2500], [1.0000], [0.3340], [0.6016], [0.6016], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006072998046875 loss: 0.000926971435546875 loss: 0.003509521484375 loss: 0.0024566650390625 predicted value: tensor([[0.6484], [0.2148], [0.3379], [0.9609], [0.3047], [0.8867], [0.5391], [0.3672], [0.2656], [0.2871], [0.4004], [0.3047], [0.2891], [0.4082], [0.1621], [0.1011]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.2500], [0.4668], [1.0000], [0.3340], [1.0000], [0.6016], [0.3340], [0.2500], [0.2500], [0.5000], [0.2500], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00360107421875 loss: 0.00168609619140625 loss: 0.004669189453125 loss: 0.00592041015625 17%|█▋ | 84/492 [43:58<3:32:11, 31.20s/it] {'loss': 0.0155, 'learning_rate': 1e-05, 'epoch': 0.17} 17%|█▋ | 84/492 [43:58<3:32:11, 31.20s/it]predicted value: tensor([[0.7578], [1.0703], [1.0625], [0.5117], [0.9023], [0.4961], [0.4434], [0.4844], [0.7227], [0.5234], [0.3789], [0.5508], [0.4102], [0.3574], [0.2871], [0.3164]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [1.0000], [0.3340], 
[0.8008], [0.3340], [0.2500], [0.2500], [0.6016], [0.2500], [0.3340], [0.5000], [0.4004], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00225830078125 loss: 0.00787353515625 loss: 0.004974365234375 loss: 0.0035400390625 predicted value: tensor([[0.4766], [1.1172], [0.7656], [1.1562], [0.3594], [0.6797], [0.7422], [1.1094], [0.7852], [0.5430], [0.7070], [0.5430], [0.4160], [0.4805], [0.2480], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [1.0000], [0.8008], [1.0000], [0.2500], [0.7500], [0.5703], [1.0000], [0.7500], [0.4004], [0.7500], [0.6016], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004302978515625 loss: 0.0032958984375 loss: 0.00421142578125 loss: 0.0040283203125 predicted value: tensor([[1.1328], [0.6055], [0.8281], [1.1172], [1.1328], [1.1484], [0.7500], [1.1172], [0.8711], [0.4961], [0.4492], [0.5508], [0.4199], [0.2988], [0.3340], [0.3086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4648], [0.8320], [1.0000], [1.0000], [1.0000], [0.6680], [1.0000], [0.6680], [0.3340], [0.3340], [0.6016], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007354736328125 loss: 0.003875732421875 loss: 0.00335693359375 loss: 0.003387451171875 predicted value: tensor([[0.4590], [0.7266], [0.8516], [1.1250], [0.6211], [0.5820], [0.4668], [0.6211], [0.7617], [0.8516], [0.5273], [0.5352], [0.4941], [0.4473], [0.3105], [0.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8008], [1.0000], [0.6016], [0.4668], [0.3340], [0.6016], [0.4668], [0.8320], [0.4004], [0.3340], [0.4004], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0047607421875 loss: 0.0030517578125 loss: 0.004364013671875 loss: 0.0035552978515625 17%|█▋ | 85/492 [44:29<3:30:59, 31.11s/it] {'loss': 0.017, 'learning_rate': 1e-05, 'epoch': 0.17} 
17%|█▋ | 85/492 [44:29<3:30:59, 31.11s/it]predicted value: tensor([[0.5547], [0.8594], [1.1016], [0.4355], [0.5547], [0.4277], [0.6172], [0.8203], [1.0469], [0.6680], [0.5039], [0.3984], [0.5820], [0.3203], [0.6055], [0.2930]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [1.0000], [0.2500], [0.3750], [0.3340], [0.4668], [0.5547], [1.0000], [0.5703], [0.4004], [0.3340], [0.4004], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005828857421875 loss: 0.00494384765625loss: 0.00421142578125 loss: 0.004150390625 predicted value: tensor([[0.5781], [1.0938], [0.5859], [0.3262], [0.7031], [0.6719], [0.6797], [1.1094], [0.4570], [0.7734], [1.1641], [0.7539], [0.5938], [0.3887], [0.2305], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.3340], [0.5000], [0.8008], [0.4648], [1.0000], [0.2500], [0.8008], [1.0000], [0.6016], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002899169921875 loss: 0.0040283203125loss: 0.00537109375 loss: 0.0037384033203125 predicted value: tensor([[0.7188], [0.4785], [0.4062], [0.5547], [1.0859], [0.6094], [0.7305], [0.4883], [0.6211], [0.7734], [0.2988], [0.5039], [0.4941], [0.5742], [0.2773], [0.3398]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.3145], [0.3340], [0.4668], [1.0000], [0.4668], [0.5000], [0.4668], [0.3750], [0.8008], [0.0400], [0.4004], [0.4004], [0.7500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005706787109375 loss: 0.006072998046875 loss: 0.005126953125 loss: 0.004180908203125 predicted value: tensor([[0.9570], [0.4551], [1.1328], [0.7070], [0.4980], [1.1016], [0.3770], [0.5352], [0.5859], [0.4512], [0.3301], [0.3496], [0.4766], [0.4355], [0.3672], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3340], [1.0000], [0.4668], [0.4668], [1.0000], 
[0.3340], [0.4668], [0.2500], [0.5000], [0.0400], [0.4004], [0.3750], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00396728515625 loss: 0.00506591796875 loss: 0.005615234375 loss: 0.0025787353515625 17%|█▋ | 86/492 [45:00<3:31:39, 31.28s/it] {'loss': 0.0184, 'learning_rate': 1e-05, 'epoch': 0.17} 17%|█▋ | 86/492 [45:00<3:31:39, 31.28s/it]predicted value: tensor([[0.5039], [0.4238], [0.4180], [0.3945], [0.9336], [0.2988], [0.9727], [0.2295], [0.7617], [0.3418], [0.3184], [0.5039], [0.3457], [0.9023], [0.0796], [0.1270]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.6016], [0.4668], [0.4668], [1.0000], [0.3340], [1.0000], [0.2500], [0.8008], [0.4004], [0.5000], [0.7500], [0.7500], [1.0000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00421142578125 loss: 0.00390625 loss: 0.005859375 loss: 0.002777099609375 predicted value: tensor([[0.7656], [0.4902], [0.6367], [0.1865], [0.4512], [0.4395], [0.5469], [0.2969], [0.4824], [0.5625], [0.2832], [0.3633], [0.2695], [0.3242], [0.1631], [0.1104]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.8008], [0.2002], [0.8008], [0.6016], [0.3750], [0.3340], [0.3145], [0.7500], [0.5000], [0.4004], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003997802734375 loss: 0.00616455078125 loss: 0.0040283203125 loss: 0.003997802734375 predicted value: tensor([[0.4316], [0.9492], [0.5391], [0.9766], [0.9297], [0.9609], [0.5391], [0.2930], [0.0464], [0.4805], [0.4668], [0.4648], [0.3672], [0.3418], [0.1426], [0.1475]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8320], [1.0000], [1.0000], [1.0000], [0.8008], [0.4668], [0.0400], [0.5000], [0.7500], [0.5000], [0.4004], [0.6016], [0.0400], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00958251953125 loss: 0.0057373046875 loss: 0.0023956298828125 loss: 
0.004638671875 predicted value: tensor([[0.2910], [0.9453], [0.8945], [0.2354], [0.3125], [0.8867], [0.1689], [0.8672], [0.5547], [0.5312], [0.2217], [0.1328], [0.3418], [0.3105], [0.4023], [0.0564]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [0.2500], [0.3340], [1.0000], [0.2500], [1.0000], [0.4668], [0.6680], [0.2002], [0.1670], [0.4004], [0.7500], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004425048828125 loss: 0.005462646484375 loss: 0.0034027099609375 loss: 0.00146484375 18%|█▊ | 87/492 [45:31<3:30:37, 31.20s/it] {'loss': 0.018, 'learning_rate': 1e-05, 'epoch': 0.18} 18%|█▊ | 87/492 [45:31<3:30:37, 31.20s/it]predicted value: tensor([[0.1738], [0.9023], [0.4395], [0.2256], [0.5195], [0.4141], [0.6484], [0.9414], [0.1699], [0.3965], [0.5664], [0.4160], [0.3867], [0.1631], [0.1289], [0.1069]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.5547], [0.2002], [0.7500], [0.3340], [0.8008], [1.0000], [0.2002], [0.4004], [0.7500], [0.5000], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00482177734375 loss: 0.0033416748046875loss: 0.00177764892578125 loss: 0.0021820068359375 predicted value: tensor([[0.2451], [0.5469], [0.5312], [0.2773], [0.4316], [0.7930], [0.5898], [0.4336], [0.3340], [0.3301], [0.4629], [0.3477], [0.3477], [0.1797], [0.1973], [0.1494]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.8008], [0.3340], [0.5000], [0.8008], [0.8008], [0.6016], [0.3340], [0.2500], [0.5000], [0.3340], [0.3340], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005340576171875 loss: 0.00445556640625 loss: 0.00396728515625 loss: 0.00323486328125 predicted value: tensor([[0.3867], [0.3711], [0.6094], [0.5156], [0.3906], [0.8984], [0.8828], [0.7070], [0.4199], [0.5508], [0.5312], [0.3613], [0.2949], [0.1309], [0.1680], [0.1118]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8008], [0.3750], [0.2500], [1.0000], [1.0000], [0.6016], [0.4668], [0.4668], [0.7500], [0.6016], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004791259765625 loss: 0.004638671875loss: 0.00665283203125 loss: 0.0028533935546875 predicted value: tensor([[0.6367], [0.3535], [0.9062], [0.3359], [0.3926], [0.7656], [0.2988], [0.4961], [0.3652], [0.5742], [0.3086], [0.2236], [0.3027], [0.1250], [0.1670], [0.1748]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [1.0000], [0.4668], [0.4668], [0.8008], [0.3340], [0.6016], [0.5000], [0.6016], [0.3340], [0.0625], [0.3340], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004608154296875 loss: 0.00225830078125 loss: 0.0022735595703125 loss: 0.0054931640625 18%|█▊ | 88/492 [46:02<3:29:55, 31.18s/it] {'loss': 0.0157, 'learning_rate': 1e-05, 'epoch': 0.18} 18%|█▊ | 88/492 [46:02<3:29:55, 31.18s/it]predicted value: tensor([[0.6328], [0.4355], [0.9258], [0.8086], [0.4512], [0.7500], [0.8984], [0.5547], [0.6367], [0.5664], [0.6016], [0.4473], [0.6328], [0.5234], [0.5391], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2002], [0.8320], [0.6680], [0.3750], [0.8008], [0.8008], [0.3750], [0.2002], [0.6016], [0.7500], [0.4004], [0.5000], [0.4004], [0.4004], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00640869140625loss: 0.006011962890625 loss: 0.005706787109375 loss: 0.0028076171875 predicted value: tensor([[0.9453], [0.6875], [0.4746], [0.7031], [1.0703], [0.3145], [1.0547], [0.5352], [0.4629], [0.6367], [0.4766], [0.5273], [0.5273], [0.3379], [0.3672], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.3750], [0.6172], [1.0000], [0.2002], [1.0000], [0.6016], [0.2500], [0.4668], [0.4004], [0.4004], [0.2500], [0.2500], 
[0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00665283203125 loss: 0.004241943359375loss: 0.00341796875 loss: 0.004302978515625 predicted value: tensor([[0.4961], [0.4883], [0.8242], [0.6875], [1.1094], [1.0781], [0.5117], [0.7422], [1.0859], [0.6797], [1.0859], [0.5820], [0.3008], [0.2988], [0.2871], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.7148], [0.8008], [1.0000], [1.0000], [0.1426], [0.6016], [1.0000], [0.6016], [1.0000], [0.4004], [0.3340], [0.2002], [0.1670], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002960205078125 loss: 0.0035858154296875 loss: 0.004638671875 loss: 0.0021209716796875 predicted value: tensor([[0.7344], [0.5664], [0.7188], [1.0938], [0.7148], [0.9414], [0.8984], [1.1016], [0.6406], [0.7617], [1.1172], [0.6016], [0.4082], [0.4805], [0.5469], [0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [1.0000], [0.8008], [0.8320], [0.8320], [1.0000], [0.7500], [0.8008], [1.0000], [0.6680], [0.5000], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004119873046875 loss: 0.006011962890625 loss: 0.0030059814453125loss: 0.0059814453125 18%|█▊ | 89/492 [46:34<3:29:11, 31.14s/it] {'loss': 0.018, 'learning_rate': 1e-05, 'epoch': 0.18} 18%|█▊ | 89/492 [46:34<3:29:11, 31.14s/it]predicted value: tensor([[0.7383], [0.3516], [0.4082], [0.6211], [0.7773], [0.8555], [0.4668], [0.6719], [0.5586], [0.6953], [1.0703], [0.5898], [0.5586], [0.5430], [0.3477], [0.3027]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3340], [0.3750], [0.4668], [0.6680], [0.8008], [0.5000], [0.6016], [0.6016], [0.7500], [1.0000], [0.4668], [0.3340], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00323486328125 loss: 0.003021240234375 loss: 0.0025634765625 loss: 0.00543212890625 predicted value: tensor([[1.0391], [0.6250], [0.7148], 
[1.0156], [0.4766], [0.3848], [0.4531], [1.0156], [0.5352], [0.5625], [0.6289], [0.5898], [0.4004], [0.3496], [0.3438], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [1.0000], [0.4668], [0.2500], [0.2002], [1.0000], [0.2500], [0.5000], [0.4668], [0.4004], [0.3340], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034637451171875 loss: 0.0045166015625loss: 0.004852294921875 loss: 0.0027313232421875 predicted value: tensor([[0.3809], [1.0312], [1.0781], [0.6992], [0.5078], [0.4434], [0.6094], [0.3965], [0.4863], [0.8281], [0.5430], [1.0781], [0.3809], [0.5117], [0.3008], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [1.0000], [1.0000], [0.5703], [0.3750], [0.2500], [0.4668], [0.5000], [0.4004], [0.7500], [0.6016], [1.0000], [0.4004], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003509521484375loss: 0.0027618408203125 loss: 0.0029754638671875 loss: 0.005523681640625 predicted value: tensor([[0.7734], [0.9414], [0.5391], [0.4219], [0.7461], [1.0625], [1.0312], [1.0859], [0.2734], [0.4707], [0.5430], [0.9961], [0.5977], [0.4746], [0.2461], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [0.8320], [0.4668], [0.2500], [0.8008], [1.0000], [1.0000], [1.0000], [0.0625], [0.4004], [0.3340], [1.0000], [0.3750], [0.3340], [0.2500], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0036773681640625 loss: 0.0042724609375 loss: 0.0023193359375 loss: 0.003814697265625 18%|█▊ | 90/492 [47:04<3:28:12, 31.08s/it] {'loss': 0.0147, 'learning_rate': 1e-05, 'epoch': 0.18} 18%|█▊ | 90/492 [47:04<3:28:12, 31.08s/it]predicted value: tensor([[0.3359], [0.2715], [0.4902], [0.8203], [0.5742], [0.4492], [0.7188], [0.6211], [0.5352], [0.5078], [0.9062], [0.2676], [0.3438], [0.0767], [0.3418], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.3750], [0.4668], [0.5547], [1.0000], [0.6016], [0.4668], [0.8008], [0.7500], [0.7500], [0.7500], [1.0000], [0.4004], [0.3340], [0.1670], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003570556640625 loss: 0.007110595703125 loss: 0.00390625 loss: 0.00341796875 predicted value: tensor([[0.2109], [0.8633], [0.6484], [0.8555], [0.9141], [0.8906], [0.5273], [0.5195], [0.2129], [0.9375], [0.4453], [0.9219], [0.2246], [0.1934], [0.1426], [0.1201]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [1.0000], [0.8320], [1.0000], [1.0000], [1.0000], [0.8320], [0.6016], [0.2002], [1.0000], [0.3750], [1.0000], [0.2002], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002532958984375 loss: 0.003631591796875 loss: 0.0033111572265625 loss: 0.004547119140625 predicted value: tensor([[0.8633], [0.8633], [0.4551], [0.7031], [0.1357], [0.8828], [0.3984], [0.4082], [0.5977], [0.3750], [0.4023], [0.3086], [0.3008], [0.1543], [0.1533], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.8320], [0.2002], [1.0000], [0.2500], [0.3750], [0.7500], [0.6016], [0.4668], [0.4004], [0.4004], [0.2002], [0.0400], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034027099609375 loss: 0.003326416015625 loss: 0.0032196044921875 loss: 0.00457763671875 predicted value: tensor([[0.9102], [0.3223], [0.9023], [0.6406], [0.8555], [0.4492], [0.3086], [0.3574], [0.6797], [0.4746], [0.4629], [0.2656], [0.5156], [0.1953], [0.2461], [0.1338]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [1.0000], [0.6680], [1.0000], [0.6016], [0.4668], [0.4668], [0.8008], [0.6016], [0.5000], [0.4004], [0.6016], [0.1670], [0.4004], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00384521484375 loss: 0.002838134765625 loss: 0.001861572265625 loss: 0.003936767578125 18%|█▊ | 91/492 [47:35<3:27:36, 31.06s/it] 
{'loss': 0.0148, 'learning_rate': 1e-05, 'epoch': 0.18} 18%|█▊ | 91/492 [47:35<3:27:36, 31.06s/it]predicted value: tensor([[0.7852], [0.5469], [0.9062], [0.5508], [0.3125], [0.5742], [0.9258], [0.3691], [0.2695], [0.4102], [0.5703], [0.5781], [0.4590], [0.3359], [0.3965], [0.1650]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.9102], [0.8008], [1.0000], [0.8008], [0.2500], [0.8008], [1.0000], [0.3340], [0.2500], [0.4277], [0.8008], [0.8008], [0.5000], [0.3340], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00238037109375 loss: 0.00506591796875 loss: 0.0027008056640625 loss: 0.00360107421875 predicted value: tensor([[0.5664], [0.6094], [0.0625], [0.6016], [0.6562], [0.9258], [0.5391], [0.8555], [0.5195], [0.5820], [0.5039], [0.2910], [0.2891], [0.3906], [0.1523], [0.1660]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.2002], [0.4648], [0.8320], [1.0000], [0.7500], [1.0000], [0.7500], [0.6016], [0.4668], [0.4004], [0.5000], [0.3340], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033721923828125 loss: 0.004669189453125loss: 0.003265380859375 loss: 0.002532958984375 predicted value: tensor([[0.8086], [0.9141], [0.4062], [0.6953], [0.2041], [0.9062], [0.4043], [0.9219], [0.7148], [0.3223], [0.4023], [0.3477], [0.3535], [0.3477], [0.1162], [0.1465]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [1.0000], [0.5547], [0.6680], [0.2500], [1.0000], [0.6016], [1.0000], [0.5000], [0.2002], [0.6016], [0.5000], [0.4004], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00396728515625 loss: 0.0038299560546875 loss: 0.004180908203125 loss: 0.005889892578125 predicted value: tensor([[0.8672], [0.4375], [0.4199], [0.8945], [0.2148], [0.4785], [0.9961], [0.3242], [0.6133], [0.4082], [0.4609], [0.4590], [0.4414], [0.1719], [0.1387], [0.1035]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[1.0000], [0.4668], [0.4668], [1.0000], [0.2002], [0.6680], [1.0000], [0.4668], [0.8008], [0.5000], [0.4668], [0.6016], [0.2500], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00262451171875 loss: 0.0064697265625 loss: 0.0030364990234375loss: 0.005157470703125 19%|█▊ | 92/492 [48:07<3:28:02, 31.21s/it] {'loss': 0.0157, 'learning_rate': 1e-05, 'epoch': 0.19} 19%|█▊ | 92/492 [48:07<3:28:02, 31.21s/it]predicted value: tensor([[0.5000], [0.6289], [0.2852], [1.1172], [0.7891], [0.6406], [0.7383], [0.6289], [0.5234], [0.7422], [1.1016], [0.5273], [0.4375], [0.4785], [0.2383], [0.2930]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3145], [0.5547], [0.3340], [1.0000], [0.6016], [0.8008], [0.6680], [0.7500], [0.5000], [0.5000], [1.0000], [0.4004], [0.4004], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005157470703125 loss: 0.0038604736328125loss: 0.0038604736328125 loss: 0.003021240234375 predicted value: tensor([[1.1172], [0.4531], [0.4180], [0.4688], [1.0547], [0.4395], [0.7578], [0.5234], [0.5312], [1.1172], [0.4395], [0.6797], [0.4355], [0.3145], [0.3223], [0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.2500], [0.2500], [1.0000], [0.2500], [0.4668], [0.6016], [0.4668], [1.0000], [0.3340], [0.7500], [0.0400], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0074462890625 loss: 0.006927490234375 loss: 0.00634765625 loss: 0.0025787353515625 predicted value: tensor([[1.0234], [0.4668], [0.2773], [0.7852], [1.0625], [1.0469], [1.0547], [1.1250], [0.7500], [0.5898], [0.6953], [0.6250], [0.4258], [0.3086], [0.2500], [0.3242]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.2500], [0.5547], [1.0000], [1.0000], [1.0000], [1.0000], [0.5000], [0.5000], [0.6016], [0.5000], [0.2500], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) 
loss: 0.0038604736328125 loss: 0.003387451171875 loss: 0.0031890869140625 loss: 0.0027923583984375 predicted value: tensor([[0.9492], [0.3359], [0.8320], [0.5586], [0.7578], [1.1719], [0.7422], [0.7578], [0.8047], [0.8125], [0.6289], [0.6328], [0.4199], [1.0625], [0.3184], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2002], [0.8008], [0.4668], [0.8008], [1.0000], [0.8008], [0.8008], [0.8008], [0.6016], [0.4668], [0.7500], [0.3340], [1.0000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004730224609375 loss: 0.0029449462890625 loss: 0.004638671875 loss: 0.0045166015625 19%|█▉ | 93/492 [48:38<3:27:47, 31.25s/it] {'loss': 0.0173, 'learning_rate': 1e-05, 'epoch': 0.19} 19%|█▉ | 93/492 [48:38<3:27:47, 31.25s/it]predicted value: tensor([[1.0938], [0.5938], [0.7461], [0.6680], [1.0938], [0.6289], [0.5703], [1.1328], [0.8125], [1.1172], [0.7500], [0.4922], [0.5781], [0.2910], [0.3164], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.5547], [0.4668], [1.0000], [0.6680], [0.6016], [1.0000], [0.8008], [1.0000], [0.6016], [0.4004], [0.6016], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00469970703125 loss: 0.00360107421875 loss: 0.005584716796875 loss: 0.0025787353515625 predicted value: tensor([[1.0703], [0.5352], [1.1016], [0.8555], [0.6641], [0.3516], [0.5664], [0.8164], [1.1172], [0.3379], [0.6836], [0.3535], [0.4863], [0.4727], [0.2559], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.8320], [0.8008], [0.2500], [0.2500], [0.6016], [1.0000], [0.2002], [0.6016], [0.1670], [0.4004], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002166748046875 loss: 0.0033416748046875 loss: 0.004852294921875 loss: 0.0030670166015625 predicted value: tensor([[1.0938], [0.3711], [0.5078], [1.1172], [0.7773], [1.0938], [0.3789], [0.3047], 
[1.1328], [1.1016], [1.1016], [0.4961], [0.4668], [0.3223], [0.2363], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.3750], [1.0000], [0.8008], [1.0000], [0.2002], [0.2002], [1.0000], [1.0000], [1.0000], [0.2500], [0.4004], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0040283203125 loss: 0.0034942626953125 loss: 0.003387451171875 loss: 0.00360107421875 predicted value: tensor([[0.4609], [1.0625], [0.6133], [0.5586], [1.0547], [1.1953], [0.8984], [0.3633], [0.6602], [0.7031], [0.3359], [0.7383], [0.5742], [0.2598], [0.2197], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.5547], [0.4668], [1.0000], [1.0000], [0.8008], [0.2500], [0.6016], [0.5000], [0.0625], [0.6016], [0.4004], [0.2500], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032806396484375 loss: 0.00445556640625 loss: 0.00390625 loss: 0.00384521484375 19%|█▉ | 94/492 [49:10<3:27:23, 31.26s/it] {'loss': 0.015, 'learning_rate': 1e-05, 'epoch': 0.19} 19%|█▉ | 94/492 [49:10<3:27:23, 31.26s/it]predicted value: tensor([[0.4766], [0.8945], [0.2305], [0.2295], [0.7656], [0.2061], [0.5234], [0.5000], [0.4688], [0.5625], [0.3281], [0.4199], [0.3359], [0.1357], [0.0991], [0.1045]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.2500], [0.3340], [0.8320], [0.2500], [0.8008], [0.6680], [0.6680], [0.6016], [0.4004], [0.5000], [0.5000], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00421142578125 loss: 0.0029144287109375 loss: 0.004119873046875 loss: 0.00482177734375 predicted value: tensor([[0.2852], [0.9180], [0.4727], [0.9609], [0.7344], [0.1953], [0.3125], [0.3867], [0.2285], [0.8672], [0.3926], [0.2969], [0.3691], [0.3652], [0.1016], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [1.0000], [0.8008], [0.2500], 
[0.3750], [0.3750], [0.5000], [1.0000], [0.3340], [0.1670], [0.4004], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002899169921875 loss: 0.0034942626953125 loss: 0.0027618408203125 loss: 0.002288818359375 predicted value: tensor([[0.5469], [0.3633], [0.2559], [0.2002], [0.6914], [0.4453], [0.6094], [0.5273], [0.3535], [0.5898], [0.3965], [0.4023], [0.1250], [0.3242], [0.0913], [0.0938]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.4668], [0.6680], [0.6016], [0.7500], [0.7500], [0.6016], [0.6016], [0.4004], [0.5000], [0.2500], [0.4004], [0.1426], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030975341796875 loss: 0.0050048828125loss: 0.0037994384765625 loss: 0.00299072265625 predicted value: tensor([[0.2617], [0.3516], [0.6875], [0.7461], [0.3750], [0.5781], [0.2158], [0.5469], [0.2354], [0.1875], [0.9375], [0.3633], [0.3125], [0.4531], [0.1030], [0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8008], [0.8008], [0.2500], [0.6680], [0.2500], [0.6016], [0.3340], [0.2500], [1.0000], [0.4004], [0.3340], [0.5000], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001953125 loss: 0.002227783203125 loss: 0.002166748046875loss: 0.00518798828125 19%|█▉ | 95/492 [49:41<3:26:51, 31.26s/it] {'loss': 0.0135, 'learning_rate': 1e-05, 'epoch': 0.19} 19%|█▉ | 95/492 [49:41<3:26:51, 31.26s/it]predicted value: tensor([[0.3965], [0.1582], [0.6289], [0.3789], [0.2207], [0.3008], [0.9805], [0.5117], [0.3125], [0.7148], [0.3477], [0.4297], [0.2871], [0.2969], [0.2539], [0.1504]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3340], [0.4668], [0.4668], [0.2500], [0.2500], [1.0000], [0.3750], [0.2500], [0.8320], [0.4004], [0.5000], [0.0625], [0.3340], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00191497802734375 loss: 0.003631591796875loss: 
0.00177001953125 loss: 0.00408935546875 predicted value: tensor([[0.2578], [0.2178], [0.4941], [0.2002], [0.9727], [0.5664], [0.9727], [0.9375], [0.5078], [0.4961], [0.2656], [0.4766], [0.3652], [0.2910], [0.0845], [0.0811]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.4668], [0.2002], [1.0000], [0.6016], [1.0000], [1.0000], [0.6016], [0.6016], [0.2500], [0.7500], [0.5000], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00159454345703125 loss: 0.003692626953125 loss: 0.002288818359375 loss: 0.0034027099609375 predicted value: tensor([[0.3555], [0.8711], [0.4336], [0.9375], [0.9844], [0.9922], [0.4570], [0.3633], [0.4531], [0.6250], [0.3555], [0.3672], [0.2793], [0.3477], [0.1670], [0.0972]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [1.0000], [1.0000], [1.0000], [0.4668], [0.3750], [0.6016], [0.7500], [0.4004], [0.4004], [0.2500], [0.5000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001953125 loss: 0.00173187255859375 loss: 0.002105712890625 loss: 0.0037078857421875 predicted value: tensor([[0.5117], [0.7070], [0.9609], [0.4629], [0.9648], [0.4238], [0.2500], [0.3828], [0.4102], [0.4102], [0.6562], [0.5234], [0.3809], [0.3574], [0.1494], [0.1196]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [1.0000], [0.5547], [1.0000], [0.3340], [0.3340], [0.3340], [0.2500], [0.5000], [0.8008], [0.6016], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005889892578125 loss: 0.00347900390625 loss: 0.0018463134765625 loss: 0.003326416015625 20%|█▉ | 96/492 [50:12<3:25:49, 31.18s/it] {'loss': 0.0116, 'learning_rate': 1e-05, 'epoch': 0.2} 20%|█▉ | 96/492 [50:12<3:25:49, 31.18s/it]predicted value: tensor([[0.7500], [1.0703], [1.0703], [1.0781], [0.7891], [0.7852], [0.3359], [0.4746], [0.5273], [0.3730], [0.5859], [0.3770], [0.3809], 
[0.5000], [0.2969], [0.3105]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [1.0000], [1.0000], [1.0000], [0.7500], [0.8008], [0.2002], [0.2500], [0.2500], [0.2500], [0.6016], [0.3340], [0.5000], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032501220703125loss: 0.0050048828125 loss: 0.00193023681640625 loss: 0.0026702880859375 predicted value: tensor([[0.3477], [1.0469], [0.6641], [0.4727], [0.7930], [0.5977], [0.5938], [0.4727], [0.5508], [0.4238], [0.6484], [0.8828], [0.6445], [0.4922], [0.2500], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.1670], [1.0000], [0.4668], [0.4668], [0.8008], [0.4668], [0.2715], [0.2500], [0.4668], [0.6016], [0.2500], [0.8008], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002960205078125 loss: 0.007598876953125 loss: 0.0023040771484375 loss: 0.003936767578125 predicted value: tensor([[0.5469], [0.4023], [0.5430], [0.9414], [0.8633], [0.4863], [1.0312], [1.1328], [0.5664], [0.3125], [0.5430], [0.3789], [0.4785], [0.4590], [0.2383], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.8008], [0.8008], [0.2500], [1.0000], [1.0000], [0.3750], [0.2500], [0.5000], [0.3340], [0.5000], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0035552978515625 loss: 0.0025482177734375loss: 0.004364013671875 loss: 0.005889892578125 predicted value: tensor([[0.4473], [0.5742], [0.5273], [0.5547], [0.8203], [1.1172], [0.5195], [0.6406], [1.1250], [0.4609], [0.3887], [0.5781], [0.7188], [0.3652], [0.3066], [0.3027]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.4668], [0.6680], [1.0000], [0.3340], [0.7500], [1.0000], [0.2500], [0.3340], [0.4004], [0.6016], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00384521484375loss: 
0.002197265625 loss: 0.00482177734375 loss: 0.004119873046875 20%|█▉ | 97/492 [50:43<3:25:15, 31.18s/it] {'loss': 0.0152, 'learning_rate': 1e-05, 'epoch': 0.2} 20%|█▉ | 97/492 [50:43<3:25:15, 31.18s/it]predicted value: tensor([[0.5938], [0.8398], [1.0781], [1.0078], [0.4355], [0.5391], [0.7578], [0.6836], [0.3516], [0.4785], [0.4941], [0.5859], [0.5195], [0.3008], [0.2070], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [1.0000], [1.0000], [0.2500], [0.3750], [0.8008], [0.7500], [0.3340], [0.2002], [0.4277], [0.6016], [0.3340], [0.1670], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00567626953125 loss: 0.0034332275390625loss: 0.002197265625 loss: 0.005157470703125 predicted value: tensor([[0.6719], [0.8008], [1.0312], [0.5000], [0.4922], [0.7773], [0.8438], [0.3750], [0.8086], [0.4531], [0.5273], [0.5703], [0.5742], [0.5469], [0.2793], [0.5469]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [1.0000], [0.4668], [0.4668], [0.4668], [0.8008], [0.2002], [0.7500], [0.4004], [0.6016], [0.6016], [0.6016], [0.4004], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006256103515625 loss: 0.0032501220703125loss: 0.0036163330078125 loss: 0.005218505859375 predicted value: tensor([[0.6641], [0.8398], [0.4199], [0.7383], [0.7188], [0.3496], [0.9805], [0.7148], [0.6992], [0.9453], [0.6094], [0.5977], [1.0781], [0.4395], [0.2441], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.3340], [0.8008], [0.5547], [0.2500], [1.0000], [0.7500], [0.6016], [1.0000], [0.4668], [0.5000], [1.0000], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00408935546875 loss: 0.00457763671875 loss: 0.0023193359375 loss: 0.0029296875 predicted value: tensor([[0.4746], [0.6406], [0.8516], [0.5781], [1.0938], [0.3574], [0.6328], [1.0391], [1.0703], [1.0078], [0.5195], [0.4492], 
[0.4277], [0.2617], [0.2480], [0.2891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8008], [0.4668], [1.0000], [0.2500], [0.6016], [1.0000], [1.0000], [1.0000], [0.4004], [0.2002], [0.4004], [0.2002], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032806396484375 loss: 0.004058837890625 loss: 0.002655029296875 loss: 0.002410888671875 20%|█▉ | 98/492 [51:14<3:24:23, 31.13s/it] {'loss': 0.0153, 'learning_rate': 1e-05, 'epoch': 0.2} 20%|█▉ | 98/492 [51:14<3:24:23, 31.13s/it]predicted value: tensor([[0.3184], [0.7148], [0.2158], [0.9258], [0.1514], [0.6172], [0.6445], [0.9531], [0.5039], [0.3984], [0.4551], [0.3828], [0.2070], [0.1729], [0.1338], [0.1299]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.2500], [1.0000], [0.2002], [0.6680], [0.8008], [1.0000], [0.7500], [0.2500], [0.5000], [0.5000], [0.0400], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004974365234375 loss: 0.0031280517578125 loss: 0.005218505859375 loss: 0.0052490234375 predicted value: tensor([[0.9258], [0.4844], [0.5820], [0.8789], [0.2266], [0.3379], [0.8789], [0.3223], [0.6562], [0.5430], [0.2578], [0.2832], [0.3086], [0.1177], [0.1157], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [1.0000], [0.3750], [0.4668], [1.0000], [0.3145], [0.8008], [0.5000], [0.3750], [0.4004], [0.5000], [0.1670], [0.2002], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003265380859375 loss: 0.004302978515625 loss: 0.0027618408203125 loss: 0.00225830078125 predicted value: tensor([[0.7305], [0.4258], [0.5703], [0.8906], [0.9023], [0.4160], [0.9375], [0.3848], [0.8711], [0.5547], [0.2373], [0.3496], [0.4824], [0.2969], [0.1211], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4648], [0.7500], [1.0000], [1.0000], [0.3750], [1.0000], [0.4004], [1.0000], 
[0.5000], [0.0625], [0.5000], [0.6016], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003326416015625 loss: 0.0026702880859375 loss: 0.0036773681640625 loss: 0.0027923583984375 predicted value: tensor([[0.3652], [0.2090], [0.4297], [0.8867], [0.5820], [0.1709], [0.4336], [0.5938], [0.5820], [0.3828], [0.5703], [0.4629], [0.2832], [0.3359], [0.1592], [0.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3340], [0.4668], [1.0000], [0.8008], [0.2002], [0.6016], [0.6016], [0.8008], [0.5000], [0.6016], [0.6016], [0.4004], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005889892578125 loss: 0.0078125 loss: 0.0035858154296875loss: 0.004638671875 20%|██ | 99/492 [51:45<3:23:38, 31.09s/it] {'loss': 0.0164, 'learning_rate': 1e-05, 'epoch': 0.2} 20%|██ | 99/492 [51:45<3:23:38, 31.09s/it]predicted value: tensor([[0.7266], [0.4395], [0.6094], [0.7344], [0.8047], [0.3750], [0.6523], [0.4766], [0.1973], [0.9375], [0.3613], [0.3867], [0.2930], [0.1572], [0.1729], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [0.4668], [0.8008], [0.8008], [0.8320], [0.3340], [0.8008], [0.6016], [0.3340], [1.0000], [0.5000], [0.2500], [0.4004], [0.1426], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003814697265625 loss: 0.003082275390625 loss: 0.0024261474609375 loss: 0.0033111572265625 predicted value: tensor([[0.5469], [0.8828], [0.1924], [0.4180], [0.8438], [0.2041], [0.5781], [0.5703], [0.0559], [0.4180], [0.4707], [0.6055], [0.8945], [0.3281], [0.2080], [0.1982]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.2500], [0.4668], [1.0000], [0.3340], [0.8008], [0.6680], [0.0625], [0.4668], [0.6016], [0.6016], [1.0000], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0074462890625 loss: 0.00244140625loss: 0.0027618408203125 loss: 0.006439208984375 predicted 
value: tensor([[0.6523], [0.5820], [0.4395], [0.1807], [0.3379], [0.5898], [0.3164], [0.8789], [0.5898], [0.3262], [0.3105], [0.5234], [0.5977], [0.4766], [0.2656], [0.1309]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.7148], [0.4668], [0.2500], [0.2002], [0.7500], [0.2500], [1.0000], [0.6016], [0.3340], [0.3340], [0.7500], [0.7500], [0.7500], [0.3340], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004150390625 loss: 0.003936767578125loss: 0.005340576171875 loss: 0.0047607421875 predicted value: tensor([[0.3340], [0.8477], [0.4062], [0.2969], [0.3008], [0.3770], [0.2969], [0.1602], [0.5117], [0.5352], [0.3203], [0.3633], [0.2949], [0.3691], [0.1875], [0.1045]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.2002], [0.3145], [0.4668], [0.3750], [0.2500], [0.6016], [0.7500], [0.5000], [0.3340], [0.3340], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006805419921875 loss: 0.0029296875loss: 0.0023651123046875 loss: 0.002655029296875 20%|██ | 100/492 [52:16<3:23:33, 31.16s/it] {'loss': 0.0162, 'learning_rate': 1e-05, 'epoch': 0.2} 20%|██ | 100/492 [52:16<3:23:33, 31.16s/it]Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41. Non-default generation parameters: {'max_length': 4096} /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. 
warnings.warn( /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. warnings.warn( /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True.
Gradients will be None warnings.warn( predicted value: tensor([[1.0703], [0.3906], [0.6367], [0.3379], [0.7578], [0.5273], [0.5195], [0.8125], [0.6797], [0.7344], [0.5586], [0.4434], [0.5156], [0.3047], [0.4863], [0.3340]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.4668], [0.2500], [0.8008], [0.6016], [0.4668], [0.6016], [0.6016], [0.7500], [0.5000], [0.5000], [0.4004], [0.2002], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032806396484375 loss: 0.00286865234375loss: 0.0031890869140625 loss: 0.0042724609375 predicted value: tensor([[0.5703], [0.4277], [1.0547], [0.5391], [0.9141], [0.4453], [0.4277], [1.0078], [0.9844], [0.4121], [0.5977], [0.5156], [0.2930], [0.3750], [0.3398], [0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2002], [1.0000], [0.4668], [0.8320], [0.2500], [0.2002], [1.0000], [1.0000], [0.2002], [0.3340], [0.4004], [0.0278], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004486083984375 loss: 0.006500244140625 loss: 0.0030517578125 loss: 0.00347900390625 predicted value: tensor([[0.5664], [0.6562], [0.5586], [0.5625], [0.7695], [1.0625], [0.4238], [0.7422], [0.6211], [0.8867], [0.4922], [0.4473], [0.5117], [0.4023], [0.3496], [0.5078]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6016], [0.3750], [0.4668], [0.7500], [1.0000], [0.2500], [0.7500], [0.3750], [0.8008], [0.5000], [0.4004], [0.5000], [0.2002], [0.2500], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004119873046875 loss: 0.0031585693359375loss: 0.0027008056640625 loss: 0.002838134765625 predicted value: tensor([[1.0391], [0.4922], [1.0625], [0.4141], [0.5742], [0.6367], [1.0703], [0.4375], [0.7070], [0.6953], [0.3574], [0.5352], [0.4102], [0.6992], [0.2852], [0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.3750], [0.4668], 
[0.8008], [1.0000], [0.3340], [0.6016], [0.6680], [0.3340], [0.4668], [0.5000], [0.7500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0054931640625 loss: 0.00160980224609375 loss: 0.003173828125 loss: 0.00146484375 21%|██ | 101/492 [54:23<6:29:50, 59.82s/it] {'loss': 0.0139, 'learning_rate': 1e-05, 'epoch': 0.21} 21%|██ | 101/492 [54:23<6:29:50, 59.82s/it]predicted value: tensor([[0.6719], [0.7656], [0.7266], [1.0156], [0.7227], [0.8984], [0.8281], [0.8867], [1.0234], [0.4883], [0.8633], [0.3477], [0.2275], [0.2871], [0.2520], [0.2949]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.7148], [1.0000], [0.7148], [0.8008], [0.8320], [0.8008], [1.0000], [0.4668], [0.8008], [0.0625], [0.1670], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032806396484375 loss: 0.00262451171875loss: 0.00262451171875 loss: 0.003021240234375 predicted value: tensor([[0.4395], [0.3574], [0.3223], [1.0156], [0.4238], [0.4453], [0.5156], [1.0703], [1.0938], [0.6055], [0.4082], [0.3457], [0.4512], [0.4062], [0.2891], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3340], [0.3340], [1.0000], [0.4668], [0.3750], [0.2002], [1.0000], [1.0000], [0.6016], [0.4004], [0.3340], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00213623046875 loss: 0.003631591796875 loss: 0.002197265625 loss: 0.0034637451171875 predicted value: tensor([[0.6367], [0.5664], [0.8477], [0.4414], [0.5352], [0.2969], [0.7383], [0.7188], [0.7578], [0.5508], [0.5703], [0.5820], [0.5078], [0.4219], [0.3008], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.8320], [0.3340], [0.4668], [0.2500], [0.5547], [0.8008], [0.6016], [0.6016], [0.6016], [0.5000], [0.5000], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002166748046875 loss: 0.0017242431640625loss: 
0.00193023681640625 loss: 0.0019073486328125 predicted value: tensor([[0.5938], [0.5391], [1.1172], [0.8281], [0.6406], [0.5625], [0.8359], [0.8438], [0.6758], [0.6992], [0.7031], [0.7852], [0.5117], [0.3984], [0.2480], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [1.0000], [0.8008], [0.4668], [0.7500], [0.8008], [0.6016], [0.3340], [0.8008], [0.6016], [0.6680], [0.5000], [0.5000], [0.0625], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026092529296875 loss: 0.00531005859375 loss: 0.00173187255859375 loss: 0.005401611328125 21%|██ | 102/492 [54:55<5:33:49, 51.36s/it] {'loss': 0.0114, 'learning_rate': 1e-05, 'epoch': 0.21} 21%|██ | 102/492 [54:55<5:33:49, 51.36s/it]predicted value: tensor([[0.6133], [0.2988], [0.9141], [0.6172], [0.6016], [0.6914], [0.2812], [0.4805], [0.5586], [0.6445], [0.3887], [0.3535], [0.3828], [0.2432], [0.1348], [0.1611]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.2500], [1.0000], [0.8008], [0.6680], [0.6016], [0.3750], [0.5000], [0.8008], [0.7500], [0.4004], [0.5000], [0.3340], [0.0400], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004364013671875loss: 0.00127410888671875 loss: 0.003692626953125 loss: 0.00421142578125 predicted value: tensor([[0.3047], [0.3848], [0.7148], [0.0825], [0.9609], [0.5352], [0.9375], [0.6211], [0.3906], [0.4805], [0.5586], [0.4570], [0.3223], [0.3535], [0.1602], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5547], [0.8008], [0.2002], [1.0000], [0.6680], [1.0000], [0.4648], [0.4668], [0.6016], [0.4668], [0.6016], [0.5000], [0.5000], [0.2002], [0.0400]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004852294921875 loss: 0.0034942626953125 loss: 0.00494384765625 loss: 0.0037384033203125 predicted value: tensor([[0.5859], [0.3027], [0.6016], [0.3516], [0.9570], [0.3809], [0.3906], [0.3613], [0.2109], [0.4121], [0.5820], [0.8906], 
[0.5664], [0.3535], [0.2520], [0.2061]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.8008], [0.3340], [1.0000], [0.3750], [0.3750], [0.3750], [0.2002], [0.4004], [0.5000], [1.0000], [0.8008], [0.4004], [0.2852], [0.0400]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025177001953125 loss: 0.0023956298828125 loss: 0.0067138671875 loss: 0.004302978515625 predicted value: tensor([[0.3750], [0.7422], [0.2812], [0.3184], [0.4941], [0.9531], [0.3652], [0.4375], [0.6367], [0.5508], [0.7148], [0.4727], [0.3633], [0.1621], [0.0811], [0.1426]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.4668], [0.4668], [0.8008], [1.0000], [0.2500], [0.4277], [0.8320], [0.6016], [0.8008], [0.7500], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004364013671875 loss: 0.0035247802734375 loss: 0.0054931640625 loss: 0.00159454345703125 21%|██ | 103/492 [55:26<4:54:05, 45.36s/it] {'loss': 0.0154, 'learning_rate': 1e-05, 'epoch': 0.21} 21%|██ | 103/492 [55:26<4:54:05, 45.36s/it]predicted value: tensor([[0.4180], [0.3457], [0.4512], [0.2598], [0.9688], [0.1768], [0.4395], [0.4453], [0.6250], [0.4590], [0.8750], [0.3047], [0.3730], [0.1514], [0.3398], [0.1270]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.5547], [0.2500], [1.0000], [0.2002], [0.5000], [0.6016], [0.6016], [0.5000], [1.0000], [0.3340], [0.2852], [0.1670], [0.4004], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027313232421875 loss: 0.0023040771484375 loss: 0.005218505859375 loss: 0.0029144287109375 predicted value: tensor([[0.4531], [0.9375], [0.3574], [0.6172], [0.3477], [0.9219], [0.4980], [0.8906], [0.2812], [0.6602], [0.3027], [0.3867], [0.3945], [0.1836], [0.1387], [0.1982]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.7500], [0.8320], [0.4668], [1.0000], [0.6016], [1.0000], 
[0.2500], [0.6680], [0.4004], [0.5000], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0047607421875 loss: 0.0042724609375 loss: 0.00445556640625loss: 0.0022735595703125 predicted value: tensor([[0.3926], [0.6484], [0.9062], [0.3750], [0.2734], [0.9453], [0.9258], [0.4297], [0.9453], [0.3145], [0.3359], [0.3906], [0.3066], [0.2314], [0.1108], [0.1494]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [1.0000], [0.3750], [0.4668], [1.0000], [1.0000], [0.3750], [1.0000], [0.2002], [0.3340], [0.6016], [0.4004], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003509521484375 loss: 0.00390625 loss: 0.003326416015625 loss: 0.003662109375 predicted value: tensor([[0.5469], [0.3008], [0.8320], [0.9727], [0.9102], [0.9023], [0.9688], [0.2031], [0.1924], [0.5898], [0.9727], [0.4414], [0.2539], [0.1553], [0.1582], [0.1572]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3145], [0.8555], [1.0000], [1.0000], [1.0000], [1.0000], [0.2002], [0.0400], [0.8008], [1.0000], [0.7500], [0.4004], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00469970703125 loss: 0.004669189453125 loss: 0.003265380859375 loss: 0.00225830078125 21%|██ | 104/492 [55:57<4:25:25, 41.05s/it] {'loss': 0.0146, 'learning_rate': 1e-05, 'epoch': 0.21} 21%|██ | 104/492 [55:57<4:25:25, 41.05s/it]predicted value: tensor([[0.7656], [1.0781], [0.5820], [0.7031], [0.6836], [0.6914], [1.0625], [0.5781], [1.1016], [0.3848], [0.6914], [0.3828], [0.5977], [0.2734], [0.2930], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [0.4668], [0.5547], [0.5547], [0.8008], [1.0000], [0.5000], [1.0000], [0.3340], [0.7500], [0.2002], [0.6016], [0.0278], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00408935546875 loss: 0.00421142578125 loss: 0.003021240234375 loss: 0.0033721923828125 
predicted value: tensor([[0.5547], [0.5039], [0.4434], [1.1172], [0.5977], [0.7344], [0.5469], [1.0469], [0.5273], [0.6250], [0.5078], [0.5391], [0.5703], [0.4336], [0.5820], [0.3672]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.3750], [1.0000], [0.3750], [0.8008], [0.7500], [1.0000], [0.6016], [0.3145], [0.4004], [0.3340], [0.6016], [0.5000], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00457763671875 loss: 0.00506591796875loss: 0.006439208984375 loss: 0.0030670166015625 predicted value: tensor([[0.7383], [0.6133], [0.5156], [0.7109], [0.7422], [1.1250], [0.3184], [0.2578], [0.3594], [0.7070], [0.5508], [0.3418], [0.4023], [0.4668], [0.2793], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.5547], [0.4668], [0.6680], [0.8008], [1.0000], [0.3340], [0.2500], [0.2500], [0.8008], [0.5000], [0.4004], [0.4004], [0.4668], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0038604736328125 loss: 0.002227783203125 loss: 0.0010528564453125 loss: 0.005035400390625 predicted value: tensor([[0.6367], [0.3848], [0.4297], [0.6836], [0.4785], [1.0938], [0.9336], [0.5273], [0.7188], [0.7070], [0.6875], [0.7031], [0.4277], [0.3066], [0.2695], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2002], [0.4668], [0.6680], [0.3340], [1.0000], [0.8008], [0.3750], [0.6680], [0.6680], [0.3750], [0.6016], [0.4004], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00274658203125 loss: 0.0025177001953125 loss: 0.003936767578125 loss: 0.0034027099609375 21%|██▏ | 105/492 [56:29<4:07:25, 38.36s/it] {'loss': 0.0147, 'learning_rate': 1e-05, 'epoch': 0.21} 21%|██▏ | 105/492 [56:29<4:07:25, 38.36s/it]predicted value: tensor([[0.5234], [0.6289], [0.5703], [1.1172], [0.6914], [0.5977], [1.0859], [0.7617], [0.6602], [0.2715], [0.2988], [0.5938], [0.7227], [0.2285], [0.1943], [0.2402]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [1.0000], [0.6680], [0.4668], [1.0000], [0.8008], [0.5547], [0.0400], [0.2500], [0.6016], [0.7500], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004302978515625 loss: 0.0023345947265625loss: 0.0019989013671875 loss: 0.004638671875 predicted value: tensor([[0.7070], [0.4434], [0.7656], [0.5781], [0.3574], [0.8320], [0.3008], [0.5078], [0.6836], [0.5898], [0.4688], [0.4355], [0.4824], [0.7500], [0.4141], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.3750], [0.6680], [0.4668], [0.2500], [0.8008], [0.2500], [0.2500], [0.8008], [0.5000], [0.4004], [0.4004], [0.4004], [0.8008], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034332275390625 loss: 0.0025177001953125loss: 0.00156402587890625 loss: 0.005035400390625 predicted value: tensor([[0.4375], [0.5039], [0.4727], [0.4824], [0.4609], [1.0156], [0.4668], [1.1016], [0.8945], [0.8438], [0.4258], [0.5000], [0.4941], [0.5000], [0.2637], [0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3750], [0.3750], [0.4668], [0.2500], [1.0000], [0.2500], [1.0000], [0.8008], [0.8008], [0.4004], [0.5000], [0.6016], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030670166015625 loss: 0.0035552978515625 loss: 0.0026397705078125 loss: 0.00274658203125 predicted value: tensor([[0.8711], [0.4941], [0.5781], [0.2158], [0.5703], [0.6250], [0.5156], [0.5820], [0.8867], [1.0859], [0.4551], [0.5078], [0.4316], [0.4512], [0.4941], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.4668], [0.0625], [0.6016], [0.4668], [0.4668], [0.6016], [0.8008], [1.0000], [0.5000], [0.6016], [0.4004], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00170135498046875 loss: 0.0048828125 loss: 0.003570556640625 
loss: 0.003509521484375 22%|██▏ | 106/492 [57:00<3:52:29, 36.14s/it] {'loss': 0.0129, 'learning_rate': 1e-05, 'epoch': 0.22} 22%|██▏ | 106/492 [57:00<3:52:29, 36.14s/it]predicted value: tensor([[0.4004], [0.2852], [0.7031], [0.2393], [0.3066], [0.6172], [0.3164], [0.2383], [0.2344], [0.3164], [0.6094], [0.2793], [0.1836], [0.3301], [0.0513], [0.1055]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5000], [0.8008], [0.3340], [0.3340], [0.6680], [0.3340], [0.2500], [0.1426], [0.4004], [0.7500], [0.4004], [0.3340], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020599365234375 loss: 0.0030059814453125 loss: 0.0032501220703125 loss: 0.004425048828125 predicted value: tensor([[0.4062], [0.7266], [0.9922], [0.6758], [0.9609], [0.1768], [0.4258], [0.9531], [0.4141], [0.3164], [0.3066], [0.3145], [0.1660], [0.1079], [0.0898], [0.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [1.0000], [0.6680], [1.0000], [0.2500], [0.4668], [1.0000], [0.7500], [0.2500], [0.4004], [0.4004], [0.2500], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019683837890625 loss: 0.003448486328125loss: 0.00311279296875 loss: 0.0024261474609375 predicted value: tensor([[0.5000], [0.3281], [0.2100], [0.1963], [0.4141], [0.2158], [0.5039], [0.3086], [0.3652], [0.4512], [0.3574], [0.4316], [0.5820], [0.1328], [0.0942], [0.1147]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.2002], [0.2500], [0.5547], [0.3340], [0.5000], [0.2002], [0.2500], [0.5703], [0.4004], [0.5000], [0.5000], [0.1670], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006103515625 loss: 0.003875732421875 loss: 0.00185394287109375 loss: 0.00244140625 predicted value: tensor([[0.3555], [0.5977], [0.6602], [0.5898], [0.3711], [0.2539], [0.4355], [0.9961], [0.2812], [0.6875], [0.2559], [0.3867], [0.2305], [0.1475], [0.0879], 
[0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [0.8008], [0.8008], [0.4668], [0.2500], [0.4668], [1.0000], [0.4668], [0.8008], [0.3340], [0.5000], [0.0625], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028076171875 loss: 0.00701904296875 loss: 0.00250244140625 loss: 0.00323486328125 22%|██▏ | 107/492 [57:31<3:42:32, 34.68s/it] {'loss': 0.0134, 'learning_rate': 1e-05, 'epoch': 0.22} 22%|██▏ | 107/492 [57:31<3:42:32, 34.68s/it]predicted value: tensor([[0.4668], [0.3730], [0.9531], [0.4727], [0.4297], [0.2578], [0.4297], [0.1699], [0.2734], [0.4160], [0.5078], [0.3281], [0.5352], [0.0957], [0.1055], [0.0947]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [1.0000], [0.3750], [0.6016], [0.3340], [0.4668], [0.2500], [0.2500], [0.6016], [0.6016], [0.3340], [0.5000], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006011962890625 loss: 0.004119873046875 loss: 0.002899169921875 loss: 0.0027008056640625 predicted value: tensor([[0.3789], [0.2812], [0.2393], [0.2363], [0.3926], [0.4414], [0.5352], [0.3418], [0.3945], [0.3906], [0.3691], [0.1631], [0.3086], [0.3066], [0.1094], [0.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [0.3340], [0.3340], [0.3750], [0.5547], [0.7500], [0.3340], [0.4668], [0.4668], [0.5000], [0.0625], [0.3340], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037384033203125 loss: 0.0026397705078125 loss: 0.004302978515625 loss: 0.0031890869140625 predicted value: tensor([[0.9609], [0.6406], [0.6719], [0.5352], [0.6523], [0.6055], [0.4883], [0.5430], [0.4043], [0.5352], [0.0425], [0.4062], [0.4141], [0.1738], [0.1328], [0.1396]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.8008], [0.5547], [0.8008], [0.6680], [0.4668], [0.7500], [0.5000], [0.6680], [0.0625], 
[0.4668], [0.4668], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016021728515625 loss: 0.00274658203125 loss: 0.004241943359375 loss: 0.003387451171875 predicted value: tensor([[0.9805], [0.3711], [0.6250], [0.4316], [0.3730], [0.5742], [0.6641], [0.9727], [0.3359], [0.9258], [0.3945], [0.3047], [0.9531], [0.3203], [0.2910], [0.0806]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.8008], [0.4668], [0.4668], [0.8320], [0.8008], [1.0000], [0.3750], [1.0000], [0.6016], [0.5000], [1.0000], [0.2002], [0.0400], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003936767578125 loss: 0.004730224609375loss: 0.005035400390625 loss: 0.00274658203125 22%|██▏ | 108/492 [58:03<3:35:17, 33.64s/it] {'loss': 0.0145, 'learning_rate': 1e-05, 'epoch': 0.22} 22%|██▏ | 108/492 [58:03<3:35:17, 33.64s/it]predicted value: tensor([[0.6680], [0.4980], [0.6328], [0.7656], [0.7109], [0.5273], [0.6680], [0.3750], [0.6367], [0.5938], [0.4648], [0.4688], [0.2334], [0.2295], [0.2539], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.6016], [0.6680], [0.8008], [0.4668], [0.3750], [0.2500], [0.6016], [0.6016], [0.4004], [0.5000], [0.2002], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00160980224609375 loss: 0.0023193359375 loss: 0.00421142578125 loss: 0.002777099609375 predicted value: tensor([[0.3867], [1.0938], [0.5039], [0.6055], [0.6562], [0.4121], [0.5469], [0.6680], [0.5078], [1.0469], [0.6602], [0.4141], [0.6484], [0.4258], [0.2891], [0.2393]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [1.0000], [0.4668], [0.3750], [0.6016], [0.3340], [0.4668], [0.5000], [0.3750], [1.0000], [0.4668], [0.4004], [0.7500], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00262451171875 loss: 0.004608154296875 loss: 0.0030670166015625 loss: 0.0020904541015625 predicted 
value: tensor([[0.5273], [0.4844], [0.3789], [0.7383], [0.7891], [0.7930], [0.3555], [0.5195], [0.4688], [0.7812], [0.2734], [0.5156], [0.4434], [0.2520], [0.3281], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.2002], [0.6172], [0.8008], [0.8008], [0.3340], [0.5000], [0.2500], [0.7500], [0.2500], [0.7500], [0.5000], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0057373046875 loss: 0.003082275390625 loss: 0.00579833984375 loss: 0.0050048828125 predicted value: tensor([[0.6211], [0.7188], [0.8750], [0.3652], [1.1016], [0.5781], [0.5430], [0.5703], [1.0938], [0.7500], [0.3281], [0.5156], [0.4258], [0.5195], [0.2275], [0.2891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [0.8008], [0.2500], [1.0000], [0.4668], [0.4668], [0.7500], [1.0000], [0.8008], [0.0400], [0.5000], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0038604736328125 loss: 0.00299072265625 loss: 0.0031890869140625 loss: 0.0030364990234375 22%|██▏ | 109/492 [58:34<3:29:53, 32.88s/it] {'loss': 0.014, 'learning_rate': 1e-05, 'epoch': 0.22} 22%|██▏ | 109/492 [58:34<3:29:53, 32.88s/it]predicted value: tensor([[0.5547], [1.0469], [0.4277], [0.6953], [0.6289], [0.4844], [0.5977], [1.0234], [0.3027], [0.8125], [1.0391], [0.7617], [0.3848], [0.2891], [0.4668], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.2002], [0.8008], [0.5547], [0.2500], [0.6016], [1.0000], [0.3340], [0.8008], [1.0000], [0.5000], [0.2002], [0.1670], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002532958984375 loss: 0.00347900390625 loss: 0.0045166015625 loss: 0.003143310546875 predicted value: tensor([[0.7891], [1.0625], [0.4062], [0.8984], [0.3105], [0.7383], [0.7344], [0.7109], [0.5391], [0.8086], [1.0391], [0.3418], [0.2773], [0.4277], [0.4395], [0.2715]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [0.3750], [0.8320], [0.2500], [0.3750], [0.7500], [0.8008], [0.6016], [0.8008], [1.0000], [0.2500], [0.3340], [0.3340], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034637451171875 loss: 0.00299072265625loss: 0.0025482177734375 loss: 0.00146484375 predicted value: tensor([[0.7305], [0.3359], [0.4668], [0.8398], [0.5625], [0.6523], [0.8633], [0.5156], [0.6719], [0.5000], [0.4766], [0.6211], [0.4355], [0.4238], [0.2852], [0.3066]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.1670], [0.4668], [0.8008], [0.3750], [0.6016], [0.8008], [0.5000], [0.7500], [0.5000], [0.5000], [0.5000], [0.4004], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004730224609375 loss: 0.002166748046875 loss: 0.004058837890625 loss: 0.0015869140625 predicted value: tensor([[0.5977], [0.8281], [0.6758], [0.5508], [0.7266], [0.5977], [0.5430], [0.5234], [0.7852], [0.2490], [0.4277], [0.7734], [0.5547], [0.7070], [0.5352], [0.2451]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.5547], [0.3750], [0.4668], [0.6016], [0.6016], [0.6016], [0.4668], [0.0625], [0.2500], [0.7500], [0.4277], [0.7500], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030975341796875 loss: 0.0023651123046875 loss: 0.0031585693359375 loss: 0.0052490234375 22%|██▏ | 110/492 [59:05<3:27:01, 32.52s/it] {'loss': 0.0126, 'learning_rate': 1e-05, 'epoch': 0.22} 22%|██▏ | 110/492 [59:05<3:27:01, 32.52s/it]predicted value: tensor([[0.3281], [0.1055], [0.9688], [0.4941], [0.5781], [0.6250], [0.8828], [0.9023], [0.5312], [0.3496], [0.5508], [0.3359], [0.3027], [0.1279], [0.3242], [0.1216]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.0625], [1.0000], [0.6016], [0.4668], [0.8008], [1.0000], [1.0000], [0.4668], [0.4668], [0.5000], [0.4004], [0.4004], [0.2500], [0.4004], 
[0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002166748046875 loss: 0.00396728515625 loss: 0.00225830078125 loss: 0.00244140625 predicted value: tensor([[0.6484], [0.6250], [0.1562], [0.3926], [0.3535], [0.1387], [0.3066], [0.3242], [0.4141], [0.6172], [0.9609], [0.7695], [0.1699], [0.2734], [0.0811], [0.1221]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.8008], [0.2500], [0.3750], [0.4668], [0.2002], [0.4668], [0.4668], [0.4668], [0.8008], [1.0000], [1.0000], [0.2500], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032958984375 loss: 0.004180908203125loss: 0.00421142578125 loss: 0.0022430419921875 predicted value: tensor([[0.3477], [0.3281], [0.4316], [0.8945], [0.5742], [0.9062], [0.5625], [0.3711], [0.9180], [0.1836], [0.3418], [0.2236], [0.5469], [0.2451], [0.1216], [0.1025]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3750], [0.4668], [1.0000], [0.6680], [1.0000], [0.5000], [0.3750], [1.0000], [0.2500], [0.5000], [0.3340], [0.6016], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0042724609375 loss: 0.0020294189453125 loss: 0.0019989013671875 loss: 0.007720947265625 predicted value: tensor([[0.4902], [0.6680], [0.6406], [0.3906], [0.7578], [0.5938], [0.6836], [0.3633], [0.3516], [0.9141], [0.3594], [0.5117], [0.9180], [0.1162], [0.1279], [0.1523]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.8008], [0.3750], [0.8320], [0.4004], [0.8008], [0.3750], [0.3750], [1.0000], [0.4668], [0.7500], [1.0000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033111572265625 loss: 0.0031280517578125 loss: 0.00162506103515625 loss: 0.00323486328125 23%|██▎ | 111/492 [59:37<3:24:13, 32.16s/it] {'loss': 0.013, 'learning_rate': 1e-05, 'epoch': 0.23} 23%|██▎ | 111/492 [59:37<3:24:13, 32.16s/it]predicted value: tensor([[0.4258], [0.7578], [0.3945], 
[0.1807], [0.1562], [0.9609], [0.9062], [0.5703], [0.1338], [0.3105], [0.5859], [0.3223], [0.1738], [0.3379], [0.3496], [0.1426]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.3750], [0.2500], [0.2500], [1.0000], [1.0000], [0.8008], [0.2002], [0.4004], [0.6680], [0.4668], [0.0400], [0.2852], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024871826171875loss: 0.0020904541015625 loss: 0.004425048828125 loss: 0.00262451171875 predicted value: tensor([[0.4277], [0.3555], [0.9570], [0.7188], [0.6445], [0.9297], [0.2061], [0.3633], [0.5000], [0.4844], [0.6211], [0.3301], [0.4238], [0.1309], [0.3828], [0.1416]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [1.0000], [0.8008], [0.8008], [1.0000], [0.2500], [0.4668], [0.6016], [0.5000], [0.8008], [0.5000], [0.5000], [0.1670], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034027099609375 loss: 0.0025634765625 loss: 0.0028076171875 loss: 0.00335693359375 predicted value: tensor([[0.2676], [0.8867], [0.3125], [0.2559], [0.9375], [0.9219], [0.3965], [0.3965], [0.9297], [0.4121], [0.3535], [0.3945], [0.1445], [0.3047], [0.1279], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [1.0000], [0.4668], [0.3340], [1.0000], [1.0000], [0.5000], [0.3750], [1.0000], [0.3340], [0.3340], [0.5000], [0.1426], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002899169921875 loss: 0.00164031982421875 loss: 0.003662109375 loss: 0.0028076171875 predicted value: tensor([[0.9102], [0.3887], [0.3223], [0.4141], [0.4004], [0.2598], [0.4043], [0.5898], [0.1328], [0.3438], [0.4141], [0.6016], [0.4746], [0.1963], [0.1689], [0.1631]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.4668], [0.4668], [0.4668], [0.2500], [0.5547], [0.8008], [0.2002], [0.2500], [0.4668], [0.8008], [0.7500], [0.2002], [0.2500], 
[0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0068359375 loss: 0.0017852783203125 loss: 0.0038909912109375 loss: 0.0029296875 23%|██▎ | 112/492 [1:00:08<3:22:04, 31.91s/it] {'loss': 0.0126, 'learning_rate': 1e-05, 'epoch': 0.23} 23%|██▎ | 112/492 [1:00:08<3:22:04, 31.91s/it]predicted value: tensor([[0.6523], [0.5430], [0.5742], [0.5312], [1.1094], [0.4980], [0.5078], [1.0703], [1.0625], [0.5117], [0.4844], [0.2656], [0.4219], [0.7305], [0.4941], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.7500], [0.4668], [1.0000], [0.3750], [0.3340], [1.0000], [1.0000], [0.6016], [0.3340], [0.0278], [0.4004], [0.7500], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00182342529296875 loss: 0.003448486328125 loss: 0.0035552978515625 loss: 0.00457763671875 predicted value: tensor([[1.1094], [0.3105], [0.6289], [0.3398], [0.2637], [0.3066], [1.0234], [0.5156], [0.3398], [0.1953], [0.4414], [1.0625], [0.5547], [0.4785], [0.3125], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.5547], [0.3340], [0.1670], [0.2500], [1.0000], [0.4668], [0.2500], [0.0278], [0.5000], [1.0000], [0.4004], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0035552978515625 loss: 0.001800537109375loss: 0.002044677734375 loss: 0.0033111572265625 predicted value: tensor([[0.5469], [0.8203], [1.0703], [0.4980], [0.7969], [0.7617], [0.3496], [0.8203], [0.5117], [1.1016], [0.3203], [0.5234], [0.4668], [0.2988], [0.2871], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [1.0000], [0.3340], [0.8008], [0.8008], [0.2002], [0.8008], [0.3340], [1.0000], [0.2002], [0.4004], [0.4004], [0.0625], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034332275390625loss: 0.003204345703125 loss: 0.001373291015625 loss: 0.003997802734375 predicted value: tensor([[1.0781], [0.8203], 
[0.6328], [0.3633], [0.7930], [0.4297], [0.6250], [0.7891], [0.4473], [0.3906], [0.6094], [0.4199], [0.6367], [0.3770], [0.4277], [0.3066]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.8008], [0.2500], [0.8008], [0.4668], [0.5547], [0.8008], [0.2002], [0.4004], [0.6016], [0.5000], [0.7500], [0.4004], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024261474609375 loss: 0.0022430419921875 loss: 0.004730224609375 loss: 0.0037841796875 23%|██▎ | 113/492 [1:00:39<3:20:31, 31.74s/it] {'loss': 0.0123, 'learning_rate': 1e-05, 'epoch': 0.23} 23%|██▎ | 113/492 [1:00:39<3:20:31, 31.74s/it]predicted value: tensor([[0.4980], [0.6328], [1.0391], [0.8359], [0.6250], [1.0547], [0.5977], [0.5156], [0.4844], [0.6484], [0.6797], [0.5898], [0.4766], [0.5000], [0.2178], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6016], [0.8008], [1.0000], [0.8320], [0.3750], [1.0000], [0.3340], [0.3340], [0.3340], [0.5547], [0.6016], [0.6016], [0.6016], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019378662109375 loss: 0.00433349609375 loss: 0.00109100341796875 loss: 0.00131988525390625 predicted value: tensor([[0.8164], [0.2852], [1.0781], [0.4785], [0.7500], [0.3984], [1.0547], [0.8867], [0.3340], [0.6055], [0.6094], [0.5625], [0.5234], [0.4316], [0.3086], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [1.0000], [0.3750], [0.8008], [0.3340], [1.0000], [0.6680], [0.2500], [0.3750], [0.5000], [0.6016], [0.5000], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00151824951171875 loss: 0.00250244140625loss: 0.0034942626953125 loss: 0.00299072265625 predicted value: tensor([[1.0938], [1.0469], [0.8906], [0.4473], [0.6562], [1.0625], [1.0781], [0.3848], [0.5430], [0.6562], [0.6641], [0.5664], [0.2559], [0.3555], [0.2598], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
value label: tensor([[1.0000], [1.0000], [0.8555], [0.4668], [0.5547], [1.0000], [1.0000], [0.2500], [0.2500], [0.6016], [0.6016], [0.4668], [0.2002], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002899169921875 loss: 0.00262451171875 loss: 0.0035552978515625 loss: 0.0036773681640625 predicted value: tensor([[0.5156], [0.4785], [1.0703], [0.7500], [0.2314], [0.8828], [0.3750], [0.6953], [0.2676], [0.5781], [0.3574], [0.5312], [0.6602], [0.4961], [0.2676], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.7148], [0.2002], [0.8320], [0.2002], [0.4668], [0.2002], [0.6016], [0.2500], [0.4004], [0.7500], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00189208984375 loss: 0.0023956298828125loss: 0.0015716552734375 loss: 0.0033111572265625 23%|██▎ | 114/492 [1:01:10<3:18:37, 31.53s/it] {'loss': 0.0103, 'learning_rate': 1e-05, 'epoch': 0.23} 23%|██▎ | 114/492 [1:01:10<3:18:37, 31.53s/it]predicted value: tensor([[0.3457], [0.9492], [0.6602], [0.4473], [0.3008], [0.1533], [0.5430], [0.5391], [0.9492], [0.9023], [0.4102], [0.2910], [0.9336], [0.5117], [0.1299], [0.1377]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.8320], [0.6016], [0.3750], [0.2500], [0.5547], [0.6680], [1.0000], [1.0000], [0.4004], [0.5000], [1.0000], [0.7500], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011749267578125 loss: 0.003448486328125 loss: 0.004638671875 loss: 0.0054931640625 predicted value: tensor([[0.3789], [0.9023], [0.9297], [0.4121], [0.5000], [0.5156], [0.7500], [0.3145], [0.9570], [0.5156], [0.3887], [0.6758], [0.2520], [0.2373], [0.1777], [0.1182]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [0.4668], [0.7500], [0.5547], [0.8320], [0.4668], [1.0000], [0.7500], [0.4668], [0.6680], [0.3340], [0.5000], [0.1670], [0.2002]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.005157470703125 loss: 0.004119873046875 loss: 0.00335693359375 loss: 0.005523681640625 predicted value: tensor([[0.6523], [0.8359], [0.4434], [0.7148], [0.4590], [0.6875], [0.3223], [0.3730], [0.9766], [0.4414], [0.2539], [0.2246], [0.6602], [0.3516], [0.1240], [0.1455]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8750], [0.4668], [0.8008], [0.3750], [0.8008], [0.3750], [0.5000], [1.0000], [0.5000], [0.2500], [0.3340], [0.8008], [0.3340], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0047607421875 loss: 0.00188446044921875 loss: 0.00164794921875 loss: 0.005950927734375 predicted value: tensor([[0.6289], [0.3418], [0.6445], [0.8867], [0.8984], [0.5469], [0.5117], [0.3184], [0.6172], [0.4023], [0.4121], [0.4180], [0.3027], [0.1660], [0.1187], [0.1338]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [1.0000], [1.0000], [0.8008], [0.5547], [0.4668], [0.7500], [0.5000], [0.3750], [0.3340], [0.2500], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019989013671875 loss: 0.003448486328125 loss: 0.00244140625 loss: 0.002410888671875 23%|██▎ | 115/492 [1:01:42<3:17:29, 31.43s/it] {'loss': 0.0144, 'learning_rate': 1e-05, 'epoch': 0.23} 23%|██▎ | 115/492 [1:01:42<3:17:29, 31.43s/it]predicted value: tensor([[0.5234], [0.5586], [0.9336], [0.6055], [0.7305], [0.4375], [0.3828], [0.5742], [0.9570], [0.4805], [0.4590], [0.9766], [0.2949], [0.2539], [0.1826], [0.1240]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [0.5547], [1.0000], [0.6016], [0.6680], [0.5547], [0.4668], [0.5000], [1.0000], [0.5000], [0.5000], [1.0000], [0.3340], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026092529296875 loss: 0.00193023681640625 loss: 0.004791259765625 loss: 0.002593994140625 predicted value: tensor([[0.4551], [0.5273], [0.3164], 
[0.4980], [0.3809], [0.5742], [0.3750], [0.3125], [0.2412], [0.3750], [0.5547], [0.2793], [0.3105], [0.3633], [0.1592], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], [0.5547], [0.4668], [0.8008], [0.4668], [0.3750], [0.2500], [0.7500], [0.8008], [0.4004], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00185394287109375 loss: 0.00567626953125loss: 0.0024566650390625 loss: 0.0038909912109375 predicted value: tensor([[0.3418], [0.5703], [0.9453], [0.1270], [0.6719], [0.1895], [0.9180], [0.6289], [0.5234], [0.6523], [0.1641], [0.2393], [0.3184], [0.3438], [0.3301], [0.1680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [1.0000], [0.2500], [0.8008], [0.2002], [1.0000], [0.8008], [0.8008], [0.8008], [0.0625], [0.3340], [0.4004], [0.5000], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00311279296875 loss: 0.0045166015625loss: 0.0024871826171875 loss: 0.0037841796875 predicted value: tensor([[0.2949], [0.6523], [0.3926], [0.4082], [1.0078], [0.6406], [0.3379], [0.4375], [0.4590], [0.3887], [0.4355], [0.5820], [0.2363], [0.4160], [0.1855], [0.1221]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.4668], [0.5547], [1.0000], [0.8008], [0.4668], [0.7500], [0.7500], [0.6016], [0.4668], [0.6016], [0.2002], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004547119140625 loss: 0.0018768310546875 loss: 0.0050048828125 loss: 0.005645751953125 24%|██▎ | 116/492 [1:02:13<3:17:21, 31.49s/it] {'loss': 0.0142, 'learning_rate': 1e-05, 'epoch': 0.24} 24%|██▎ | 116/492 [1:02:13<3:17:21, 31.49s/it]predicted value: tensor([[0.6055], [1.0781], [1.0781], [0.5938], [0.4570], [0.2773], [1.0781], [0.5898], [0.3887], [0.5273], [0.4922], [0.6250], [0.4102], [0.3496], [0.2676], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.5547], [1.0000], [1.0000], [0.4668], [0.4668], [0.2500], [1.0000], [0.4668], [0.4004], [0.4668], [0.4004], [0.7500], [0.3340], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003387451171875 loss: 0.0047607421875 loss: 0.00150299072265625 loss: 0.0026092529296875 predicted value: tensor([[0.8672], [0.3770], [1.1172], [1.0938], [1.0391], [0.4688], [0.9023], [0.6719], [0.8203], [0.3672], [0.3770], [0.4805], [0.4414], [0.5234], [0.4043], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3340], [1.0000], [1.0000], [1.0000], [0.3340], [0.8008], [0.4668], [0.8008], [0.3340], [0.2500], [0.3340], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002532958984375 loss: 0.0026397705078125 loss: 0.0059814453125 loss: 0.003448486328125 predicted value: tensor([[0.5273], [0.5352], [1.0781], [1.0547], [0.7109], [0.9023], [0.5977], [0.4023], [0.5156], [0.4395], [0.4727], [0.4277], [0.4824], [0.4883], [0.2891], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [1.0000], [0.8008], [0.8008], [0.6016], [0.2500], [0.4668], [0.5000], [0.4004], [0.3340], [0.5000], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020294189453125 loss: 0.00180816650390625 loss: 0.000949859619140625 loss: 0.0032806396484375 predicted value: tensor([[0.6992], [0.5391], [0.5859], [0.6172], [1.1016], [0.7695], [0.3828], [0.4062], [0.5430], [0.6602], [0.4531], [0.5742], [0.5312], [0.4688], [0.2637], [0.3652]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.3750], [0.4668], [1.0000], [0.8008], [0.6016], [0.2500], [0.5000], [0.5000], [0.5000], [0.5000], [0.4277], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023956298828125 loss: 0.0037689208984375loss: 0.0028839111328125 loss: 0.0027923583984375 24%|██▍ | 117/492 
[1:02:45<3:17:07, 31.54s/it] {'loss': 0.0117, 'learning_rate': 1e-05, 'epoch': 0.24} 24%|██▍ | 117/492 [1:02:45<3:17:07, 31.54s/it]predicted value: tensor([[0.5703], [0.6172], [0.3574], [0.6602], [0.4629], [0.8242], [0.4316], [0.7578], [0.6641], [0.6719], [0.4082], [0.6406], [1.0234], [0.2676], [0.3301], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6016], [0.2500], [0.7500], [0.3750], [0.7148], [0.2500], [0.6016], [0.6016], [0.7500], [0.2500], [0.6016], [1.0000], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00183868408203125 loss: 0.0024871826171875 loss: 0.00103759765625 loss: 0.00347900390625 predicted value: tensor([[0.5586], [0.4902], [0.8516], [0.3027], [0.8594], [0.5625], [1.0547], [1.1172], [0.6992], [0.7148], [1.0469], [0.5469], [0.4609], [0.5117], [0.3203], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.6250], [0.2500], [0.8320], [0.4668], [1.0000], [1.0000], [0.5547], [0.6016], [1.0000], [0.4004], [0.3340], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019683837890625 loss: 0.0017852783203125 loss: 0.0026397705078125 loss: 0.00225830078125 predicted value: tensor([[0.6211], [0.5625], [0.4395], [0.5898], [1.0938], [1.0469], [0.7148], [1.0859], [0.6680], [0.7539], [0.3828], [0.5469], [0.4395], [0.3945], [0.2236], [0.2490]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3145], [0.3750], [0.4668], [0.3145], [1.0000], [1.0000], [0.8008], [1.0000], [0.6680], [0.8008], [0.2002], [0.6016], [0.4004], [0.5000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001556396484375 loss: 0.004547119140625loss: 0.00421142578125 loss: 0.00152587890625 predicted value: tensor([[0.6094], [0.6445], [1.0312], [0.2090], [1.0625], [1.1172], [0.3203], [0.3301], [0.5898], [1.0078], [0.6133], [0.4414], [0.5039], [0.4277], [0.3027], [0.4551]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [1.0000], [0.3340], [1.0000], [1.0000], [0.2002], [0.4004], [0.4668], [1.0000], [0.7500], [0.5000], [0.5000], [0.3340], [0.2500], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004425048828125 loss: 0.003448486328125 loss: 0.0037689208984375 loss: 0.00182342529296875 24%|██▍ | 118/492 [1:03:16<3:15:33, 31.37s/it] {'loss': 0.0107, 'learning_rate': 1e-05, 'epoch': 0.24} 24%|██▍ | 118/492 [1:03:16<3:15:33, 31.37s/it]predicted value: tensor([[0.4727], [0.1465], [0.3652], [0.4277], [0.6094], [0.1230], [0.5391], [0.3613], [0.9414], [0.6641], [0.0481], [0.2695], [0.1514], [0.1494], [0.1260], [0.1011]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.4668], [0.3340], [0.6680], [0.2002], [0.7500], [0.5000], [1.0000], [0.8008], [0.0625], [0.3340], [0.2500], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005035400390625 loss: 0.0024566650390625 loss: 0.0027618408203125 loss: 0.002410888671875 predicted value: tensor([[0.8672], [0.2266], [0.4570], [0.3711], [0.9102], [0.5859], [0.2422], [0.8555], [0.3047], [0.6914], [0.9023], [0.3047], [0.3945], [0.1787], [0.1738], [0.1367]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.5547], [0.3750], [1.0000], [0.8008], [0.2002], [1.0000], [0.5000], [0.8008], [1.0000], [0.2500], [0.3340], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023193359375 loss: 0.0028839111328125 loss: 0.00150299072265625 loss: 0.00164794921875 predicted value: tensor([[0.7969], [0.9336], [0.2539], [0.6484], [0.1875], [0.2910], [0.4453], [0.4375], [0.9219], [0.1465], [0.7617], [0.3340], [0.3770], [0.1445], [0.1504], [0.1621]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.2500], [0.6680], [0.3340], [0.4668], [0.5000], [0.6016], [1.0000], [0.3340], [0.8008], [0.4004], [0.4004], 
[0.1670], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034027099609375 loss: 0.002166748046875loss: 0.0030364990234375 loss: 0.0057373046875 predicted value: tensor([[0.9062], [0.8867], [0.1235], [0.4551], [0.8594], [0.9102], [0.6992], [0.3105], [0.2471], [0.3848], [0.5039], [0.3984], [0.1338], [0.1758], [0.1396], [0.1914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.2002], [0.4668], [1.0000], [1.0000], [0.8320], [0.3750], [0.3340], [0.6016], [0.4668], [0.4004], [0.1670], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002166748046875 loss: 0.005096435546875 loss: 0.0026397705078125 loss: 0.002410888671875 24%|██▍ | 119/492 [1:03:47<3:15:03, 31.38s/it] {'loss': 0.0119, 'learning_rate': 1e-05, 'epoch': 0.24} 24%|██▍ | 119/492 [1:03:47<3:15:03, 31.38s/it]predicted value: tensor([[0.4238], [0.4316], [0.5977], [0.5859], [0.8867], [0.4238], [0.6016], [0.2676], [0.4863], [0.5156], [0.3633], [0.4062], [0.3555], [0.3027], [0.3105], [0.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8008], [0.8320], [1.0000], [0.7500], [0.6680], [0.2002], [0.6016], [0.7500], [0.5000], [0.4004], [0.3340], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002197265625 loss: 0.00640869140625 loss: 0.004669189453125 loss: 0.0023040771484375 predicted value: tensor([[0.8828], [0.3301], [0.9141], [0.2227], [0.5469], [0.5430], [0.5664], [0.8867], [0.6562], [0.5547], [0.1816], [0.3477], [0.4512], [0.3828], [0.1216], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.2002], [0.7500], [0.4648], [0.7500], [1.0000], [0.8008], [0.7500], [0.2500], [0.3340], [0.5000], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00433349609375 loss: 0.003173828125 loss: 0.0028839111328125 loss: 0.0037078857421875 predicted value: 
tensor([[0.4746], [0.1582], [0.3848], [0.2793], [0.8789], [0.6680], [0.4707], [0.2910], [0.4961], [0.4316], [0.4141], [0.4121], [0.2891], [0.1299], [0.1445], [0.1523]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.4668], [0.3340], [1.0000], [0.5547], [0.6016], [0.3145], [0.6016], [0.5000], [0.6016], [0.5000], [0.2852], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019073486328125 loss: 0.0021209716796875 loss: 0.00323486328125 loss: 0.005645751953125 predicted value: tensor([[0.9258], [0.6406], [0.6016], [0.5625], [0.2871], [0.5156], [0.1196], [0.2695], [0.5430], [0.6250], [0.6992], [0.9180], [0.3926], [0.1235], [0.1943], [0.1523]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.8008], [0.8008], [0.2500], [0.6016], [0.0400], [0.3340], [0.7500], [0.8320], [0.8008], [1.0000], [0.5000], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032806396484375 loss: 0.0021820068359375 loss: 0.001678466796875 loss: 0.00421142578125 24%|██▍ | 120/492 [1:04:19<3:14:22, 31.35s/it] {'loss': 0.0135, 'learning_rate': 1e-05, 'epoch': 0.24} 24%|██▍ | 120/492 [1:04:19<3:14:22, 31.35s/it]predicted value: tensor([[0.6328], [0.3652], [0.8711], [0.6250], [0.6719], [0.2969], [0.6836], [1.0156], [0.4062], [0.6055], [0.4043], [0.6680], [0.5117], [0.4160], [0.2539], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.8320], [0.3750], [0.6680], [0.2500], [0.6680], [1.0000], [0.2002], [0.7500], [0.3340], [0.6016], [0.3340], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023345947265625 loss: 0.003143310546875loss: 0.00408935546875 loss: 0.0038909912109375 predicted value: tensor([[0.6094], [0.6367], [0.7344], [0.4551], [0.3145], [1.0156], [0.7148], [0.6055], [0.4629], [0.4941], [0.5273], [0.3242], [0.4453], [0.3652], [0.2910], [0.3008]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4648], [0.6680], [0.3750], [0.2500], [1.0000], [0.6680], [0.5000], [0.2002], [0.5000], [0.4004], [0.2002], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032501220703125 loss: 0.0030059814453125 loss: 0.0012359619140625 loss: 0.0029449462890625 predicted value: tensor([[0.3711], [0.5586], [0.4961], [0.9883], [0.3398], [0.5586], [0.2656], [0.3809], [0.7852], [0.7305], [0.6914], [0.7344], [0.4160], [0.2539], [0.3125], [0.2041]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.4668], [0.4668], [1.0000], [0.3340], [0.3750], [0.2500], [0.2500], [0.8008], [0.6016], [0.7500], [0.7500], [0.4004], [0.0204], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001251220703125 loss: 0.002655029296875 loss: 0.002044677734375 loss: 0.002593994140625 predicted value: tensor([[0.5703], [0.4590], [0.8672], [0.7969], [0.5742], [0.6562], [0.2930], [0.4121], [0.7031], [0.4727], [1.0391], [0.4590], [0.1484], [0.5312], [0.2676], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8008], [0.5703], [0.4668], [0.5547], [0.2500], [0.3340], [0.4668], [0.3750], [1.0000], [0.3340], [0.0400], [0.4004], [0.2500], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003082275390625 loss: 0.004241943359375 loss: 0.0033111572265625 loss: 0.0036163330078125 25%|██▍ | 121/492 [1:04:49<3:13:01, 31.22s/it] {'loss': 0.0117, 'learning_rate': 1e-05, 'epoch': 0.25} 25%|██▍ | 121/492 [1:04:49<3:13:01, 31.22s/it]predicted value: tensor([[0.6289], [1.0312], [0.4863], [0.4004], [1.0312], [0.4863], [1.0391], [0.4414], [0.7656], [0.5078], [1.0078], [0.6367], [0.3926], [0.5859], [0.2754], [0.2383]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3750], [0.2500], [1.0000], [0.6680], [1.0000], [0.4004], [0.8008], [0.6016], [1.0000], [0.7500], [0.2500], 
[0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020599365234375loss: 0.00180816650390625 loss: 0.00213623046875 loss: 0.00087738037109375 predicted value: tensor([[0.5547], [0.4902], [0.6289], [0.6953], [0.3262], [0.4785], [0.8359], [0.3496], [0.6367], [0.4688], [0.6055], [0.3789], [0.3594], [0.2871], [0.2695], [0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.5547], [0.8008], [0.2500], [0.3340], [0.8008], [0.2002], [0.6016], [0.5000], [0.5000], [0.3340], [0.2002], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024566650390625 loss: 0.002166748046875 loss: 0.0020904541015625 loss: 0.003387451171875 predicted value: tensor([[0.5977], [0.5586], [1.0000], [0.5742], [0.4180], [1.0391], [0.4824], [0.6523], [1.0000], [1.0391], [0.6875], [0.3906], [0.6406], [0.2734], [0.3184], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [1.0000], [0.5547], [0.3340], [1.0000], [0.4668], [0.6016], [1.0000], [1.0000], [0.6680], [0.1670], [0.4668], [0.2002], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025634765625loss: 0.0020751953125 loss: 0.002105712890625 loss: 0.0018463134765625 predicted value: tensor([[0.4746], [0.5781], [0.5273], [0.5664], [0.6719], [0.5586], [0.5000], [0.4570], [0.6875], [1.0156], [0.7031], [0.3574], [0.6875], [0.4414], [0.2578], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3145], [0.5547], [0.4668], [0.5547], [0.8008], [0.2500], [0.3750], [0.3750], [0.8008], [1.0000], [0.6016], [0.2500], [0.7500], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0017852783203125 loss: 0.00113677978515625 loss: 0.0035247802734375 loss: 0.0036163330078125 25%|██▍ | 122/492 [1:05:21<3:13:00, 31.30s/it] {'loss': 0.0089, 'learning_rate': 1e-05, 'epoch': 0.25} 25%|██▍ | 122/492 [1:05:21<3:13:00, 31.30s/it]predicted value: 
tensor([[0.6016], [0.8945], [0.9492], [0.3203], [0.6016], [0.5625], [0.5469], [0.6094], [0.3809], [0.4043], [0.6406], [0.4180], [0.2197], [0.3574], [0.1514], [0.3223]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [1.0000], [1.0000], [0.4668], [0.8008], [0.7500], [0.7500], [0.8008], [0.5000], [0.5000], [0.8008], [0.4004], [0.2500], [0.4004], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003631591796875 loss: 0.00396728515625loss: 0.002410888671875 loss: 0.0023651123046875 predicted value: tensor([[0.9062], [0.3867], [0.7031], [0.9336], [0.3438], [0.3477], [0.5469], [0.9570], [0.9414], [0.2812], [0.5664], [0.4316], [0.9297], [0.1914], [0.3574], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.6680], [1.0000], [0.4668], [0.2500], [0.6016], [1.0000], [1.0000], [0.2500], [0.6016], [0.5000], [1.0000], [0.2500], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002044677734375 loss: 0.00110626220703125loss: 0.00299072265625 loss: 0.00445556640625 predicted value: tensor([[0.9023], [0.3359], [0.1924], [0.4238], [0.2070], [0.6055], [0.6445], [0.2598], [0.9062], [0.4473], [0.5391], [0.4258], [0.4219], [0.3984], [0.1807], [0.1035]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3340], [0.4668], [0.2500], [0.7500], [0.8008], [0.2002], [1.0000], [0.6016], [0.5000], [0.2500], [0.3340], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005615234375 loss: 0.003265380859375 loss: 0.004547119140625 loss: 0.0013580322265625 predicted value: tensor([[0.3613], [0.8945], [0.7227], [0.2812], [0.3848], [0.3145], [0.3203], [0.3887], [0.9336], [0.4238], [0.3184], [0.1523], [0.1138], [0.1807], [0.3398], [0.1543]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.8320], [0.3340], [0.4668], [0.2500], [0.4668], [0.3750], [1.0000], [0.6016], 
[0.3340], [0.7500], [0.0400], [0.2002], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00286865234375 loss: 0.007354736328125 loss: 0.003814697265625 loss: 0.002960205078125 25%|██▌ | 123/492 [1:05:52<3:12:01, 31.22s/it] {'loss': 0.0137, 'learning_rate': 1e-05, 'epoch': 0.25} 25%|██▌ | 123/492 [1:05:52<3:12:01, 31.22s/it]predicted value: tensor([[0.3418], [0.2812], [0.6211], [0.5352], [0.5742], [0.6992], [0.2832], [0.5625], [0.6328], [0.4297], [0.2285], [0.3477], [0.3945], [0.1846], [0.3281], [0.1650]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3750], [0.8320], [0.8008], [0.4668], [0.6680], [0.3340], [0.6016], [0.8008], [0.4668], [0.2500], [0.3340], [0.4004], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0031585693359375 loss: 0.0022735595703125 loss: 0.00616455078125 loss: 0.002777099609375 predicted value: tensor([[0.6875], [0.4297], [0.3105], [0.4746], [0.5117], [0.1260], [0.4883], [0.9219], [0.3887], [0.2910], [0.3965], [0.4062], [0.5156], [0.3730], [0.1680], [0.1377]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.3750], [0.5000], [0.3750], [0.2500], [0.4668], [1.0000], [0.2500], [0.2002], [0.4004], [0.5000], [0.6016], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00186920166015625loss: 0.0038909912109375 loss: 0.004241943359375 loss: 0.003631591796875 predicted value: tensor([[0.6367], [0.4160], [0.5938], [0.7617], [0.3672], [0.6523], [0.5039], [0.2148], [0.5898], [0.4062], [0.5820], [0.2656], [0.3359], [0.5000], [0.1855], [0.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.8320], [0.8008], [0.4668], [0.8008], [0.6016], [0.3340], [0.7500], [0.6016], [0.6016], [0.3340], [0.3340], [0.4277], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00164031982421875 loss: 0.003631591796875 loss: 0.004058837890625 loss: 
0.0013885498046875 predicted value: tensor([[0.9648], [0.9609], [0.4980], [0.9297], [0.9375], [0.1992], [0.3320], [0.3691], [0.9688], [0.4512], [0.5781], [0.2363], [0.9648], [0.2188], [0.0503], [0.3262]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.5547], [1.0000], [1.0000], [0.2500], [0.2500], [0.4668], [1.0000], [0.3750], [0.5000], [0.2500], [1.0000], [0.2002], [0.0400], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004547119140625 loss: 0.003082275390625 loss: 0.00384521484375 loss: 0.000843048095703125 25%|██▌ | 124/492 [1:06:24<3:12:02, 31.31s/it] {'loss': 0.0128, 'learning_rate': 1e-05, 'epoch': 0.25} 25%|██▌ | 124/492 [1:06:24<3:12:02, 31.31s/it]predicted value: tensor([[1.1094], [0.8281], [0.6992], [0.5938], [1.0781], [1.0703], [0.7031], [0.8945], [0.5508], [0.8281], [0.5508], [0.4902], [0.5391], [0.4746], [0.3008], [0.2363]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [0.4668], [0.5547], [1.0000], [1.0000], [0.4668], [0.8008], [0.5000], [0.5703], [0.3340], [0.5000], [0.4004], [0.4004], [0.1426], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026092529296875 loss: 0.0052490234375loss: 0.0029449462890625 loss: 0.0034332275390625 predicted value: tensor([[0.6523], [0.3008], [0.5664], [1.0703], [0.7852], [0.8672], [0.2949], [0.7031], [0.6602], [0.5742], [0.4199], [0.5195], [0.3184], [0.3789], [0.2393], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.2002], [0.3750], [1.0000], [0.8008], [0.8008], [0.2500], [0.4668], [0.6016], [0.6016], [0.2500], [0.6016], [0.0400], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0031585693359375 loss: 0.0038604736328125 loss: 0.003753662109375 loss: 0.00286865234375 predicted value: tensor([[0.6562], [0.7109], [0.8711], [0.8320], [0.7617], [0.4863], [1.0469], [0.6797], [0.5156], [0.3984], [0.7539], [0.5000], [0.3105], [0.2910], 
[0.3047], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6680], [0.8320], [0.8008], [0.8008], [0.2500], [1.0000], [0.6016], [0.6016], [0.3340], [0.3750], [0.7500], [0.2002], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030364990234375 loss: 0.005218505859375 loss: 0.0023345947265625 loss: 0.00144195556640625 predicted value: tensor([[0.5352], [0.7500], [0.4766], [0.7656], [0.4980], [0.2871], [1.0859], [0.4844], [0.5000], [0.4805], [0.8320], [0.4688], [0.2910], [0.2578], [0.2539], [0.3164]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.7148], [0.4668], [0.8008], [0.3750], [0.2500], [1.0000], [0.4668], [0.2500], [0.4004], [0.8008], [0.2500], [0.2002], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030517578125loss: 0.00244140625 loss: 0.0029754638671875 loss: 0.0040283203125 25%|██▌ | 125/492 [1:06:55<3:11:42, 31.34s/it] {'loss': 0.0131, 'learning_rate': 1e-05, 'epoch': 0.25} 25%|██▌ | 125/492 [1:06:55<3:11:42, 31.34s/it]predicted value: tensor([[1.0703], [1.1562], [0.2773], [0.5195], [0.4785], [0.5195], [0.5625], [0.5820], [0.6875], [0.6875], [0.4805], [0.4746], [0.5703], [0.2168], [0.1992], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.2002], [0.4668], [0.4668], [0.4668], [0.5000], [0.6016], [0.6016], [0.6016], [0.4004], [0.4004], [0.2500], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028839111328125 loss: 0.0028076171875loss: 0.004119873046875 loss: 0.004302978515625 predicted value: tensor([[0.5234], [0.5938], [1.1172], [1.1406], [0.8555], [1.1016], [0.7461], [0.7109], [1.1016], [1.1094], [1.0859], [0.3926], [0.5352], [0.3809], [0.2490], [0.2119]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [1.0000], [1.0000], [0.8008], [1.0000], [0.8008], [0.7500], [1.0000], [1.0000], [1.0000], 
[0.2852], [0.4004], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003570556640625 loss: 0.0019378662109375 loss: 0.002655029296875 loss: 0.0012054443359375 predicted value: tensor([[0.8711], [0.4121], [0.8047], [0.3438], [0.8789], [0.3867], [0.5977], [0.5547], [1.1641], [0.4160], [0.6875], [0.4199], [0.4980], [0.4453], [0.2773], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3340], [0.8008], [0.2002], [0.8008], [0.2500], [0.5547], [0.5000], [1.0000], [0.5000], [0.6016], [0.4004], [0.3340], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027923583984375 loss: 0.00213623046875 loss: 0.00518798828125 loss: 0.003570556640625 predicted value: tensor([[0.4922], [1.0625], [0.5078], [0.8086], [1.0938], [0.8477], [0.5469], [0.6289], [0.3145], [0.5547], [0.6367], [0.6836], [0.4648], [0.5000], [0.2432], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.8008], [1.0000], [0.8008], [0.4668], [0.6016], [0.2002], [0.3340], [0.5000], [0.5000], [0.4004], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004241943359375 loss: 0.005950927734375loss: 0.002471923828125 loss: 0.0035400390625 26%|██▌ | 126/492 [1:07:26<3:10:12, 31.18s/it] {'loss': 0.0133, 'learning_rate': 1e-05, 'epoch': 0.26} 26%|██▌ | 126/492 [1:07:26<3:10:12, 31.18s/it]predicted value: tensor([[0.4609], [0.2090], [0.4570], [0.3926], [0.3340], [0.1865], [0.6445], [0.2598], [0.4727], [0.9570], [0.3555], [0.5039], [0.3535], [0.2070], [0.1230], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.4668], [0.4668], [0.4668], [0.2002], [0.6016], [0.3340], [0.6016], [1.0000], [0.4004], [0.5000], [0.5000], [0.3340], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037384033203125loss: 0.00186920166015625 loss: 0.001983642578125 loss: 0.00506591796875 predicted 
value: tensor([[0.4395], [0.3926], [0.4629], [0.9727], [0.6797], [0.9844], [0.3047], [0.6875], [0.3770], [0.4414], [0.7500], [0.2949], [0.3359], [0.1719], [0.3789], [0.1167]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.5547], [1.0000], [0.8008], [1.0000], [0.6016], [0.8008], [0.3145], [0.5000], [0.6016], [0.4004], [0.3340], [0.2500], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003021240234375loss: 0.002716064453125 loss: 0.0038299560546875 loss: 0.002349853515625 predicted value: tensor([[0.7148], [0.2988], [0.9805], [0.2930], [0.9062], [0.4043], [0.3203], [0.3652], [0.3633], [0.8984], [0.3906], [0.3066], [0.3281], [0.2832], [0.1367], [0.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [1.0000], [0.2715], [1.0000], [0.5000], [0.4668], [0.6016], [0.4277], [1.0000], [0.4004], [0.4004], [0.4004], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023651123046875 loss: 0.0028839111328125 loss: 0.003387451171875 loss: 0.0024566650390625 predicted value: tensor([[0.4766], [0.4727], [0.3691], [0.9453], [0.9844], [0.2344], [0.9766], [0.9648], [0.4883], [0.2539], [0.2910], [0.2949], [0.3535], [0.4453], [0.3828], [0.0898]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [1.0000], [1.0000], [0.2500], [1.0000], [1.0000], [0.3750], [0.3340], [0.2002], [0.4004], [0.4004], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012969970703125 loss: 0.00194549560546875 loss: 0.001068115234375 loss: 0.0028839111328125 26%|██▌ | 127/492 [1:07:57<3:09:29, 31.15s/it] {'loss': 0.0107, 'learning_rate': 1e-05, 'epoch': 0.26} 26%|██▌ | 127/492 [1:07:57<3:09:29, 31.15s/it]predicted value: tensor([[0.4199], [0.9297], [0.4004], [0.9414], [0.9297], [0.7891], [0.6680], [0.2363], [0.3984], [0.1836], [0.5703], [0.1533], [0.2031], [0.1426], [0.1543], [0.1641]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [1.0000], [1.0000], [0.8008], [0.7500], [0.2500], [0.2500], [0.0400], [0.6016], [0.1670], [0.1670], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00102996826171875 loss: 0.0013580322265625loss: 0.0020904541015625 loss: 0.0028839111328125 predicted value: tensor([[0.9648], [0.6914], [0.6289], [0.9648], [0.3750], [0.4434], [0.4668], [0.2617], [0.5469], [0.2041], [0.4531], [0.6133], [0.4043], [0.1572], [0.1328], [0.1357]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.6680], [1.0000], [0.3750], [0.3750], [0.5547], [0.2500], [0.4668], [0.2500], [0.5000], [0.2500], [0.3340], [0.1670], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015106201171875 loss: 0.0032958984375loss: 0.00154876708984375 loss: 0.0034027099609375 predicted value: tensor([[0.5156], [0.2520], [0.7070], [0.1377], [0.6445], [0.5000], [0.0938], [0.1953], [0.5820], [0.3828], [0.2949], [0.3691], [0.4199], [0.1465], [0.2061], [0.1348]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.8008], [0.2002], [0.4668], [0.6016], [0.0278], [0.3340], [0.6016], [0.6016], [0.5000], [0.4004], [0.4004], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015106201171875 loss: 0.002899169921875loss: 0.0029449462890625 loss: 0.0018768310546875 predicted value: tensor([[0.6445], [0.3867], [0.3809], [0.3516], [0.4844], [0.6172], [0.9727], [0.5859], [0.5586], [0.3613], [0.3750], [0.3496], [0.3359], [0.2188], [0.1787], [0.1182]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.3750], [0.3750], [0.4668], [0.8008], [1.0000], [0.7500], [0.3340], [0.4004], [0.4004], [0.4004], [0.5000], [0.2500], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027618408203125loss: 0.00086212158203125 loss: 
0.0032806396484375 loss: 0.0018310546875 26%|██▌ | 128/492 [1:08:28<3:09:32, 31.24s/it] {'loss': 0.0088, 'learning_rate': 1e-05, 'epoch': 0.26} 26%|██▌ | 128/492 [1:08:28<3:09:32, 31.24s/it]predicted value: tensor([[0.5312], [1.0938], [1.0859], [0.3203], [0.7227], [0.8594], [0.3652], [1.0703], [0.6758], [0.4141], [0.3027], [0.8555], [0.4688], [0.2930], [0.2656], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [0.2500], [0.6680], [0.8008], [0.2500], [1.0000], [0.5000], [0.2002], [0.2002], [0.8008], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00433349609375 loss: 0.0024566650390625 loss: 0.00701904296875 loss: 0.00323486328125 predicted value: tensor([[0.8398], [0.7969], [0.5664], [1.0938], [0.7383], [1.1328], [0.8984], [0.8047], [0.5742], [1.0781], [0.6172], [0.5430], [0.5273], [0.3281], [0.2334], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [0.4668], [1.0000], [0.6680], [1.0000], [0.8320], [0.4668], [0.4648], [1.0000], [0.6016], [0.3340], [0.4668], [0.2500], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004119873046875 loss: 0.00396728515625 loss: 0.0017547607421875 loss: 0.004608154296875 predicted value: tensor([[1.0469], [0.5938], [0.6250], [0.4434], [0.6016], [0.6367], [0.3145], [0.2891], [0.6680], [0.7930], [0.6328], [0.5781], [0.4414], [0.5664], [0.3027], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.5703], [0.4668], [0.3145], [0.7500], [0.3340], [0.2500], [0.4668], [0.6680], [0.7500], [0.5000], [0.3340], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004730224609375 loss: 0.00347900390625 loss: 0.003997802734375 loss: 0.0024566650390625 predicted value: tensor([[0.4688], [0.5938], [0.7500], [0.3652], [0.7500], [0.7891], [0.8477], [0.6953], [0.6289], [0.5195], [0.4883], [0.7969], [0.4941], 
[0.2656], [0.2656], [0.3027]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8320], [0.2500], [0.3750], [0.4004], [0.8320], [0.6016], [0.6016], [0.4668], [0.4004], [0.8008], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002105712890625 loss: 0.002532958984375 loss: 0.00494384765625 loss: 0.005584716796875 26%|██▌ | 129/492 [1:08:59<3:08:42, 31.19s/it] {'loss': 0.0153, 'learning_rate': 1e-05, 'epoch': 0.26} 26%|██▌ | 129/492 [1:08:59<3:08:42, 31.19s/it]predicted value: tensor([[0.4980], [0.4707], [1.0234], [0.5664], [0.8203], [0.8281], [0.3066], [0.7734], [0.3418], [0.5898], [0.6914], [0.5352], [0.3906], [0.4238], [0.3008], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [1.0000], [0.4668], [0.6680], [0.8320], [0.3340], [0.6016], [0.2500], [0.6016], [0.6016], [0.5000], [0.2500], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003387451171875 loss: 0.0022125244140625loss: 0.003021240234375 loss: 0.003936767578125 predicted value: tensor([[0.4160], [1.0156], [1.0547], [0.7461], [0.7734], [1.0703], [1.0391], [1.0234], [1.0312], [0.7461], [0.7500], [0.6680], [0.5156], [0.4922], [0.2988], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [1.0000], [0.6680], [0.6016], [1.0000], [1.0000], [1.0000], [1.0000], [0.8008], [0.8008], [0.6680], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002197265625loss: 0.003814697265625 loss: 0.0032806396484375 loss: 0.0020751953125 predicted value: tensor([[0.5039], [0.6719], [0.7812], [1.0391], [0.7617], [0.4805], [0.8203], [0.7188], [0.5117], [0.4902], [0.8750], [0.4629], [0.5156], [0.5469], [0.5117], [0.2295]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [0.8320], [1.0000], [0.6680], [0.4668], [0.8008], [0.6016], [0.4668], [0.4277], 
[0.8008], [0.5000], [0.5000], [0.5000], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004486083984375 loss: 0.00089263916015625 loss: 0.001007080078125 loss: 0.001190185546875 predicted value: tensor([[0.7617], [0.5625], [1.0469], [0.5195], [0.7148], [0.8320], [0.4863], [0.9766], [1.0312], [0.5586], [0.5039], [0.5742], [0.5742], [0.4980], [0.5625], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6016], [1.0000], [0.4668], [0.6016], [0.8320], [0.2500], [1.0000], [1.0000], [0.7500], [0.3340], [0.4004], [0.5000], [0.3340], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020751953125 loss: 0.001007080078125 loss: 0.0033416748046875loss: 0.0031585693359375 26%|██▋ | 130/492 [1:09:31<3:08:27, 31.24s/it] {'loss': 0.0103, 'learning_rate': 1e-05, 'epoch': 0.26} 26%|██▋ | 130/492 [1:09:31<3:08:27, 31.24s/it]predicted value: tensor([[0.4316], [0.9102], [0.8242], [0.1953], [0.2188], [0.2012], [0.5000], [0.2539], [0.6250], [0.4980], [0.3477], [0.3203], [0.3320], [0.3496], [0.1709], [0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.2500], [0.3340], [0.2500], [0.8008], [0.3340], [0.6016], [0.5000], [0.2852], [0.2002], [0.3340], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005767822265625 loss: 0.0030059814453125 loss: 0.00173187255859375 loss: 0.0030059814453125 predicted value: tensor([[0.3555], [0.4746], [0.6602], [0.4785], [0.5000], [0.3691], [0.2793], [0.8945], [0.5352], [0.2695], [0.9453], [0.2812], [0.4609], [0.2812], [0.3535], [0.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8008], [0.5547], [0.6016], [0.4668], [0.3340], [1.0000], [0.5000], [0.2002], [1.0000], [0.3340], [0.2500], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003997802734375 loss: 0.0026702880859375 loss: 0.0020751953125 loss: 
0.002685546875 predicted value: tensor([[0.4961], [0.3555], [0.5234], [0.2041], [0.1348], [0.5664], [0.8867], [0.8711], [0.4824], [0.5547], [0.3145], [0.4258], [0.3477], [0.3652], [0.2656], [0.1504]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8008], [0.2002], [0.2002], [0.6016], [1.0000], [1.0000], [0.7500], [0.6016], [0.4004], [0.4004], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016326904296875 loss: 0.004241943359375 loss: 0.006805419921875 loss: 0.0030517578125 predicted value: tensor([[0.9180], [0.4219], [0.4824], [0.3926], [0.4844], [0.9570], [0.2578], [0.5078], [0.2949], [0.5156], [0.8828], [0.2988], [0.3691], [0.1963], [0.1650], [0.2451]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4648], [0.4668], [0.6680], [1.0000], [0.2002], [0.7500], [0.3340], [0.6680], [1.0000], [0.4668], [0.5000], [0.0625], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003387451171875 loss: 0.00372314453125 loss: 0.0059814453125 loss: 0.0011444091796875 27%|██▋ | 131/492 [1:10:02<3:07:25, 31.15s/it] {'loss': 0.0137, 'learning_rate': 1e-05, 'epoch': 0.27} 27%|██▋ | 131/492 [1:10:02<3:07:25, 31.15s/it]predicted value: tensor([[0.4961], [0.2480], [0.3398], [0.4043], [0.5664], [0.3086], [0.3809], [0.4043], [0.6406], [0.6797], [0.9688], [0.3828], [0.4863], [0.1787], [0.3574], [0.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.4668], [0.3750], [0.5547], [0.3750], [0.6016], [0.6680], [0.5000], [0.6680], [1.0000], [0.5000], [0.6016], [0.2500], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012969970703125 loss: 0.003387451171875loss: 0.001373291015625 loss: 0.0019073486328125 predicted value: tensor([[0.3574], [0.3359], [0.9297], [0.4258], [0.6562], [0.9102], [0.5977], [0.9375], [0.9102], [0.8906], [0.5195], [0.3516], [0.4004], [0.2637], [0.1992], 
[0.1748]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [1.0000], [0.4668], [0.8320], [1.0000], [0.8008], [1.0000], [1.0000], [1.0000], [0.6016], [0.7500], [0.3340], [0.7500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00836181640625loss: 0.00225830078125 loss: 0.00176239013671875 loss: 0.0023956298828125 predicted value: tensor([[0.3320], [0.3184], [0.9062], [0.6016], [0.5859], [0.2275], [0.4551], [0.4590], [0.4609], [0.6055], [0.5898], [0.3906], [0.0103], [0.5195], [0.3809], [0.1660]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3340], [1.0000], [0.8008], [0.4668], [0.3340], [0.4668], [0.4668], [0.5000], [0.6680], [0.7500], [0.5000], [0.0278], [0.6016], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00201416015625 loss: 0.001983642578125 loss: 0.004241943359375 loss: 0.0027618408203125 predicted value: tensor([[0.6641], [0.4219], [0.9414], [0.5391], [0.9414], [0.2559], [0.1021], [0.3652], [0.4980], [0.4707], [0.3945], [0.3789], [0.3691], [0.3906], [0.2285], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [1.0000], [0.5547], [1.0000], [0.3340], [0.2500], [0.4668], [0.5000], [0.3750], [0.4668], [0.3340], [0.4004], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0036773681640625 loss: 0.0030364990234375 loss: 0.00131988525390625 loss: 0.0028228759765625 27%|██▋ | 132/492 [1:10:33<3:08:03, 31.34s/it] {'loss': 0.0112, 'learning_rate': 1e-05, 'epoch': 0.27} 27%|██▋ | 132/492 [1:10:33<3:08:03, 31.34s/it]predicted value: tensor([[0.5273], [0.6797], [0.2832], [0.3672], [0.5195], [0.6875], [0.6484], [0.6445], [0.5547], [0.5117], [0.6836], [0.4551], [0.4609], [0.3066], [0.3008], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5547], [0.2500], [0.2500], [0.4668], [0.6016], [0.7500], [0.6016], [0.5000], [0.3340], 
[0.5000], [0.3340], [0.2002], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00408935546875 loss: 0.00164031982421875 loss: 0.003875732421875 loss: 0.0028228759765625 predicted value: tensor([[0.6250], [0.6406], [1.0625], [0.5586], [0.7031], [0.8164], [0.3184], [0.4316], [0.6836], [0.2676], [0.4121], [0.5195], [1.0391], [0.4785], [0.3184], [0.2891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [1.0000], [0.4648], [0.5000], [0.8320], [0.2500], [0.7500], [0.6016], [0.3340], [0.3340], [0.5000], [1.0000], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0036773681640625 loss: 0.0035400390625 loss: 0.0032806396484375 loss: 0.00168609619140625 predicted value: tensor([[0.9141], [0.3516], [1.1016], [0.6250], [0.8867], [1.0625], [0.6328], [0.7266], [1.0156], [1.0703], [0.6719], [0.4863], [0.5156], [0.3184], [0.3438], [0.4629]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3340], [1.0000], [0.7500], [0.8320], [1.0000], [0.7500], [0.5547], [1.0000], [1.0000], [0.5000], [0.4004], [0.4004], [0.2002], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016326904296875 loss: 0.0027618408203125 loss: 0.0037994384765625 loss: 0.00148773193359375 predicted value: tensor([[0.5117], [1.0938], [0.5078], [0.5039], [0.5156], [0.8438], [0.6758], [1.0625], [0.3887], [0.5703], [1.0625], [0.4570], [0.6055], [0.2734], [0.2715], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.4668], [0.5000], [0.8008], [0.6016], [1.0000], [0.3340], [0.4668], [1.0000], [0.4004], [0.6016], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00286865234375 loss: 0.0017242431640625loss: 0.002227783203125 loss: 0.00131988525390625 27%|██▋ | 133/492 [1:11:05<3:07:22, 31.32s/it] {'loss': 0.0106, 'learning_rate': 1e-05, 'epoch': 0.27} 27%|██▋ | 133/492 
[1:11:05<3:07:22, 31.32s/it]predicted value: tensor([[0.6055], [1.0703], [0.7812], [0.5508], [1.0547], [0.3574], [1.0000], [0.7383], [0.7969], [0.2910], [0.6367], [1.0391], [0.4746], [0.3613], [0.3066], [0.3184]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8008], [0.4668], [1.0000], [0.2500], [1.0000], [0.8008], [0.8008], [0.2500], [0.6016], [1.0000], [0.7500], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020751953125 loss: 0.002197265625loss: 0.00177001953125 loss: 0.002838134765625 predicted value: tensor([[1.1250], [1.0625], [0.5195], [0.7383], [0.5195], [0.5273], [0.5312], [1.0781], [0.7930], [0.6133], [0.5508], [0.4082], [0.4453], [0.5391], [0.4727], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.6680], [0.4668], [0.4668], [0.4668], [1.0000], [0.8008], [0.6016], [0.3750], [0.5000], [0.3340], [0.5000], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030517578125 loss: 0.00341796875 loss: 0.001922607421875 loss: 0.002044677734375 predicted value: tensor([[0.5586], [1.0547], [0.5195], [1.0703], [1.0938], [1.0547], [0.3242], [0.6484], [1.0938], [0.4844], [0.3691], [0.5156], [0.3984], [0.2852], [0.4277], [0.3086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [1.0000], [1.0000], [1.0000], [0.3340], [0.4668], [1.0000], [0.2002], [0.2500], [0.6016], [0.2852], [0.1670], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00262451171875 loss: 0.006072998046875 loss: 0.003753662109375 loss: 0.00262451171875 predicted value: tensor([[0.7852], [0.5586], [0.5352], [0.7109], [0.6484], [0.2988], [0.7812], [0.3359], [0.5391], [1.0469], [0.5352], [0.7812], [0.4180], [0.5078], [0.2520], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.3750], [0.4668], [0.6680], [0.5547], [0.2500], [0.6680], 
[0.3340], [0.3750], [1.0000], [0.5000], [0.5703], [0.3340], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003753662109375 loss: 0.0027008056640625 loss: 0.0029449462890625loss: 0.0015716552734375 27%|██▋ | 134/492 [1:11:35<3:05:50, 31.15s/it] {'loss': 0.0113, 'learning_rate': 1e-05, 'epoch': 0.27} 27%|██▋ | 134/492 [1:11:35<3:05:50, 31.15s/it]predicted value: tensor([[1.0078], [0.4102], [0.5820], [0.9609], [0.5469], [0.3672], [0.2637], [0.5508], [0.5156], [0.3984], [0.2871], [0.3125], [0.3984], [0.1953], [0.1699], [0.1582]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [1.0000], [0.4648], [0.3750], [0.2500], [0.7500], [0.7500], [0.5000], [0.2500], [0.4004], [0.5000], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00103759765625 loss: 0.0030670166015625loss: 0.0024566650390625 loss: 0.0026397705078125 predicted value: tensor([[0.6992], [0.2285], [0.5820], [0.5039], [0.4551], [0.3711], [0.2891], [0.3887], [0.6836], [0.5859], [0.5352], [0.2871], [0.8828], [0.1396], [0.3223], [0.1475]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.3340], [0.8008], [0.6680], [0.4668], [0.4668], [0.2500], [0.4668], [0.6680], [0.8008], [0.5000], [0.4004], [1.0000], [0.1670], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009765625 loss: 0.003082275390625 loss: 0.003173828125 loss: 0.00189971923828125 predicted value: tensor([[0.4902], [0.9688], [0.9297], [0.4785], [0.6602], [0.4980], [0.5664], [0.6016], [0.4863], [0.3965], [0.5156], [0.1299], [0.3516], [0.1797], [0.1670], [0.1533]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [1.0000], [1.0000], [0.4648], [0.8008], [0.6016], [0.8320], [0.7500], [0.5000], [0.6016], [0.4668], [0.0625], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00286865234375 loss: 0.0028839111328125 loss: 
0.0021820068359375 loss: 0.0027618408203125 predicted value: tensor([[0.2354], [0.4141], [0.5156], [0.5312], [1.0000], [0.6445], [0.9141], [0.7031], [0.9492], [0.5703], [0.4648], [0.9844], [0.3730], [0.1660], [0.1680], [0.1855]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.4668], [0.6016], [0.8320], [1.0000], [0.7500], [1.0000], [0.8008], [1.0000], [0.7500], [0.6016], [1.0000], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00299072265625 loss: 0.0016937255859375 loss: 0.00238037109375 loss: 0.003021240234375 27%|██▋ | 135/492 [1:12:06<3:05:06, 31.11s/it] {'loss': 0.0098, 'learning_rate': 1e-05, 'epoch': 0.27} 27%|██▋ | 135/492 [1:12:06<3:05:06, 31.11s/it] predicted value: tensor([[0.4668], [0.4102], [0.4141], [0.7500], [0.5742], [0.2520], [0.5312], [0.5430], [0.1172], [0.5352], [0.2930], [0.4785], [0.3340], [0.1445], [0.1660], [0.2168]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.8008], [0.7500], [0.2500], [0.7500], [0.5000], [0.0625], [0.7500], [0.4004], [0.6016], [0.4004], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00445556640625 loss: 0.0030670166015625 loss: 0.00136566162109375 loss: 0.0013580322265625 predicted value: tensor([[0.7852], [0.7461], [0.3750], [0.4805], [1.0078], [0.9727], [0.9648], [0.5781], [0.7500], [0.5312], [0.3711], [0.3789], [0.3398], [0.1738], [0.1494], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [0.3750], [0.5547], [1.0000], [1.0000], [1.0000], [0.6016], [0.8008], [0.3750], [0.4004], [0.5000], [0.3340], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001251220703125 loss: 0.000896453857421875 loss: 0.0022430419921875 loss: 0.0028533935546875 predicted value: tensor([[0.9336], [1.0234], [0.7695], [0.3984], [0.7422], [0.3887], [0.3691], [0.5547], [0.3770], [0.9531], [0.5977], [0.4824],
[0.3809], [0.3457], [0.1787], [0.1553]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.8320], [0.4668], [0.8008], [0.3340], [0.4668], [0.6016], [0.3750], [1.0000], [0.6016], [0.7500], [0.2500], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024871826171875 loss: 0.002105712890625 loss: 0.00115966796875 loss: 0.0014495849609375 predicted value: tensor([[0.5625], [0.3848], [0.9844], [0.1680], [0.4414], [0.4375], [0.6641], [0.9648], [0.5234], [0.4141], [0.4004], [0.2969], [0.4648], [0.1426], [0.1104], [0.1553]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.2500], [0.5547], [0.4668], [0.6680], [1.0000], [0.6016], [0.6016], [0.4004], [0.3340], [0.5000], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016632080078125 loss: 0.00262451171875 loss: 0.00482177734375 loss: 0.00164794921875 28%|██▊ | 136/492 [1:12:38<3:04:34, 31.11s/it] {'loss': 0.0089, 'learning_rate': 1e-05, 'epoch': 0.28} 28%|██▊ | 136/492 [1:12:38<3:04:34, 31.11s/it]predicted value: tensor([[0.5625], [0.7227], [0.6406], [0.5781], [1.0625], [0.5195], [0.8242], [0.7500], [0.7070], [0.5469], [0.6758], [0.5820], [0.5391], [0.4688], [0.4297], [0.4199]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.4668], [0.6016], [1.0000], [0.4668], [0.6680], [0.5000], [0.7500], [0.4004], [0.7500], [0.7500], [0.5000], [0.3340], [0.4004], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004638671875 loss: 0.003509521484375loss: 0.00347900390625 loss: 0.00167083740234375 predicted value: tensor([[0.7344], [0.5117], [0.8203], [0.8945], [0.4473], [1.0938], [0.3359], [0.3320], [0.7852], [0.6289], [0.5078], [0.7617], [0.4609], [0.5820], [0.4785], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.8008], [0.8008], [0.3750], [1.0000], [0.2002], [0.2500], 
[0.6016], [0.5547], [0.5000], [0.8008], [0.4004], [0.7500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010986328125 loss: 0.0028228759765625 loss: 0.0028076171875 loss: 0.0024871826171875 predicted value: tensor([[0.5703], [0.4961], [0.8398], [1.1562], [1.0703], [0.6875], [0.6953], [0.4570], [0.8008], [0.7109], [0.7070], [0.4980], [1.0781], [0.3066], [0.3633], [0.3066]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [1.0000], [1.0000], [0.7500], [0.6016], [0.5000], [0.8008], [0.7500], [0.7500], [0.3340], [1.0000], [0.3340], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002777099609375 loss: 0.0018768310546875 loss: 0.002655029296875 loss: 0.00127410888671875 predicted value: tensor([[0.5703], [0.3633], [0.7500], [0.4961], [0.8555], [0.5352], [0.4551], [0.3887], [0.4473], [1.0312], [0.4902], [0.6602], [0.5508], [0.3809], [0.2754], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3340], [0.6016], [0.4668], [0.8008], [0.4668], [0.3750], [0.3340], [0.2383], [1.0000], [0.5000], [0.5000], [0.2500], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00201416015625 loss: 0.0035552978515625 loss: 0.0024261474609375 loss: 0.0035247802734375 28%|██▊ | 137/492 [1:13:10<3:05:46, 31.40s/it] {'loss': 0.0107, 'learning_rate': 1e-05, 'epoch': 0.28} 28%|██▊ | 137/492 [1:13:10<3:05:46, 31.40s/it]predicted value: tensor([[1.1328], [0.5703], [0.7031], [0.4961], [0.8594], [0.5625], [0.6523], [0.5703], [1.0391], [0.5234], [1.0703], [0.6133], [0.4375], [0.5156], [0.2373], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.4668], [0.3750], [0.8008], [0.3145], [0.6016], [0.4668], [1.0000], [0.4668], [1.0000], [0.2002], [0.4668], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00225830078125 loss: 0.00567626953125loss: 0.0030975341796875 
loss: 0.001708984375 predicted value: tensor([[0.4941], [0.5117], [1.1172], [1.0469], [0.3301], [0.3672], [0.7266], [0.5977], [0.6250], [0.7500], [0.6562], [1.1094], [0.4277], [0.4395], [0.2676], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [1.0000], [0.2002], [0.2500], [0.6680], [0.7500], [0.4668], [0.8008], [0.6016], [1.0000], [0.5000], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021514892578125 loss: 0.0020294189453125 loss: 0.0086669921875 loss: 0.002044677734375 predicted value: tensor([[0.5156], [0.3828], [0.2812], [0.4609], [1.1406], [0.4805], [0.5352], [0.7852], [0.2969], [0.3789], [1.0938], [0.2236], [0.4727], [0.4590], [0.2871], [0.3262]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2002], [0.2500], [0.3750], [1.0000], [0.3750], [0.4668], [0.8320], [0.2500], [0.2500], [1.0000], [0.0278], [0.3340], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00167083740234375 loss: 0.003021240234375 loss: 0.0029449462890625 loss: 0.00164031982421875 predicted value: tensor([[0.6484], [0.5430], [0.8945], [0.9023], [0.5273], [0.8672], [0.5469], [1.0625], [0.5703], [0.4844], [0.7109], [0.4492], [0.3203], [0.3105], [0.6172], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8320], [0.8008], [0.4668], [0.8008], [0.4668], [1.0000], [0.5000], [0.3340], [0.7500], [0.4004], [0.0400], [0.2500], [0.6016], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003021240234375 loss: 0.002227783203125 loss: 0.0029144287109375 loss: 0.0027008056640625 28%|██▊ | 138/492 [1:13:41<3:04:31, 31.28s/it] {'loss': 0.0119, 'learning_rate': 1e-05, 'epoch': 0.28} 28%|██▊ | 138/492 [1:13:41<3:04:31, 31.28s/it]predicted value: tensor([[0.4512], [0.3965], [0.3770], [0.9336], [0.6523], [0.6562], [0.3926], [0.5508], [0.7656], [0.4609], [0.9570], [0.5781], [0.3398], [0.3379], 
[0.3242], [0.1582]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.4668], [1.0000], [0.8008], [0.8008], [0.4668], [0.5000], [0.8320], [0.6016], [1.0000], [0.6016], [0.3340], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00170135498046875 loss: 0.0015716552734375 loss: 0.00185394287109375 loss: 0.00118255615234375 predicted value: tensor([[0.3496], [0.8008], [0.4492], [0.4434], [0.4141], [0.9102], [0.1611], [0.2598], [0.2539], [0.3047], [0.2275], [0.3574], [0.1904], [0.1719], [0.1914], [0.1611]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.3340], [0.3750], [0.4668], [1.0000], [0.2500], [0.1426], [0.2500], [0.4668], [0.2002], [0.4004], [0.2002], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00171661376953125 loss: 0.0013275146484375loss: 0.002227783203125 loss: 0.00130462646484375 predicted value: tensor([[0.4883], [0.1904], [0.4551], [0.9531], [0.1621], [0.9453], [0.9414], [0.2236], [0.9297], [0.4727], [0.5742], [0.3926], [0.3105], [0.1118], [0.1562], [0.1855]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.6680], [1.0000], [0.2500], [1.0000], [1.0000], [0.3340], [1.0000], [0.6016], [0.6016], [0.4668], [0.4004], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030975341796875 loss: 0.005828857421875 loss: 0.0019378662109375 loss: 0.0015716552734375 predicted value: tensor([[0.4902], [0.9219], [0.4922], [0.4141], [0.1582], [0.9297], [0.6211], [0.5547], [0.6523], [0.3770], [0.2969], [0.6680], [0.3809], [0.1387], [0.1846], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4648], [0.5547], [0.3340], [1.0000], [0.6680], [0.6680], [0.7500], [0.6016], [0.4668], [0.6016], [0.4004], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00159454345703125 loss: 
0.00145721435546875 loss: 0.00182342529296875 loss: 0.0031280517578125 28%|██▊ | 139/492 [1:14:12<3:04:09, 31.30s/it] {'loss': 0.0083, 'learning_rate': 1e-05, 'epoch': 0.28} 28%|██▊ | 139/492 [1:14:12<3:04:09, 31.30s/it]predicted value: tensor([[0.4707], [0.9609], [0.4668], [0.9766], [0.9727], [0.5586], [0.4688], [0.6328], [0.9336], [0.9961], [0.3535], [0.4043], [0.2246], [0.1787], [0.1475], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3145], [1.0000], [1.0000], [0.7500], [0.4004], [0.8008], [1.0000], [1.0000], [0.4004], [0.5000], [0.3340], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00225830078125 loss: 0.00250244140625 loss: 0.0020904541015625 loss: 0.00140380859375 predicted value: tensor([[0.9414], [0.4141], [0.5703], [0.3887], [0.6172], [0.6289], [0.3965], [0.7266], [0.3574], [0.1245], [0.4551], [0.9258], [0.2988], [0.3574], [0.1523], [0.1504]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.8008], [0.4668], [0.5703], [0.7500], [0.2002], [0.6680], [0.2500], [0.3340], [0.5000], [1.0000], [0.2002], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003875732421875 loss: 0.0028076171875 loss: 0.0037078857421875 loss: 0.003143310546875 predicted value: tensor([[0.3691], [0.4414], [0.9258], [0.7148], [0.4219], [0.9805], [0.9492], [0.5742], [0.9297], [0.3086], [0.3223], [0.9375], [0.4453], [0.2090], [0.3613], [0.1377]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.8008], [0.4668], [1.0000], [1.0000], [0.7500], [1.0000], [0.4004], [0.4004], [1.0000], [0.4004], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023040771484375 loss: 0.00118255615234375 loss: 0.00140380859375 loss: 0.0026092529296875 predicted value: tensor([[0.4141], [0.7930], [0.9883], [0.1816], [0.9570], [0.6562], [0.7383], [0.9531], [0.6211], 
[0.3594], [0.4141], [0.3828], [0.2656], [0.4238], [0.2988], [0.1680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [1.0000], [0.2500], [1.0000], [0.7500], [0.8320], [1.0000], [0.6016], [0.4004], [0.3340], [0.4004], [0.3340], [0.4004], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00127410888671875 loss: 0.0024566650390625 loss: 0.00225830078125 loss: 0.000858306884765625 28%|██▊ | 140/492 [1:14:43<3:03:19, 31.25s/it] {'loss': 0.009, 'learning_rate': 1e-05, 'epoch': 0.28} 28%|██▊ | 140/492 [1:14:43<3:03:19, 31.25s/it]predicted value: tensor([[0.6523], [0.8203], [1.0312], [0.6172], [0.7773], [0.5781], [0.7266], [0.6133], [0.5938], [0.5586], [1.0156], [0.4863], [0.4453], [0.6719], [0.3027], [0.5195]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [1.0000], [0.4668], [0.8008], [0.4668], [0.6016], [0.8008], [0.8008], [0.4668], [1.0000], [0.3750], [0.5000], [0.6016], [0.2500], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004180908203125 loss: 0.003082275390625 loss: 0.002899169921875 loss: 0.0021209716796875 predicted value: tensor([[1.0625], [0.3574], [0.8320], [0.6289], [1.0703], [0.8281], [0.7539], [0.8203], [0.4004], [0.4629], [0.3105], [0.4434], [0.4688], [0.2969], [0.2793], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [0.8008], [0.4668], [1.0000], [0.8008], [0.8008], [0.6680], [0.2002], [0.4004], [0.0625], [0.6016], [0.4004], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000873565673828125 loss: 0.003326416015625loss: 0.0027923583984375 loss: 0.0026397705078125 predicted value: tensor([[0.8789], [0.4824], [0.4746], [0.5938], [0.8516], [0.3555], [0.6836], [0.6406], [0.4863], [0.4609], [1.0547], [0.3398], [0.5859], [0.5234], [0.4863], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.3750], [0.5547], 
[0.8008], [0.2500], [0.6016], [0.6016], [0.4004], [0.5000], [1.0000], [0.6680], [0.5000], [0.5000], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030059814453125 loss: 0.002960205078125loss: 0.004547119140625 loss: 0.0017852783203125 predicted value: tensor([[0.5273], [0.5078], [1.0859], [0.5430], [0.6172], [0.9258], [1.0781], [0.4492], [0.6055], [1.0312], [0.8203], [0.4453], [0.4297], [0.2402], [0.3066], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [1.0000], [0.4668], [0.4668], [0.8008], [1.0000], [0.3340], [0.6680], [1.0000], [0.8008], [0.4004], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002838134765625 loss: 0.003753662109375 loss: 0.0013275146484375 loss: 0.001983642578125 29%|██▊ | 141/492 [1:15:14<3:02:32, 31.20s/it] {'loss': 0.011, 'learning_rate': 1e-05, 'epoch': 0.29} 29%|██▊ | 141/492 [1:15:14<3:02:32, 31.20s/it]predicted value: tensor([[0.8828], [0.5742], [0.8398], [0.5273], [0.8359], [0.2949], [0.6914], [0.8672], [0.7461], [0.4336], [0.6758], [0.5117], [0.5352], [0.2949], [0.3125], [0.2930]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.3750], [0.8008], [0.4668], [0.8008], [0.2002], [0.7500], [0.8008], [0.5000], [0.4004], [0.5000], [0.5000], [0.5000], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025177001953125 loss: 0.0034942626953125loss: 0.0016632080078125 loss: 0.004180908203125 predicted value: tensor([[0.9023], [0.6367], [0.6445], [0.8555], [0.4609], [0.7344], [0.4844], [1.0391], [0.4922], [0.8203], [0.2891], [0.5430], [0.6367], [0.5195], [0.4863], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.5547], [0.7500], [0.8008], [0.3750], [0.6016], [0.4668], [1.0000], [0.4668], [0.6680], [0.0400], [0.6016], [0.6016], [0.5000], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021209716796875 loss: 
0.0021820068359375 loss: 0.0030975341796875 loss: 0.00138092041015625 predicted value: tensor([[0.5586], [1.0547], [0.3457], [0.3574], [0.5039], [0.8086], [0.8477], [0.7031], [0.6875], [0.3906], [0.6836], [0.4707], [0.2393], [0.3047], [0.5742], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.2500], [0.2500], [0.3750], [0.7500], [0.8008], [0.6016], [0.6016], [0.4668], [0.6016], [0.4004], [0.0625], [0.0400], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0052490234375 loss: 0.00335693359375 loss: 0.00341796875 loss: 0.0038604736328125 predicted value: tensor([[0.5352], [1.0703], [0.4707], [0.9180], [0.5312], [0.8672], [0.5625], [0.7109], [0.6367], [0.6094], [0.4355], [0.8164], [0.2520], [0.5117], [0.2559], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [0.8320], [0.4668], [0.8008], [0.6016], [0.6680], [0.3750], [0.6016], [0.3340], [0.8008], [0.2500], [0.7500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00390625 loss: 0.0038604736328125 loss: 0.0029754638671875loss: 0.00323486328125 29%|██▉ | 142/492 [1:15:46<3:02:52, 31.35s/it] {'loss': 0.0126, 'learning_rate': 1e-05, 'epoch': 0.29} 29%|██▉ | 142/492 [1:15:46<3:02:52, 31.35s/it]predicted value: tensor([[0.6055], [0.2275], [0.6875], [0.4043], [0.4199], [0.4102], [0.5938], [0.9180], [0.3633], [0.9219], [0.7031], [0.3262], [0.3535], [0.2812], [0.3457], [0.1689]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.2002], [0.8008], [0.3750], [0.4668], [0.3750], [0.6016], [1.0000], [0.4668], [1.0000], [0.8008], [0.4004], [0.4004], [0.5000], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020599365234375 loss: 0.0037384033203125loss: 0.002410888671875 loss: 0.0022430419921875 predicted value: tensor([[0.7891], [0.9141], [0.4180], [0.2256], [0.6328], [0.9609], [0.4824], [0.3887], [0.3633], [0.4258], [0.3125], 
[0.3945], [0.1836], [0.3086], [0.1387], [0.1787]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.4668], [0.3340], [0.6680], [1.0000], [0.5547], [0.4668], [0.3750], [0.4004], [0.4004], [0.4004], [0.2500], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025787353515625 loss: 0.00093841552734375loss: 0.001495361328125 loss: 0.00274658203125 predicted value: tensor([[0.4590], [0.3652], [0.6719], [0.4473], [0.9453], [0.9336], [0.5039], [0.3262], [0.9766], [0.4727], [0.1523], [0.6250], [0.9102], [0.2021], [0.1235], [0.1826]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.3750], [0.6680], [0.4668], [1.0000], [1.0000], [0.5000], [0.2002], [1.0000], [0.7500], [0.0400], [0.6016], [1.0000], [0.2002], [0.2002], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026702880859375 loss: 0.0023651123046875loss: 0.001983642578125 loss: 0.001800537109375 predicted value: tensor([[0.9453], [0.3633], [0.5117], [0.9648], [0.1367], [0.5703], [0.2227], [0.4766], [0.9375], [0.5000], [0.3633], [0.3535], [0.3320], [0.5352], [0.3145], [0.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [1.0000], [0.2002], [0.7500], [0.3340], [0.6016], [1.0000], [0.4277], [0.4004], [0.4004], [0.3340], [0.7500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00323486328125 loss: 0.004119873046875 loss: 0.0023040771484375loss: 0.00119781494140625 29%|██▉ | 143/492 [1:16:17<3:02:00, 31.29s/it] {'loss': 0.0095, 'learning_rate': 1e-05, 'epoch': 0.29} 29%|██▉ | 143/492 [1:16:17<3:02:00, 31.29s/it]predicted value: tensor([[0.7422], [0.4062], [0.9531], [0.1201], [0.9492], [0.3516], [0.2773], [0.9414], [0.9766], [0.5039], [0.4492], [0.3125], [0.4316], [0.2168], [0.5000], [0.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [1.0000], [0.2500], [1.0000], [0.4668], [0.3340], 
[1.0000], [1.0000], [0.7500], [0.7500], [0.4004], [0.4004], [0.2002], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0017242431640625 loss: 0.00189208984375 loss: 0.003387451171875 loss: 0.0030517578125 predicted value: tensor([[0.4922], [0.4902], [0.9727], [0.5156], [0.4043], [0.6484], [0.4863], [0.0293], [0.9922], [0.2480], [0.5703], [0.3613], [0.5898], [0.3867], [0.2061], [0.1553]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [1.0000], [0.4648], [0.3750], [0.6680], [0.6016], [0.0278], [1.0000], [0.2500], [0.6016], [0.5000], [0.6016], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029296875 loss: 0.000789642333984375 loss: 0.00213623046875 loss: 0.00244140625 predicted value: tensor([[0.4316], [0.3594], [0.4922], [0.6797], [0.9492], [0.1162], [0.7617], [0.4824], [0.5508], [0.5156], [0.4785], [0.5000], [0.3418], [0.3477], [0.1611], [0.3828]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.5547], [0.6680], [1.0000], [0.2500], [0.8008], [0.6016], [0.6016], [0.6016], [0.6016], [0.6016], [0.3340], [0.5000], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00164794921875 loss: 0.0019989013671875 loss: 0.0025634765625 loss: 0.0010528564453125 predicted value: tensor([[0.9375], [0.4160], [0.7930], [0.2158], [0.7852], [0.4395], [0.4336], [0.6484], [0.3945], [0.2246], [0.2168], [0.3633], [0.6992], [0.5508], [0.1406], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8320], [0.3340], [0.6680], [0.4668], [0.5547], [0.8008], [0.4668], [0.3340], [0.0400], [0.4004], [0.6016], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00179290771484375 loss: 0.0016937255859375 loss: 0.0021820068359375 loss: 0.0019378662109375 29%|██▉ | 144/492 [1:16:48<3:01:28, 31.29s/it] {'loss': 0.0083, 'learning_rate': 1e-05, 'epoch': 0.29} 29%|██▉ | 144/492 
[1:16:48<3:01:28, 31.29s/it]predicted value: tensor([[0.5977], [0.9297], [0.6523], [0.2754], [0.5156], [0.8477], [0.7031], [0.8086], [0.7812], [0.5508], [0.4746], [0.6719], [0.4375], [0.5039], [0.3047], [0.3086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.8008], [0.6016], [0.2002], [0.3750], [0.6680], [0.6016], [0.6680], [0.6680], [0.4668], [0.3340], [0.5000], [0.4004], [0.5000], [0.2500], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004425048828125 loss: 0.002838134765625loss: 0.0037384033203125 loss: 0.0022735595703125 predicted value: tensor([[1.0703], [0.4531], [0.3672], [0.4062], [0.6133], [1.0547], [0.3379], [0.7617], [0.8047], [1.0625], [0.3652], [0.5039], [0.7656], [0.3848], [0.2812], [0.3184]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.2500], [0.2002], [0.5547], [1.0000], [0.3340], [0.3340], [0.6016], [1.0000], [0.7500], [0.4004], [0.6016], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001251220703125 loss: 0.00799560546875loss: 0.0033721923828125 loss: 0.0040283203125 predicted value: tensor([[0.8750], [0.5273], [0.5508], [0.5156], [0.7109], [0.8711], [1.0547], [1.1328], [0.6875], [0.5078], [0.7266], [0.6836], [0.5039], [0.4492], [0.4863], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.3750], [0.4668], [0.7500], [0.8008], [1.0000], [1.0000], [0.7500], [0.4004], [0.7500], [0.6016], [0.5000], [0.3340], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002227783203125 loss: 0.001953125 loss: 0.00171661376953125 loss: 0.00360107421875 predicted value: tensor([[0.4707], [0.5117], [0.5234], [0.5078], [0.8945], [0.4277], [0.8203], [1.0469], [0.5234], [0.6094], [0.3340], [0.6328], [0.4629], [0.2910], [0.2754], [0.3203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.3750], [0.4668], [0.7500], [0.3145], 
[0.8008], [1.0000], [0.3750], [0.5000], [0.3340], [0.6016], [0.4004], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00173187255859375 loss: 0.005401611328125loss: 0.005157470703125 loss: 0.0019989013671875 29%|██▉ | 145/492 [1:17:21<3:02:36, 31.58s/it] {'loss': 0.0134, 'learning_rate': 1e-05, 'epoch': 0.29} 29%|██▉ | 145/492 [1:17:21<3:02:36, 31.58s/it]predicted value: tensor([[0.4199], [0.4785], [0.6016], [1.0703], [1.1094], [1.0859], [1.0625], [0.7891], [0.5156], [0.8633], [0.5781], [0.5195], [1.0625], [0.2598], [0.2754], [0.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.3750], [0.4668], [1.0000], [1.0000], [1.0000], [1.0000], [0.6016], [0.4668], [0.5547], [0.5000], [0.4004], [1.0000], [0.2500], [0.1426], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002410888671875 loss: 0.001861572265625 loss: 0.00433349609375 loss: 0.0030975341796875 predicted value: tensor([[0.5195], [0.4961], [0.4824], [1.1094], [0.7188], [0.5117], [0.3828], [0.7148], [0.8516], [0.4512], [0.4238], [0.6797], [0.4629], [0.3652], [0.2773], [0.2598]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.2051], [0.3750], [1.0000], [0.5547], [0.4668], [0.3340], [0.7500], [0.8008], [0.3340], [0.2002], [0.6016], [0.5000], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00154876708984375 loss: 0.0040283203125loss: 0.0030975341796875 loss: 0.00194549560546875 predicted value: tensor([[0.5469], [1.0625], [1.0859], [0.4805], [0.7734], [0.5117], [0.4707], [0.7031], [0.5977], [0.4922], [0.5508], [0.4902], [0.4824], [0.2656], [0.5078], [0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [1.0000], [0.4668], [0.4668], [0.6680], [0.7500], [0.7500], [0.5547], [0.4668], [0.5000], [0.4004], [0.4004], [0.1670], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029144287109375 loss: 0.00469970703125 
loss: 0.0037689208984375 loss: 0.00168609619140625 predicted value: tensor([[1.0859], [0.4492], [0.4551], [0.4863], [0.5625], [1.0859], [0.5039], [0.6875], [0.3594], [0.6094], [0.7344], [0.6367], [0.4863], [0.4746], [0.2832], [0.4141]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.3750], [0.4668], [1.0000], [0.4668], [0.6016], [0.3340], [0.5000], [0.6016], [0.5000], [0.4004], [0.3340], [0.2002], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002685546875 loss: 0.003082275390625 loss: 0.0015106201171875 loss: 0.002105712890625 30%|██▉ | 146/492 [1:17:52<3:01:39, 31.50s/it] {'loss': 0.0112, 'learning_rate': 1e-05, 'epoch': 0.3} 30%|██▉ | 146/492 [1:17:52<3:01:39, 31.50s/it]predicted value: tensor([[0.7734], [0.7227], [0.9961], [0.3906], [0.4316], [0.4121], [0.4277], [0.1875], [0.5703], [0.4180], [0.4883], [0.3613], [0.5156], [0.3867], [0.1289], [0.3730]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.8008], [1.0000], [0.4668], [0.4668], [0.4668], [0.4668], [0.2500], [0.5000], [0.5000], [0.6016], [0.6016], [0.7500], [0.5000], [0.2002], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00160980224609375 loss: 0.003173828125 loss: 0.0015716552734375 loss: 0.0012054443359375 predicted value: tensor([[0.2539], [0.4883], [0.7773], [0.3320], [0.9570], [0.4727], [0.9531], [0.6172], [0.0659], [1.0156], [0.4219], [0.3750], [0.4238], [0.5547], [0.1865], [0.2021]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.4668], [0.8008], [0.4668], [1.0000], [0.6016], [1.0000], [0.6016], [0.0625], [1.0000], [0.4004], [0.3340], [0.5000], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00193023681640625 loss: 0.00079345703125 loss: 0.00168609619140625 loss: 0.0029144287109375 predicted value: tensor([[0.7422], [0.4082], [0.5977], [0.6992], [0.4355], [0.9609], [0.3809], [1.0000], [0.3574], [0.6172], [0.7461], 
[0.5703], [0.3086], [0.4121], [0.1777], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.6172], [0.5547], [0.8008], [0.4668], [1.0000], [0.4668], [1.0000], [0.3750], [0.6016], [0.6680], [0.7500], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002044677734375 loss: 0.0019683837890625 loss: 0.00244140625 loss: 0.0020599365234375 predicted value: tensor([[0.4160], [0.1768], [0.4941], [0.9727], [0.4844], [0.2656], [0.7930], [0.0874], [0.3242], [0.4766], [0.9219], [0.5547], [0.6445], [0.3926], [0.1484], [0.1855]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2002], [0.8008], [1.0000], [0.5547], [0.2500], [0.8320], [0.2500], [0.2715], [0.6016], [1.0000], [0.7500], [0.7500], [0.6016], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00579833984375 loss: 0.00311279296875 loss: 0.004150390625 loss: 0.00115203857421875 30%|██▉ | 147/492 [1:18:23<3:00:07, 31.32s/it] {'loss': 0.0094, 'learning_rate': 1e-05, 'epoch': 0.3} 30%|██▉ | 147/492 [1:18:23<3:00:07, 31.32s/it]predicted value: tensor([[0.3359], [0.8047], [0.5469], [0.2715], [0.4043], [0.9922], [0.6562], [0.3984], [0.9609], [0.7617], [0.9805], [0.5938], [0.3516], [0.3047], [0.1719], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.8008], [0.2002], [0.4668], [1.0000], [0.6680], [0.4668], [1.0000], [0.6680], [1.0000], [0.7500], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00154876708984375 loss: 0.0019989013671875loss: 0.0025482177734375 loss: 0.0030059814453125 predicted value: tensor([[0.9805], [0.3848], [0.3594], [0.7383], [0.1992], [0.3027], [0.3477], [0.2578], [0.5078], [0.5000], [0.6992], [0.3711], [0.5820], [0.3477], [0.2100], [0.1631]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3750], [0.8320], [0.2002], [0.2002], [0.4668], 
[0.2500], [0.5000], [0.6016], [0.8008], [0.5000], [0.6016], [0.5000], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0035552978515625 loss: 0.001861572265625 loss: 0.00159454345703125 loss: 0.005096435546875 predicted value: tensor([[0.4082], [0.7578], [0.3105], [0.2139], [0.6484], [0.3594], [0.9492], [0.5664], [0.6992], [0.5938], [0.6562], [0.3594], [0.3281], [0.3809], [0.1953], [0.1680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8555], [0.4668], [0.3340], [0.8008], [0.4668], [1.0000], [0.5000], [0.8008], [0.6016], [0.6680], [0.4004], [0.4004], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000492095947265625 loss: 0.001983642578125loss: 0.0022430419921875 loss: 0.00115966796875 predicted value: tensor([[0.4043], [0.9766], [0.5508], [0.3887], [0.1553], [0.4707], [0.9805], [0.9648], [0.9297], [0.6797], [0.5117], [0.4180], [0.1865], [0.3477], [0.2012], [0.2393]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.5547], [0.3750], [0.2500], [0.5547], [1.0000], [1.0000], [1.0000], [0.8008], [0.4668], [0.5000], [0.2002], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005340576171875 loss: 0.0052490234375 loss: 0.0023651123046875 loss: 0.0008392333984375 30%|███ | 148/492 [1:18:55<3:00:13, 31.44s/it] {'loss': 0.0102, 'learning_rate': 1e-05, 'epoch': 0.3} 30%|███ | 148/492 [1:18:55<3:00:13, 31.44s/it]predicted value: tensor([[1.0781], [0.5391], [0.5586], [0.5039], [0.5234], [0.4785], [0.6367], [0.6641], [0.4570], [1.0781], [0.7695], [0.5352], [0.6289], [0.3105], [0.2617], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.4668], [0.4668], [0.3750], [0.5000], [0.6016], [0.3750], [1.0000], [0.6680], [0.5000], [0.7500], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021209716796875 loss: 0.00179290771484375 loss: 
0.001800537109375 loss: 0.004547119140625 predicted value: tensor([[0.6133], [0.7773], [0.7852], [0.5078], [0.7148], [0.7812], [0.7422], [0.3750], [1.0703], [1.0781], [0.5625], [0.4512], [0.4902], [0.5156], [0.2676], [0.3184]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6680], [0.8008], [0.4668], [0.8008], [0.7500], [0.7500], [0.2002], [1.0000], [1.0000], [0.5000], [0.5000], [0.3340], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002777099609375 loss: 0.0019683837890625 loss: 0.0030517578125 loss: 0.00494384765625 predicted value: tensor([[1.0703], [1.0781], [1.0391], [0.5273], [1.0859], [0.2520], [0.8203], [0.6914], [0.7305], [0.6367], [0.4609], [0.4590], [0.5117], [0.6484], [0.3340], [0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [1.0000], [0.4668], [1.0000], [0.2002], [0.8008], [0.6016], [0.6016], [0.8008], [0.4004], [0.3340], [0.5000], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037841796875 loss: 0.0023193359375loss: 0.002166748046875 loss: 0.0018157958984375 predicted value: tensor([[1.0938], [1.1016], [0.6289], [0.5781], [0.5703], [0.5391], [0.3945], [0.5820], [0.7695], [0.3203], [0.8203], [0.5312], [0.6328], [0.3848], [0.3281], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.5547], [0.4648], [0.5547], [0.4668], [0.3340], [0.3750], [0.8008], [0.2002], [0.8008], [0.3340], [0.5000], [0.3340], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0017547607421875 loss: 0.0028228759765625 loss: 0.0029754638671875 loss: 0.00335693359375 30%|███ | 149/492 [1:19:26<2:58:55, 31.30s/it] {'loss': 0.011, 'learning_rate': 1e-05, 'epoch': 0.3} 30%|███ | 149/492 [1:19:26<2:58:55, 31.30s/it]predicted value: tensor([[1.0781], [0.5977], [0.5391], [0.5312], [1.1094], [1.0703], [0.7383], [0.5625], [0.3652], [0.5586], [0.6641], [0.7109], [0.8008], 
[0.5156], [0.4492], [0.2246]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.4668], [1.0000], [1.0000], [0.6016], [0.5000], [0.3340], [0.2500], [0.7500], [0.6680], [0.6680], [0.4004], [0.4004], [0.0278]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032958984375 loss: 0.00390625 loss: 0.0018157958984375 loss: 0.0025482177734375 predicted value: tensor([[0.8164], [0.5508], [0.4746], [0.5234], [1.0625], [0.3750], [0.8750], [1.0781], [0.4980], [0.7266], [1.0312], [0.7422], [0.4199], [0.4648], [1.0078], [0.3027]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.4668], [0.4668], [1.0000], [0.2002], [0.8008], [1.0000], [0.2002], [0.6016], [1.0000], [0.5000], [0.3340], [0.3340], [1.0000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00408935546875 loss: 0.0038604736328125loss: 0.00133514404296875 loss: 0.00151824951171875 predicted value: tensor([[0.6055], [0.5078], [0.3789], [1.0859], [0.3848], [0.2949], [0.5586], [0.4863], [0.5938], [0.7266], [0.6914], [1.0547], [0.4902], [0.4414], [0.3125], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.2500], [1.0000], [0.2002], [0.2002], [0.4668], [0.3750], [0.5000], [0.7500], [0.6016], [1.0000], [0.4004], [0.5000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0035858154296875 loss: 0.0022430419921875 loss: 0.002197265625 loss: 0.00341796875 predicted value: tensor([[0.7734], [1.0234], [1.0938], [0.7461], [1.0781], [1.0859], [0.4297], [0.7344], [0.6875], [0.6289], [0.3652], [0.4629], [0.5586], [0.2285], [0.2773], [0.2891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [1.0000], [1.0000], [0.8008], [1.0000], [1.0000], [0.3340], [0.8320], [0.6016], [0.7500], [0.2500], [0.3340], [0.5000], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002471923828125 loss: 0.00238037109375 
loss: 0.0019683837890625 loss: 0.0022430419921875 30%|███ | 150/492 [1:19:57<2:58:09, 31.26s/it] {'loss': 0.0107, 'learning_rate': 1e-05, 'epoch': 0.3} 30%|███ | 150/492 [1:19:57<2:58:09, 31.26s/it]predicted value: tensor([[0.5938], [0.2363], [0.5977], [0.2090], [0.6211], [0.7383], [0.9141], [0.9648], [0.5586], [0.3770], [0.3398], [0.4453], [0.3242], [0.1758], [0.1631], [0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7344], [0.2002], [0.6680], [0.2500], [0.6680], [0.8008], [1.0000], [1.0000], [0.7500], [0.3340], [0.4004], [0.2500], [0.4004], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00142669677734375 loss: 0.003265380859375 loss: 0.0020599365234375 loss: 0.0033416748046875 predicted value: tensor([[0.7148], [0.7656], [0.4414], [0.2461], [0.7070], [0.3594], [0.7383], [0.5820], [0.2676], [0.6289], [0.4199], [0.4785], [0.3809], [0.1699], [0.2100], [0.1426]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8320], [0.4668], [0.2500], [0.8008], [0.3750], [0.8008], [0.7500], [0.3340], [0.6016], [0.5000], [0.5000], [0.4004], [0.0278], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00128173828125 loss: 0.00150299072265625loss: 0.00390625 loss: 0.00171661376953125 predicted value: tensor([[0.4082], [0.9688], [0.6992], [0.3965], [0.7383], [0.7969], [0.4941], [0.4551], [0.3945], [0.9609], [0.4141], [0.3242], [0.3438], [0.8711], [0.1797], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8008], [0.3750], [0.8008], [0.8320], [0.6016], [0.3750], [0.3750], [1.0000], [0.4004], [0.4004], [0.4004], [1.0000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00311279296875 loss: 0.00109100341796875loss: 0.0026702880859375 loss: 0.001312255859375 predicted value: tensor([[0.4238], [0.3496], [0.7344], [0.5195], [0.5195], [0.5586], [0.2676], [0.9492], [0.5312], [0.3398], [0.2119], [0.4219], 
[0.3574], [0.3066], [0.1748], [0.2090]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3750], [0.8008], [0.7500], [0.5547], [0.6016], [0.3340], [1.0000], [0.6016], [0.3340], [0.2500], [0.4668], [0.3340], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00125885009765625 loss: 0.00139617919921875 loss: 0.0021209716796875 loss: 0.0009613037109375 31%|███ | 151/492 [1:20:29<2:59:29, 31.58s/it] {'loss': 0.0081, 'learning_rate': 1e-05, 'epoch': 0.31} 31%|███ | 151/492 [1:20:29<2:59:29, 31.58s/it]predicted value: tensor([[1.0469], [0.6602], [0.4258], [0.4668], [0.7383], [0.3086], [0.6094], [0.5000], [0.3652], [0.5352], [0.6016], [0.5586], [0.3184], [0.1309], [0.1533], [0.1631]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.4668], [0.4668], [0.8008], [0.3340], [0.6016], [0.6016], [0.3340], [0.6016], [0.6016], [0.7500], [0.4004], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00244140625 loss: 0.00147247314453125 loss: 0.00189971923828125 loss: 0.00148773193359375 predicted value: tensor([[0.4336], [0.4570], [0.5039], [0.3223], [0.4004], [0.2412], [0.3301], [0.7383], [0.6602], [0.1807], [0.6602], [0.2227], [0.3223], [0.1680], [0.1680], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.5547], [0.5000], [0.3340], [0.4668], [0.3340], [0.3750], [0.8320], [0.7500], [0.2500], [0.8008], [0.0400], [0.4004], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00152587890625 loss: 0.0023956298828125 loss: 0.00323486328125 loss: 0.00122833251953125 predicted value: tensor([[0.5625], [0.7305], [0.9336], [0.2402], [0.2793], [0.7227], [0.6523], [0.3711], [0.5391], [0.6445], [0.3457], [0.4688], [0.3750], [0.1924], [0.1611], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [1.0000], [0.3340], [0.2500], [0.8320], [0.6680], 
[0.2500], [0.6016], [0.6016], [0.4004], [0.3750], [0.2852], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011749267578125 loss: 0.00125885009765625 loss: 0.00128173828125 loss: 0.0023040771484375 predicted value: tensor([[0.6445], [0.4180], [0.4062], [0.9297], [0.4746], [0.9688], [0.3164], [0.7383], [0.3359], [0.3711], [0.3496], [0.3516], [0.3848], [0.3418], [0.2119], [0.3848]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.3750], [1.0000], [0.8008], [1.0000], [0.2500], [0.8008], [0.5000], [0.3340], [0.3340], [0.4004], [0.4004], [0.3340], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029754638671875 loss: 0.0015106201171875 loss: 0.0031890869140625 loss: 0.0028076171875 31%|███ | 152/492 [1:21:01<2:59:34, 31.69s/it] {'loss': 0.008, 'learning_rate': 1e-05, 'epoch': 0.31} 31%|███ | 152/492 [1:21:01<2:59:34, 31.69s/it]predicted value: tensor([[0.8047], [0.5547], [1.0234], [0.8203], [0.8398], [0.6680], [1.0469], [1.0391], [0.3965], [0.4609], [0.7383], [0.5391], [0.5742], [0.2695], [0.2617], [0.3066]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [1.0000], [0.8008], [0.8008], [0.6016], [1.0000], [1.0000], [0.2500], [0.5000], [0.6016], [0.4004], [0.4004], [0.1670], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024871826171875 loss: 0.00677490234375 loss: 0.002197265625 loss: 0.0042724609375 predicted value: tensor([[0.6016], [0.8555], [0.8203], [0.4082], [0.6328], [0.8164], [0.3438], [0.8320], [0.5195], [0.6250], [1.0391], [0.4707], [0.5156], [0.2422], [0.3145], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.8008], [0.2500], [0.5000], [0.6680], [0.2002], [0.7148], [0.3145], [0.5000], [1.0000], [0.3340], [0.5000], [0.1670], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005615234375 loss: 0.003021240234375 loss: 
0.0023345947265625 loss: 0.0023193359375 predicted value: tensor([[1.0938], [0.4766], [0.5430], [0.8359], [0.7969], [0.7930], [1.0547], [0.3594], [0.2559], [0.4492], [0.4766], [0.5078], [0.3008], [0.3242], [0.2598], [0.2930]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.8008], [0.7500], [0.6016], [1.0000], [0.2500], [0.2500], [0.3340], [0.3340], [0.4004], [0.0278], [0.0400], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0050048828125 loss: 0.004364013671875 loss: 0.00335693359375 loss: 0.005401611328125 predicted value: tensor([[0.6523], [0.5273], [0.4922], [0.5781], [1.0625], [1.0859], [0.6836], [0.5312], [0.4180], [1.0391], [0.5039], [0.5156], [0.2520], [0.3086], [0.2832], [0.4609]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], [0.4668], [1.0000], [1.0000], [0.8008], [0.5000], [0.3340], [1.0000], [0.4004], [0.5000], [0.2500], [0.2500], [0.2002], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012664794921875 loss: 0.005706787109375 loss: 0.0032196044921875 loss: 0.00160980224609375 31%|███ | 153/492 [1:21:32<2:58:30, 31.59s/it] {'loss': 0.0147, 'learning_rate': 1e-05, 'epoch': 0.31} 31%|███ | 153/492 [1:21:32<2:58:30, 31.59s/it]predicted value: tensor([[0.6445], [0.5391], [0.5430], [0.8672], [0.8164], [1.0469], [0.8125], [0.7188], [0.3281], [0.7070], [0.5352], [0.4375], [0.2812], [0.2676], [0.2344], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.3750], [0.8008], [0.6680], [1.0000], [0.8008], [0.6016], [0.2002], [0.7500], [0.5000], [0.2852], [0.2002], [0.2500], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00323486328125 loss: 0.0023040771484375 loss: 0.004791259765625 loss: 0.0022430419921875 predicted value: tensor([[0.9805], [0.9844], [0.4941], [0.6719], [1.0625], [0.3496], [0.8047], [0.6719], [0.6367], [0.6172], [0.4961], [0.5547], 
[0.4512], [0.2812], [0.2520], [0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.8320], [0.3750], [0.5547], [1.0000], [0.2500], [0.6680], [0.6016], [0.6016], [0.5000], [0.4668], [0.5000], [0.3340], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00153350830078125 loss: 0.0022430419921875loss: 0.0012359619140625 loss: 0.0020294189453125 predicted value: tensor([[0.6016], [1.0859], [0.3809], [0.6523], [0.8516], [0.5625], [0.6211], [0.8164], [0.6719], [0.6914], [0.3594], [0.6133], [0.5117], [0.2598], [0.3418], [0.2197]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.2500], [0.5547], [0.8008], [0.2500], [0.4668], [0.6680], [0.8320], [0.6016], [0.2500], [0.3340], [0.4004], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019073486328125 loss: 0.00421142578125 loss: 0.00537109375 loss: 0.005889892578125 predicted value: tensor([[0.8672], [0.7656], [1.0625], [0.4980], [0.8320], [0.2676], [0.4961], [1.1016], [0.6875], [0.7344], [1.0156], [0.4570], [0.7266], [0.4238], [0.2334], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [1.0000], [0.4668], [0.8008], [0.2500], [0.3750], [1.0000], [0.5000], [0.6016], [1.0000], [0.3340], [0.6016], [0.3340], [0.1426], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001495361328125 loss: 0.0016326904296875 loss: 0.002716064453125 loss: 0.0037689208984375 31%|███▏ | 154/492 [1:22:04<2:57:43, 31.55s/it] {'loss': 0.0117, 'learning_rate': 1e-05, 'epoch': 0.31} 31%|███▏ | 154/492 [1:22:04<2:57:43, 31.55s/it]predicted value: tensor([[0.5156], [0.7305], [0.4980], [0.6641], [0.9844], [0.4980], [0.3438], [0.4141], [0.3340], [0.9727], [0.5938], [0.2637], [0.1445], [0.4160], [0.1504], [0.1436]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.4648], [0.5547], [1.0000], [0.4668], [0.4668], [0.4668], 
[0.5000], [1.0000], [0.6016], [0.2002], [0.2002], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00162506103515625 loss: 0.001678466796875loss: 0.00110626220703125 loss: 0.00103759765625 predicted value: tensor([[0.4512], [0.2578], [0.6602], [1.0078], [0.9570], [0.6484], [0.7695], [0.6367], [0.5469], [0.9727], [0.2969], [0.4297], [0.3809], [0.1455], [0.3105], [0.1465]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [0.8008], [1.0000], [1.0000], [0.6680], [0.8008], [0.8008], [0.6016], [1.0000], [0.4004], [0.2500], [0.3340], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001220703125 loss: 0.00055694580078125 loss: 0.0024566650390625 loss: 0.001983642578125 predicted value: tensor([[0.4316], [0.4141], [0.3340], [0.7500], [0.7344], [0.4727], [0.4980], [0.2012], [0.9375], [0.3418], [0.4668], [0.6367], [0.3691], [0.4414], [0.1699], [0.1465]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.8008], [0.8008], [0.5547], [0.4668], [0.2002], [1.0000], [0.2500], [0.6016], [0.6016], [0.5000], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019683837890625 loss: 0.00131988525390625loss: 0.00125885009765625 loss: 0.0037841796875 predicted value: tensor([[0.3574], [0.3809], [0.9648], [0.3945], [0.9219], [0.5391], [0.5547], [0.6641], [0.4297], [0.1631], [0.3770], [0.3398], [0.3906], [0.3203], [0.0811], [0.1367]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.4668], [1.0000], [0.5547], [0.7500], [0.7500], [0.6016], [0.2500], [0.4668], [0.3340], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025177001953125 loss: 0.00177001953125 loss: 0.00238037109375 loss: 0.0022125244140625 32%|███▏ | 155/492 [1:22:35<2:56:51, 31.49s/it] {'loss': 0.0072, 'learning_rate': 1e-05, 'epoch': 0.32} 32%|███▏ | 155/492 
[1:22:35<2:56:51, 31.49s/it]predicted value: tensor([[0.4434], [0.6172], [0.6133], [0.4395], [0.9883], [0.2383], [0.2734], [0.6641], [0.6758], [0.6094], [0.9961], [0.3730], [0.3789], [0.3652], [0.3438], [0.1357]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5312], [0.5547], [0.4668], [0.4668], [1.0000], [0.2002], [0.3340], [0.8008], [0.7500], [0.6016], [1.0000], [0.4004], [0.4004], [0.5000], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00116729736328125 loss: 0.00154876708984375 loss: 0.0019683837890625 loss: 0.0008087158203125 predicted value: tensor([[0.5234], [0.6914], [0.7148], [0.4570], [0.2363], [0.9727], [0.5156], [0.9727], [0.9844], [0.6523], [0.5234], [0.3398], [0.5352], [0.1875], [0.1660], [0.1318]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.8008], [0.7500], [0.4668], [0.3340], [1.0000], [0.4668], [1.0000], [1.0000], [0.6016], [0.6016], [0.3340], [0.3340], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00311279296875 loss: 0.00135040283203125 loss: 0.0013885498046875 loss: 0.002044677734375 predicted value: tensor([[0.5117], [0.9492], [0.9883], [0.1865], [0.1836], [0.8359], [0.3848], [0.9609], [0.6719], [0.6914], [0.6602], [0.3086], [0.3516], [0.2930], [0.2617], [0.4961]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.2500], [0.3340], [0.8320], [0.3750], [1.0000], [0.7500], [0.8008], [0.4668], [0.6016], [0.4004], [0.4004], [0.2852], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00102996826171875 loss: 0.00335693359375 loss: 0.0013885498046875 loss: 0.002197265625 predicted value: tensor([[0.9844], [0.7695], [0.8320], [0.5039], [0.3594], [0.3711], [0.6445], [0.6719], [0.6758], [0.4766], [0.6211], [0.3359], [0.1240], [0.4492], [0.2949], [0.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.8320], [0.4648], [0.4668], 
[0.3145], [0.7500], [0.7500], [0.5000], [0.6016], [0.5547], [0.4004], [0.4004], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001983642578125 loss: 0.001922607421875 loss: 0.0028839111328125loss: 0.00121307373046875 32%|███▏ | 156/492 [1:23:07<2:56:36, 31.54s/it] {'loss': 0.0073, 'learning_rate': 1e-05, 'epoch': 0.32} 32%|███▏ | 156/492 [1:23:07<2:56:36, 31.54s/it]predicted value: tensor([[1.1562], [0.5312], [1.0781], [0.4746], [1.0859], [0.5547], [0.2891], [0.3320], [1.0547], [0.6758], [0.4707], [0.5039], [0.5977], [1.0781], [0.2793], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.4668], [1.0000], [0.4668], [0.2500], [0.3340], [1.0000], [0.6016], [0.4004], [0.5000], [0.6016], [1.0000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016326904296875 loss: 0.00115966796875 loss: 0.00147247314453125 loss: 0.00170135498046875 predicted value: tensor([[0.7656], [0.5820], [0.6367], [0.7109], [0.5977], [0.4824], [1.0781], [0.2969], [0.5781], [0.7344], [0.4805], [0.4766], [0.5391], [0.3008], [0.2344], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.4668], [0.5547], [0.5547], [0.4668], [0.4668], [1.0000], [0.2500], [0.3750], [0.7500], [0.5000], [0.4004], [0.5000], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0038299560546875 loss: 0.0020751953125loss: 0.002105712890625 loss: 0.005035400390625 predicted value: tensor([[1.0625], [0.9102], [0.7891], [1.0938], [0.9258], [1.0938], [0.8320], [1.1250], [0.5625], [0.5977], [0.5664], [0.5469], [0.4551], [0.4824], [0.2598], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.6680], [1.0000], [0.8320], [1.0000], [0.7500], [1.0000], [0.5000], [0.3340], [0.5000], [0.4004], [0.4004], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004608154296875 loss: 
0.0029449462890625loss: 0.00933837890625 loss: 0.0030364990234375 predicted value: tensor([[0.4395], [0.4980], [0.6875], [0.5312], [0.6289], [1.1250], [0.3438], [1.1094], [1.0859], [0.5430], [0.8398], [0.4082], [0.4199], [0.4336], [0.5586], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3145], [0.3750], [0.5547], [0.4668], [0.6016], [1.0000], [0.3340], [1.0000], [1.0000], [0.6016], [0.8008], [0.5000], [0.2852], [0.5000], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0035552978515625 loss: 0.00185394287109375 loss: 0.0028839111328125 loss: 0.0019989013671875 32%|███▏ | 157/492 [1:23:38<2:56:15, 31.57s/it] {'loss': 0.0123, 'learning_rate': 1e-05, 'epoch': 0.32} 32%|███▏ | 157/492 [1:23:38<2:56:15, 31.57s/it]predicted value: tensor([[0.6289], [0.5391], [0.7852], [1.1016], [0.7422], [0.8203], [0.4316], [1.0547], [0.7461], [0.7852], [0.6562], [0.3652], [0.4590], [0.2832], [0.2441], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8320], [1.0000], [0.6680], [0.8008], [0.3340], [1.0000], [0.6680], [0.7500], [0.6016], [0.6016], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001556396484375 loss: 0.001953125loss: 0.00176239013671875 loss: 0.0015869140625 predicted value: tensor([[0.8086], [0.4785], [0.4297], [0.4375], [1.0703], [1.0781], [0.4707], [1.0547], [0.8711], [0.5820], [0.6211], [0.5508], [0.5312], [0.5312], [0.2500], [0.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.1670], [0.4668], [1.0000], [1.0000], [0.4668], [1.0000], [0.8320], [0.5000], [0.5000], [0.2002], [0.3340], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00433349609375 loss: 0.004302978515625 loss: 0.0010833740234375 loss: 0.0016021728515625 predicted value: tensor([[0.6875], [0.7891], [0.3398], [1.0938], [0.7930], [1.1016], [0.7461], [0.7227], [0.7266], [0.7070], 
[0.7266], [0.5898], [0.5117], [0.3008], [0.2695], [0.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.8008], [0.2002], [1.0000], [0.3750], [1.0000], [0.8008], [0.7500], [0.6016], [0.6016], [0.7500], [0.6016], [0.5000], [0.1670], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002197265625 loss: 0.00439453125loss: 0.0035247802734375 loss: 0.002197265625 predicted value: tensor([[0.6289], [0.5625], [0.6289], [0.9141], [1.0859], [0.4863], [0.7422], [1.0547], [0.4141], [0.3691], [0.7891], [0.4863], [0.3574], [1.0781], [0.2676], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.5547], [0.8320], [1.0000], [0.3750], [0.6680], [1.0000], [0.3750], [0.3340], [0.8008], [0.4004], [0.2500], [1.0000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020294189453125 loss: 0.0030059814453125 loss: 0.00185394287109375 loss: 0.0022125244140625 32%|███▏ | 158/492 [1:24:09<2:54:37, 31.37s/it] {'loss': 0.0099, 'learning_rate': 1e-05, 'epoch': 0.32} 32%|███▏ | 158/492 [1:24:09<2:54:37, 31.37s/it]predicted value: tensor([[0.4941], [0.9648], [0.9531], [0.9414], [0.9336], [0.6094], [0.3340], [0.4922], [0.3496], [0.1963], [0.3164], [0.5156], [0.4297], [0.2266], [0.1523], [0.1914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [1.0000], [1.0000], [1.0000], [1.0000], [0.5547], [0.3340], [0.8008], [0.4668], [0.3340], [0.3340], [0.6016], [0.5000], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001251220703125 loss: 0.00299072265625loss: 0.001678466796875 loss: 0.00162506103515625 predicted value: tensor([[0.7539], [0.9492], [0.6992], [0.9648], [0.6211], [0.5078], [0.6172], [0.4160], [0.4570], [0.2178], [0.4766], [0.4238], [0.4746], [0.4336], [0.1348], [0.1338]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.8008], [1.0000], [0.8008], [0.5000], 
[0.6016], [0.4668], [0.3750], [0.3340], [0.6016], [0.5000], [0.8008], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001373291015625 loss: 0.003326416015625loss: 0.00145721435546875 loss: 0.000904083251953125 predicted value: tensor([[0.7070], [0.9609], [0.6484], [0.2139], [0.3301], [0.3496], [0.9648], [0.2578], [0.6094], [0.3770], [0.3711], [0.5352], [0.3359], [0.3555], [0.2246], [0.4668]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.8008], [0.2002], [0.4668], [0.4668], [1.0000], [0.2500], [0.7500], [0.4668], [0.4004], [0.6016], [0.1670], [0.3340], [0.2002], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00194549560546875 loss: 0.0021514892578125loss: 0.0022735595703125 loss: 0.0023345947265625 predicted value: tensor([[0.9570], [0.9609], [0.3555], [1.0000], [0.3828], [0.9492], [0.4590], [0.6953], [0.9570], [0.4746], [0.2217], [0.5508], [0.3398], [0.0488], [0.1826], [0.1504]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [1.0000], [0.3145], [1.0000], [0.3750], [0.6680], [1.0000], [0.4668], [0.5000], [0.6016], [0.4004], [0.0625], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00156402587890625 loss: 0.002685546875 loss: 0.00193023681640625 loss: 0.000885009765625 32%|███▏ | 159/492 [1:24:41<2:54:00, 31.35s/it] {'loss': 0.0076, 'learning_rate': 1e-05, 'epoch': 0.32} 32%|███▏ | 159/492 [1:24:41<2:54:00, 31.35s/it]predicted value: tensor([[0.4277], [0.6719], [0.7188], [0.4199], [0.5898], [0.4199], [0.2432], [0.3594], [0.4180], [0.9297], [0.0884], [0.3574], [0.4395], [0.3438], [0.1953], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.8320], [0.3750], [0.7500], [0.3750], [0.3340], [0.3750], [0.3750], [1.0000], [0.0278], [0.3340], [0.5000], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002593994140625 loss: 
0.0016937255859375loss: 0.002197265625 loss: 0.0010986328125 predicted value: tensor([[0.8164], [0.9414], [0.7812], [0.5469], [0.3809], [0.5664], [0.6484], [0.6211], [0.6172], [0.5781], [0.3926], [0.3633], [0.4277], [0.1865], [0.3535], [0.1973]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.8320], [0.5703], [0.4668], [0.5000], [0.4648], [0.6016], [0.6680], [0.5000], [0.4004], [0.4004], [0.4004], [0.2500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018310546875 loss: 0.00109100341796875loss: 0.00099945068359375 loss: 0.002777099609375 predicted value: tensor([[0.4297], [0.9570], [0.3906], [0.4512], [0.2021], [0.9609], [0.6797], [0.3164], [0.6641], [0.9414], [0.4531], [0.3984], [0.3750], [0.3672], [0.1650], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.4668], [0.2500], [1.0000], [0.6172], [0.2500], [0.7500], [1.0000], [0.3750], [0.4004], [0.3340], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024566650390625 loss: 0.000652313232421875 loss: 0.0021209716796875 loss: 0.00093841552734375 predicted value: tensor([[0.6953], [0.4082], [0.1807], [0.3828], [0.1021], [0.4277], [0.9688], [0.9297], [0.2207], [0.5039], [0.2256], [0.2012], [0.1670], [0.3672], [0.2969], [0.2158]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.2002], [0.4668], [0.0278], [0.4668], [1.0000], [1.0000], [0.3340], [0.5000], [0.2500], [0.2500], [0.0400], [0.5000], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115966796875 loss: 0.001251220703125 loss: 0.00148773193359375 loss: 0.0013275146484375 33%|███▎ | 160/492 [1:25:12<2:53:54, 31.43s/it] {'loss': 0.0064, 'learning_rate': 1e-05, 'epoch': 0.33} 33%|███▎ | 160/492 [1:25:12<2:53:54, 31.43s/it]predicted value: tensor([[0.4668], [0.5664], [0.2793], [0.3438], [0.7031], [0.5586], [0.6875], [0.3516], [0.5938], 
[0.3398], [0.5742], [1.0391], [0.5469], [0.4805], [0.2354], [0.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.2002], [0.2500], [0.6680], [0.4668], [0.7500], [0.2500], [0.4668], [0.2500], [0.3750], [1.0000], [0.5000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00162506103515625 loss: 0.0024566650390625 loss: 0.0022735595703125 loss: 0.00180816650390625 predicted value: tensor([[0.5391], [1.0625], [1.0547], [0.6328], [0.3438], [0.3926], [0.8281], [0.2930], [0.3066], [0.6133], [0.7031], [0.5000], [0.1885], [0.2754], [0.2715], [0.4727]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3145], [1.0000], [1.0000], [0.3750], [0.2500], [0.7500], [0.8008], [0.2500], [0.2500], [0.4277], [0.6016], [0.4004], [0.0400], [0.2002], [0.2002], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001983642578125 loss: 0.005828857421875loss: 0.0027618408203125 loss: 0.002044677734375 predicted value: tensor([[0.7031], [0.8164], [1.0547], [1.0625], [0.9023], [0.5312], [0.7695], [1.0469], [0.5547], [0.3223], [0.4863], [0.4180], [0.4961], [0.3965], [0.2891], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.7148], [1.0000], [1.0000], [0.8320], [0.4668], [0.8320], [1.0000], [0.5000], [0.2500], [0.4004], [0.3340], [0.4004], [0.3340], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001373291015625 loss: 0.0015106201171875loss: 0.0014801025390625 loss: 0.00274658203125 predicted value: tensor([[0.4688], [1.0234], [0.5078], [0.7617], [0.6133], [0.8281], [0.6758], [0.7773], [0.6211], [1.1094], [0.6367], [0.5391], [0.5312], [0.2520], [0.2559], [0.3086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [0.8320], [0.4668], [0.8320], [0.6016], [0.8008], [0.6016], [1.0000], [0.6016], [0.5000], [0.5000], [0.0400], [0.2002], [0.2002]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.00201416015625 loss: 0.00177001953125 loss: 0.001861572265625 loss: 0.0035552978515625 33%|███▎ | 161/492 [1:25:44<2:53:48, 31.50s/it] {'loss': 0.0093, 'learning_rate': 1e-05, 'epoch': 0.33} 33%|███▎ | 161/492 [1:25:44<2:53:48, 31.50s/it]predicted value: tensor([[0.6992], [0.7930], [1.0625], [0.2871], [0.8203], [0.7539], [0.6719], [0.2930], [1.0547], [0.6445], [0.5078], [0.6680], [0.5312], [0.4512], [0.2598], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.8320], [1.0000], [0.2500], [0.8008], [0.8008], [0.6016], [0.2500], [1.0000], [0.6016], [0.4004], [0.5000], [0.3340], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00130462646484375 loss: 0.00089263916015625 loss: 0.00174713134765625 loss: 0.0013427734375 predicted value: tensor([[0.6562], [0.5508], [0.7305], [1.0469], [0.6016], [0.5000], [0.3203], [0.7812], [0.5664], [1.0312], [0.3301], [0.5547], [0.3340], [0.5039], [0.2793], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.4668], [0.6680], [1.0000], [0.3340], [0.4668], [0.3340], [0.7500], [0.2500], [1.0000], [0.3340], [0.5000], [0.0278], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00335693359375 loss: 0.0042724609375loss: 0.004913330078125 loss: 0.00124359130859375 predicted value: tensor([[0.7930], [1.0469], [1.0234], [0.8555], [1.0938], [0.7891], [0.5977], [0.3340], [0.7070], [0.6484], [0.3516], [0.3027], [0.6992], [0.5234], [0.4961], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [1.0000], [0.8008], [1.0000], [0.8008], [0.6016], [0.3340], [0.8008], [0.6016], [0.3340], [0.2002], [0.7500], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027618408203125 loss: 0.000701904296875loss: 0.0020904541015625 loss: 0.00119781494140625 predicted value: tensor([[0.5352], [0.9141], [0.5547], [1.0703], 
[0.5000], [0.7852], [0.3066], [0.5820], [0.4824], [0.5234], [0.3594], [0.3301], [0.4512], [0.4688], [0.2930], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.7500], [1.0000], [0.7500], [0.8008], [0.2500], [0.5000], [0.4668], [0.5000], [0.2002], [0.2002], [0.3340], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00168609619140625 loss: 0.002593994140625 loss: 0.00173187255859375 loss: 0.003204345703125 33%|███▎ | 162/492 [1:26:16<2:54:15, 31.68s/it] {'loss': 0.0088, 'learning_rate': 1e-05, 'epoch': 0.33} 33%|███▎ | 162/492 [1:26:16<2:54:15, 31.68s/it]predicted value: tensor([[0.4121], [0.4043], [0.8164], [0.6445], [0.6758], [0.4980], [0.6719], [0.2754], [0.5078], [0.5234], [0.9492], [0.2500], [0.3086], [0.1084], [0.1973], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8320], [0.7500], [0.8008], [0.5000], [0.7500], [0.2500], [0.5000], [0.7500], [1.0000], [0.5000], [0.4004], [0.0400], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002655029296875 loss: 0.002716064453125 loss: 0.0023651123046875 loss: 0.002838134765625 predicted value: tensor([[0.4141], [0.2217], [0.9844], [0.4492], [0.6875], [0.4551], [0.5508], [0.2656], [0.8984], [0.2432], [0.5000], [0.5273], [0.4238], [0.0835], [0.4785], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [1.0000], [0.5547], [0.8008], [0.4668], [0.6016], [0.3340], [1.0000], [0.3340], [0.6016], [0.5000], [0.3340], [0.0278], [0.6016], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00323486328125 loss: 0.00171661376953125 loss: 0.002044677734375 loss: 0.00182342529296875 predicted value: tensor([[0.3906], [0.4277], [0.3047], [0.6758], [0.5039], [0.2148], [0.9492], [0.9531], [0.5039], [0.1924], [0.3691], [0.3730], [0.3711], [0.3281], [0.2852], [0.1582]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.4668], [0.4668], [0.2500], [0.6680], [0.5000], [0.3340], [1.0000], [1.0000], [0.6016], [0.2002], [0.2500], [0.5000], [0.4004], [0.5000], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00168609619140625 loss: 0.00189971923828125 loss: 0.0022430419921875 loss: 0.0012054443359375 predicted value: tensor([[0.6797], [0.1455], [0.2617], [0.2246], [0.9258], [0.4062], [0.3555], [0.7070], [0.2363], [0.6211], [0.5469], [0.9414], [0.3008], [0.1660], [0.3047], [0.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.2500], [0.3340], [0.2500], [1.0000], [0.4668], [0.6016], [0.8008], [0.3340], [0.7500], [0.5000], [1.0000], [0.3340], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037384033203125 loss: 0.0035858154296875 loss: 0.00201416015625 loss: 0.0029144287109375 33%|███▎ | 163/492 [1:26:48<2:53:25, 31.63s/it] {'loss': 0.0097, 'learning_rate': 1e-05, 'epoch': 0.33} 33%|███▎ | 163/492 [1:26:48<2:53:25, 31.63s/it]predicted value: tensor([[0.5508], [0.9258], [0.9805], [0.6250], [0.4082], [0.4395], [0.2148], [0.6484], [0.1699], [0.4590], [0.3730], [0.9258], [0.1875], [0.1846], [0.1914], [0.1533]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [1.0000], [1.0000], [0.6680], [0.4668], [0.7500], [0.2500], [0.6680], [0.3340], [0.5000], [0.5000], [1.0000], [0.2002], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001373291015625 loss: 0.002777099609375loss: 0.0025482177734375 loss: 0.0023345947265625 predicted value: tensor([[0.8359], [0.4297], [0.6953], [0.9297], [0.5469], [0.6797], [0.9297], [0.9688], [0.4141], [0.9297], [0.3184], [0.0874], [0.2197], [0.2500], [0.1709], [0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.8008], [1.0000], [0.8008], [0.6680], [1.0000], [1.0000], [0.5000], [1.0000], [0.3340], [0.0625], [0.2500], [0.2500], [0.2002], [0.1670]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.00115203857421875 loss: 0.001617431640625 loss: 0.000667572021484375 loss: 0.0015869140625 predicted value: tensor([[0.4102], [0.5898], [0.7188], [0.6953], [0.7031], [0.6602], [0.1855], [0.7188], [0.6602], [0.6055], [0.4355], [0.5195], [0.3633], [0.1523], [0.1465], [0.1768]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6016], [0.8320], [0.6680], [0.7148], [0.6680], [0.2500], [0.7148], [0.8008], [0.7500], [0.6680], [0.6016], [0.3340], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020599365234375 loss: 0.0022125244140625 loss: 0.00103759765625 loss: 0.00323486328125 predicted value: tensor([[0.4141], [0.2969], [0.4492], [0.4316], [0.3711], [0.9688], [0.9297], [0.6602], [0.6406], [0.4043], [0.9570], [0.4512], [0.4062], [0.4062], [0.1484], [0.1167]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2715], [0.4668], [0.4668], [0.4668], [1.0000], [1.0000], [0.8320], [0.8008], [0.4004], [1.0000], [0.4004], [0.5000], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00244140625 loss: 0.0028533935546875 loss: 0.0016326904296875 loss: 0.00146484375 33%|███▎ | 164/492 [1:27:19<2:53:12, 31.69s/it] {'loss': 0.0077, 'learning_rate': 1e-05, 'epoch': 0.33} 33%|███▎ | 164/492 [1:27:19<2:53:12, 31.69s/it]predicted value: tensor([[1.0547], [0.5352], [0.3398], [0.7734], [0.8828], [0.5391], [0.8711], [0.7266], [0.8086], [0.7852], [0.4941], [0.4844], [0.4941], [0.2617], [0.2578], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.2002], [0.6680], [0.8320], [0.4668], [0.8008], [0.8008], [0.4668], [0.8008], [0.4004], [0.3340], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018157958984375 loss: 0.00537109375 loss: 0.0037689208984375 loss: 0.0023193359375 predicted value: tensor([[0.6289], [0.7812], [0.6445], [0.3418], 
[1.0938], [0.6211], [0.7461], [0.7852], [0.4922], [0.7227], [0.4395], [0.6992], [0.4512], [0.4492], [0.2295], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4004], [0.8008], [0.5547], [0.2500], [1.0000], [0.4668], [0.8008], [0.7500], [0.4668], [0.5000], [0.3340], [0.6016], [0.5000], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0035247802734375 loss: 0.0030517578125loss: 0.0023956298828125 loss: 0.002410888671875 predicted value: tensor([[0.6602], [0.4824], [0.6680], [0.8828], [1.0312], [0.5117], [1.0312], [0.6953], [1.0391], [0.6289], [0.6445], [0.5898], [0.4492], [0.5156], [0.2598], [0.2441]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], [0.8008], [1.0000], [0.3750], [1.0000], [0.7500], [1.0000], [0.7500], [0.5000], [0.3750], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019683837890625 loss: 0.0030975341796875 loss: 0.00151824951171875 loss: 0.0042724609375 predicted value: tensor([[0.4727], [0.4922], [0.6289], [1.0234], [0.3516], [0.5195], [0.7578], [0.5273], [0.3164], [1.0469], [0.4922], [1.0156], [0.4609], [0.2402], [0.2676], [0.3965]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.5547], [1.0000], [0.2002], [0.4668], [0.7500], [0.4668], [0.2500], [1.0000], [0.3340], [1.0000], [0.5000], [0.2500], [0.2002], [0.0625]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020294189453125 loss: 0.0017242431640625 loss: 0.0020904541015625 loss: 0.002899169921875 34%|███▎ | 165/492 [1:27:51<2:52:50, 31.71s/it] {'loss': 0.0111, 'learning_rate': 1e-05, 'epoch': 0.34} 34%|███▎ | 165/492 [1:27:51<2:52:50, 31.71s/it]predicted value: tensor([[0.5586], [0.6328], [0.8086], [0.6992], [0.2480], [0.7031], [1.0547], [0.3086], [1.0703], [0.2734], [0.6523], [1.0391], [0.4766], [0.6562], [0.4238], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.5547], [0.5547], [0.8008], [0.6016], [0.2002], [0.7500], [1.0000], [0.2002], [1.0000], [0.3340], [0.6016], [1.0000], [0.4004], [0.7500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0038604736328125 loss: 0.00183868408203125 loss: 0.00101470947265625 loss: 0.004241943359375 predicted value: tensor([[0.5039], [0.8672], [0.5312], [0.5117], [0.5508], [0.5469], [1.0312], [1.0391], [0.3008], [0.2930], [0.6523], [0.6719], [0.2500], [0.2793], [0.2715], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.3750], [0.3750], [0.3750], [0.4668], [1.0000], [1.0000], [0.2500], [0.2002], [0.7500], [0.6016], [0.0400], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012359619140625 loss: 0.002593994140625 loss: 0.005584716796875 loss: 0.0021209716796875 predicted value: tensor([[0.8750], [0.8125], [0.7930], [0.4629], [0.3574], [1.0312], [0.5117], [0.6445], [0.6172], [0.3965], [0.5547], [0.5703], [0.4355], [0.5156], [0.2314], [0.2480]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.8008], [0.8008], [0.3750], [0.3340], [1.0000], [0.3750], [0.5000], [0.6016], [0.2002], [0.4004], [0.5000], [0.4004], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029449462890625 loss: 0.0019073486328125 loss: 0.0022430419921875 loss: 0.00567626953125 predicted value: tensor([[0.5234], [0.4727], [1.0625], [1.0625], [0.6680], [0.7695], [0.6484], [0.3594], [0.4609], [0.9062], [0.5000], [0.4746], [0.4883], [0.2637], [0.2432], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [1.0000], [0.5547], [0.6250], [0.6016], [0.2500], [0.4004], [0.8008], [0.2500], [0.5000], [0.4004], [0.2500], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024566650390625 loss: 0.0018463134765625 loss: 0.0034637451171875 loss: 0.0023956298828125 34%|███▎ | 166/492 
[1:28:22<2:51:36, 31.59s/it] {'loss': 0.0114, 'learning_rate': 1e-05, 'epoch': 0.34} 34%|███▎ | 166/492 [1:28:22<2:51:36, 31.59s/it]predicted value: tensor([[0.4824], [0.3789], [0.3945], [0.4043], [0.2363], [0.6094], [0.3828], [0.4316], [0.3242], [0.2988], [0.7266], [0.3652], [0.2676], [0.1299], [0.0972], [0.1455]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.4668], [0.3340], [0.7500], [0.4668], [0.2500], [0.3340], [0.2002], [0.8008], [0.4004], [0.2500], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021820068359375 loss: 0.00150299072265625 loss: 0.00201416015625 loss: 0.000972747802734375 predicted value: tensor([[0.7188], [0.7617], [0.9688], [0.9727], [0.4199], [0.5078], [0.3633], [0.1299], [0.6367], [0.1377], [0.9688], [0.4766], [0.5703], [0.1689], [0.3750], [0.1182]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6680], [1.0000], [1.0000], [0.4668], [0.4668], [0.4668], [0.2500], [0.8008], [0.2500], [1.0000], [0.6016], [0.7500], [0.2002], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001739501953125 loss: 0.001678466796875 loss: 0.00225830078125 loss: 0.002288818359375 predicted value: tensor([[0.7070], [0.2471], [0.7305], [0.2852], [0.3848], [0.9609], [0.9727], [0.5391], [0.2178], [0.5391], [0.3477], [0.2598], [0.5195], [0.3242], [0.1641], [0.0898]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.3340], [0.8320], [0.2500], [0.3516], [1.0000], [1.0000], [0.6016], [0.6016], [0.7500], [0.3340], [0.3340], [0.5000], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019378662109375 loss: 0.003814697265625loss: 0.000850677490234375 loss: 0.000782012939453125 predicted value: tensor([[0.9609], [0.4375], [0.1729], [0.6914], [0.4141], [0.5312], [0.9922], [0.7227], [0.9531], [0.3945], [0.6562], [0.5586], [0.3301], [0.3594], [0.1289], [0.1973]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.2500], [0.8008], [0.4668], [0.6016], [1.0000], [0.8008], [1.0000], [0.4004], [0.6016], [0.6016], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00244140625 loss: 0.00162506103515625 loss: 0.000919342041015625 loss: 0.0020904541015625 34%|███▍ | 167/492 [1:28:54<2:50:22, 31.45s/it] {'loss': 0.0073, 'learning_rate': 1e-05, 'epoch': 0.34} 34%|███▍ | 167/492 [1:28:54<2:50:22, 31.45s/it]predicted value: tensor([[0.5078], [0.5781], [0.4316], [0.6055], [0.1816], [0.7305], [0.5742], [0.6250], [0.2676], [0.4473], [0.1631], [0.4414], [0.3906], [0.3633], [0.1396], [0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.7500], [0.4004], [0.6680], [0.2002], [0.8320], [0.6016], [0.7500], [0.2500], [0.5000], [0.1670], [0.4004], [0.3340], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000885009765625 loss: 0.0010833740234375 loss: 0.0012054443359375 loss: 0.002685546875 predicted value: tensor([[0.4688], [0.9727], [0.9531], [0.9570], [0.7070], [0.3555], [0.9648], [0.6445], [0.5820], [0.5352], [0.9336], [0.9570], [0.1157], [0.3164], [0.1748], [0.1592]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [1.0000], [0.7148], [0.3340], [1.0000], [0.7500], [0.6016], [0.8008], [1.0000], [1.0000], [0.0400], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026397705078125 loss: 0.0017242431640625loss: 0.0019989013671875 loss: 0.0019073486328125 predicted value: tensor([[0.1514], [0.3242], [0.9258], [0.2100], [0.4043], [0.9844], [0.1143], [0.6133], [0.5586], [0.1650], [0.6758], [0.4590], [0.4492], [0.1533], [0.1226], [0.1729]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.3340], [1.0000], [0.2500], [0.4668], [1.0000], [0.2002], [0.7500], [0.6016], [0.2002], [0.6016], 
[0.4668], [0.5000], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019683837890625 loss: 0.00093841552734375 loss: 0.00157928466796875 loss: 0.00079345703125 predicted value: tensor([[0.6016], [0.4082], [0.4434], [0.4707], [0.6211], [0.9805], [0.6406], [0.9961], [0.6016], [0.3457], [0.5352], [0.4160], [0.1572], [0.3262], [0.3320], [0.1650]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], [0.5547], [0.7500], [1.0000], [0.6680], [1.0000], [0.6016], [0.4668], [0.7500], [0.5000], [0.0625], [0.3340], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001953125 loss: 0.0015716552734375 loss: 0.0018310546875 loss: 0.001617431640625 34%|███▍ | 168/492 [1:29:26<2:50:40, 31.61s/it] {'loss': 0.0066, 'learning_rate': 1e-05, 'epoch': 0.34} 34%|███▍ | 168/492 [1:29:26<2:50:40, 31.61s/it]predicted value: tensor([[0.6172], [0.5898], [0.2617], [0.2793], [0.8594], [0.6328], [0.3125], [0.2891], [0.4785], [0.3867], [0.7383], [0.6250], [0.6680], [0.2656], [0.2520], [0.2236]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.5547], [0.2500], [0.3340], [0.8008], [0.5000], [0.3340], [0.2002], [0.4668], [0.3340], [0.7500], [0.5000], [0.7500], [0.1670], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002960205078125 loss: 0.0024871826171875 loss: 0.00157928466796875 loss: 0.00168609619140625 predicted value: tensor([[0.5781], [0.6680], [0.7461], [1.0547], [0.7852], [0.8906], [0.5195], [0.6992], [1.0625], [0.4980], [0.6211], [0.6875], [0.3105], [0.2422], [0.2695], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.6680], [1.0000], [0.8008], [0.8008], [0.3750], [0.6016], [1.0000], [0.4668], [0.6016], [0.6016], [0.2500], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00160980224609375 loss: 0.0013885498046875loss: 0.0036773681640625 loss: 0.00225830078125 
predicted value: tensor([[0.4980], [0.4961], [0.5234], [0.6875], [0.6875], [0.8359], [0.3945], [1.0547], [0.1152], [0.9062], [0.3809], [0.2324], [0.5781], [0.5039], [0.5312], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.6016], [0.6016], [0.8008], [0.2500], [1.0000], [0.0278], [0.8008], [0.3340], [0.0278], [0.4004], [0.5000], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0036773681640625 loss: 0.0022125244140625 loss: 0.004241943359375 loss: 0.00189208984375 predicted value: tensor([[0.7773], [0.7070], [0.5078], [0.3398], [0.8828], [1.0703], [0.7148], [0.8398], [0.7734], [0.3789], [0.5156], [0.7305], [0.4570], [0.2754], [0.3008], [0.2432]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.5312], [0.3750], [0.2500], [0.8008], [1.0000], [0.6016], [0.8008], [0.8008], [0.2002], [0.3340], [0.6016], [0.3340], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026702880859375 loss: 0.00372314453125 loss: 0.003021240234375 loss: 0.003021240234375 34%|███▍ | 169/492 [1:29:57<2:50:35, 31.69s/it] {'loss': 0.0105, 'learning_rate': 1e-05, 'epoch': 0.34} 34%|███▍ | 169/492 [1:29:57<2:50:35, 31.69s/it]predicted value: tensor([[0.6133], [0.4160], [0.3242], [0.5117], [0.8672], [0.4609], [1.0156], [0.7852], [0.5508], [0.5664], [1.0938], [0.4434], [0.4883], [0.4980], [0.2930], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.3340], [0.4668], [0.8008], [0.4668], [1.0000], [0.8008], [0.4668], [0.8008], [1.0000], [0.2852], [0.4004], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005767822265625 loss: 0.004241943359375 loss: 0.0020751953125 loss: 0.001556396484375 predicted value: tensor([[0.5352], [0.5156], [0.3496], [0.8438], [0.4551], [0.4629], [0.7617], [0.3398], [0.4629], [0.2852], [0.6094], [1.1016], [0.5000], [0.2412], [0.2490], [0.2598]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.2500], [0.7500], [0.3750], [0.4668], [0.8008], [0.3340], [0.5000], [0.2500], [0.5000], [1.0000], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023345947265625 loss: 0.00146484375 loss: 0.0023193359375 loss: 0.00408935546875 predicted value: tensor([[0.5430], [0.8281], [0.4824], [1.0391], [0.2314], [0.5156], [0.8008], [0.8203], [1.0547], [1.0703], [0.2852], [0.4805], [0.6562], [0.3438], [0.3281], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.4668], [1.0000], [0.2002], [0.4668], [0.8008], [0.6680], [1.0000], [1.0000], [0.0625], [0.4004], [0.6016], [0.0278], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00189971923828125 loss: 0.004974365234375 loss: 0.0036163330078125 loss: 0.00311279296875 predicted value: tensor([[0.6367], [0.4805], [0.3008], [0.4688], [0.3438], [0.4961], [0.7852], [0.4668], [1.0469], [0.6641], [0.4531], [0.5625], [0.2539], [0.4531], [0.5039], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.3340], [0.4668], [0.2500], [0.4668], [0.6680], [0.3340], [1.0000], [0.7500], [0.3340], [0.5000], [0.1670], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00171661376953125 loss: 0.00250244140625 loss: 0.00179290771484375 loss: 0.0030059814453125 35%|███▍ | 170/492 [1:30:29<2:49:42, 31.62s/it] {'loss': 0.0116, 'learning_rate': 1e-05, 'epoch': 0.35} 35%|███▍ | 170/492 [1:30:29<2:49:42, 31.62s/it]predicted value: tensor([[0.9648], [0.9492], [0.6289], [0.7383], [0.5547], [0.7344], [0.5781], [0.5234], [0.9414], [0.9922], [0.3672], [0.7109], [0.2061], [0.3242], [0.3438], [0.1865]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.6680], [0.8008], [0.6680], [0.8008], [0.5547], [0.6016], [1.0000], [1.0000], [0.4004], [0.8008], 
[0.2002], [0.3340], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001495361328125 loss: 0.00141143798828125 loss: 0.00077056884765625 loss: 0.00124359130859375 predicted value: tensor([[0.4570], [0.4395], [0.7773], [0.3418], [0.9336], [0.3145], [0.7031], [0.3828], [0.9727], [0.9883], [0.4707], [0.9414], [0.4238], [0.1748], [0.1572], [0.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8320], [0.3750], [1.0000], [0.3750], [0.8008], [0.4668], [1.0000], [1.0000], [0.4668], [1.0000], [0.5000], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00125885009765625 loss: 0.00090789794921875 loss: 0.00091552734375 loss: 0.00139617919921875 predicted value: tensor([[0.5117], [0.3535], [0.7422], [0.9336], [0.6055], [0.2539], [0.2061], [0.3750], [0.5273], [0.6523], [0.9648], [0.4863], [0.4668], [0.1992], [0.3516], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.8320], [1.0000], [0.7148], [0.3340], [0.2500], [0.3340], [0.6016], [0.8008], [1.0000], [0.6016], [0.5000], [0.2500], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00124359130859375 loss: 0.001861572265625 loss: 0.00139617919921875 loss: 0.0015716552734375 predicted value: tensor([[0.3652], [0.3906], [0.9219], [0.7695], [0.9766], [0.9453], [0.9688], [0.2520], [0.6055], [0.9883], [0.5273], [0.5547], [0.3672], [0.3555], [0.3672], [0.1260]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.8320], [1.0000], [1.0000], [1.0000], [0.2500], [0.8008], [1.0000], [0.7500], [0.6016], [0.4004], [0.4004], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023040771484375 loss: 0.00116729736328125 loss: 0.001983642578125 loss: 0.003875732421875 35%|███▍ | 171/492 [1:31:00<2:49:02, 31.60s/it] {'loss': 0.0062, 'learning_rate': 1e-05, 'epoch': 0.35} 35%|███▍ | 171/492 [1:31:00<2:49:02, 
31.60s/it]predicted value: tensor([[0.9375], [0.1943], [0.3652], [0.9297], [0.6953], [0.3770], [0.2617], [0.6953], [0.9688], [0.3945], [0.2070], [0.2559], [0.3340], [0.4531], [0.3770], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [0.4668], [1.0000], [0.8320], [0.4668], [0.3340], [0.7500], [1.0000], [0.4668], [0.0400], [0.2500], [0.4004], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00152587890625 loss: 0.0015411376953125loss: 0.0029296875 loss: 0.000789642333984375 predicted value: tensor([[0.5117], [0.9609], [0.7539], [0.6641], [0.2520], [0.6641], [0.2891], [0.9531], [0.7539], [0.4980], [0.9805], [0.3379], [0.4160], [0.2793], [0.3867], [0.1660]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8008], [0.8008], [0.2002], [0.8008], [0.3340], [1.0000], [0.6680], [0.3340], [1.0000], [0.4004], [0.4004], [0.4004], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00121307373046875 loss: 0.0016937255859375 loss: 0.00165557861328125 loss: 0.001617431640625 predicted value: tensor([[0.4531], [0.4766], [0.4258], [0.7148], [0.9844], [0.3672], [0.2305], [0.2197], [0.6133], [0.5820], [0.7344], [0.6523], [0.3906], [0.3418], [0.3320], [0.1416]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4004], [0.3750], [0.8008], [1.0000], [0.4668], [0.2500], [0.2500], [0.6016], [0.4668], [0.8008], [0.7500], [0.3340], [0.3340], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0014190673828125 loss: 0.00299072265625 loss: 0.0010528564453125 loss: 0.00183868408203125 predicted value: tensor([[0.4258], [0.6641], [0.8047], [0.6094], [0.3848], [0.9453], [0.4434], [0.2148], [0.6133], [0.3281], [0.9648], [0.4551], [0.4023], [0.1943], [0.3398], [0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.8320], [0.5000], [0.4668], [1.0000], [0.3750], 
[0.2500], [0.7500], [0.4004], [1.0000], [0.4004], [0.4004], [0.2002], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00145721435546875 loss: 0.001220703125 loss: 0.001495361328125 loss: 0.00171661376953125 35%|███▍ | 172/492 [1:31:32<2:48:27, 31.59s/it] {'loss': 0.0065, 'learning_rate': 1e-05, 'epoch': 0.35} 35%|███▍ | 172/492 [1:31:32<2:48:27, 31.59s/it]predicted value: tensor([[1.1094], [0.4668], [0.5273], [0.5156], [0.5781], [0.4961], [0.5312], [0.7734], [0.6875], [0.1865], [1.1172], [1.0938], [0.5781], [0.5273], [0.2520], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.4668], [0.5547], [0.2500], [0.6016], [0.8008], [0.6016], [0.0625], [1.0000], [1.0000], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028533935546875 loss: 0.003326416015625 loss: 0.00262451171875 loss: 0.005584716796875 predicted value: tensor([[0.6211], [0.6172], [0.8281], [0.8164], [0.5352], [1.0625], [0.3359], [0.6406], [0.7617], [1.0781], [0.3613], [0.4668], [0.3770], [0.3535], [0.4160], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.8008], [0.7148], [0.3750], [1.0000], [0.3340], [0.7500], [0.7500], [1.0000], [0.3340], [0.5000], [0.0400], [0.7500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0014190673828125 loss: 0.001953125 loss: 0.005340576171875 loss: 0.0027313232421875 predicted value: tensor([[1.1250], [0.4004], [0.8789], [0.7695], [1.1016], [0.2930], [0.6523], [0.3887], [0.3379], [0.5547], [0.5742], [1.0938], [0.7578], [0.2676], [0.2812], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.8320], [0.8008], [1.0000], [0.2500], [0.7500], [0.3340], [0.2500], [0.6016], [0.4004], [1.0000], [0.7500], [0.0400], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019989013671875 loss: 0.0027923583984375 loss: 
0.0028839111328125 loss: 0.003173828125 predicted value: tensor([[0.4980], [1.0234], [0.6836], [0.4980], [0.5703], [0.3340], [0.7148], [0.5664], [1.0781], [0.8203], [0.3066], [0.5664], [0.4922], [0.4316], [0.2539], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.6680], [0.4668], [0.5547], [0.2500], [0.6016], [0.3750], [1.0000], [0.7500], [0.2500], [0.6016], [0.5000], [0.4004], [0.1426], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0064697265625 loss: 0.0016937255859375 loss: 0.00138092041015625 loss: 0.0018463134765625 35%|███▌ | 173/492 [1:32:04<2:48:07, 31.62s/it] {'loss': 0.012, 'learning_rate': 1e-05, 'epoch': 0.35} 35%|███▌ | 173/492 [1:32:04<2:48:07, 31.62s/it]predicted value: tensor([[0.5664], [1.1250], [1.0938], [1.0547], [1.0938], [0.5742], [0.5039], [1.1250], [0.7852], [1.0859], [0.3535], [0.4629], [0.4766], [0.2354], [0.2363], [0.2412]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [1.0000], [1.0000], [1.0000], [1.0000], [0.4668], [0.3340], [1.0000], [0.7500], [1.0000], [0.2500], [0.5000], [0.4004], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013427734375 loss: 0.0019989013671875loss: 0.001953125 loss: 0.0009613037109375 predicted value: tensor([[0.6719], [0.9375], [0.5391], [0.8984], [0.6211], [0.8203], [0.3223], [0.6875], [0.7734], [0.4805], [0.6016], [0.5273], [0.4570], [0.3828], [0.4668], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.8008], [0.4668], [0.8008], [0.5000], [0.8320], [0.2500], [0.7500], [0.6016], [0.5000], [0.5000], [0.3340], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023345947265625 loss: 0.002471923828125 loss: 0.00225830078125 loss: 0.0027618408203125 predicted value: tensor([[0.8398], [0.5430], [1.0859], [0.5039], [1.1016], [0.8008], [0.3457], [1.0938], [0.4219], [0.7148], [0.4473], [0.7031], [0.3574], 
[0.4102], [0.2617], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [1.0000], [0.4668], [1.0000], [0.8008], [0.2002], [1.0000], [0.3340], [0.6016], [0.3340], [0.6016], [0.5000], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00250244140625 loss: 0.00191497802734375loss: 0.001068115234375 loss: 0.0032806396484375 predicted value: tensor([[0.5039], [0.8672], [0.5742], [1.1250], [0.6016], [1.0859], [0.7695], [1.0547], [0.5039], [0.6914], [0.5000], [0.5820], [0.2520], [0.2520], [0.2754], [0.2432]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.3750], [1.0000], [0.4668], [1.0000], [0.8008], [1.0000], [0.4668], [0.7500], [0.4004], [0.6680], [0.1670], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002960205078125 loss: 0.001953125 loss: 0.0020294189453125 loss: 0.0027923583984375 35%|███▌ | 174/492 [1:32:36<2:47:58, 31.69s/it] {'loss': 0.0086, 'learning_rate': 1e-05, 'epoch': 0.35} 35%|███▌ | 174/492 [1:32:36<2:47:58, 31.69s/it]predicted value: tensor([[0.9531], [0.2285], [0.2910], [0.3965], [0.6641], [0.4336], [0.2256], [0.5742], [0.6562], [0.4688], [0.5234], [0.3691], [0.3418], [0.1699], [0.1641], [0.1338]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [0.3340], [0.2500], [0.8008], [0.6016], [0.3340], [0.6016], [0.7500], [0.5000], [0.6016], [0.3340], [0.3340], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013580322265625 loss: 0.002593994140625 loss: 0.0019378662109375 loss: 0.0011138916015625 predicted value: tensor([[0.3906], [0.4199], [1.0000], [0.4121], [0.4141], [0.9688], [0.7031], [0.6719], [0.3008], [0.9766], [0.7383], [0.3379], [0.4336], [0.3418], [0.1562], [0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.3750], [0.4668], [1.0000], [0.7500], [0.8320], [0.2002], 
[1.0000], [0.8008], [0.2002], [0.2500], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034332275390625 loss: 0.000682830810546875 loss: 0.0019989013671875 loss: 0.00153350830078125 predicted value: tensor([[0.3926], [0.6328], [0.3828], [0.4219], [0.6562], [0.4434], [0.2812], [0.5625], [0.3848], [0.4961], [0.5312], [0.2158], [0.3477], [0.3496], [0.1455], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [0.4668], [0.3750], [0.8008], [0.5000], [0.2500], [0.6680], [0.4668], [0.5000], [0.5000], [0.2500], [0.4004], [0.3340], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013275146484375 loss: 0.0013427734375 loss: 0.0011749267578125 loss: 0.0012054443359375 predicted value: tensor([[0.2334], [0.3535], [0.5000], [0.5625], [0.2236], [0.6797], [0.2637], [0.2451], [0.9648], [0.2656], [0.1079], [0.4004], [0.1299], [0.1865], [0.1357], [0.1426]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.3750], [0.6680], [0.8008], [0.2500], [0.7500], [0.2500], [0.2500], [1.0000], [0.2500], [0.0278], [0.5000], [0.1250], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003082275390625 loss: 0.00131988525390625 loss: 0.001739501953125 loss: 0.0019378662109375 36%|███▌ | 175/492 [1:33:07<2:47:33, 31.72s/it] {'loss': 0.0069, 'learning_rate': 1e-05, 'epoch': 0.36} 36%|███▌ | 175/492 [1:33:07<2:47:33, 31.72s/it]predicted value: tensor([[0.5391], [0.5430], [0.4121], [0.9648], [0.5742], [0.4316], [0.6562], [0.3965], [0.5273], [0.4297], [0.9805], [0.3633], [0.3359], [0.3223], [0.1221], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.3750], [1.0000], [0.6016], [0.3750], [0.6680], [0.3750], [0.6680], [0.4004], [1.0000], [0.5000], [0.4004], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002899169921875 loss: 0.000545501708984375 loss: 
0.00142669677734375 loss: 0.0010986328125 predicted value: tensor([[0.5273], [0.7461], [0.7656], [0.2207], [0.3750], [0.9570], [0.2871], [0.5977], [0.5742], [0.7422], [0.9609], [0.2852], [0.2930], [0.2100], [0.1436], [0.1973]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.8008], [0.2500], [0.4668], [1.0000], [0.3340], [0.8008], [0.6016], [0.8008], [1.0000], [0.4004], [0.2500], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002838134765625 loss: 0.001373291015625loss: 0.00133514404296875 loss: 0.0026397705078125 predicted value: tensor([[0.7695], [0.3633], [0.9805], [0.3867], [0.3242], [0.3184], [0.2715], [0.5664], [0.5664], [0.4258], [0.5312], [0.3711], [0.4238], [0.1348], [0.1572], [0.1729]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [1.0000], [0.3750], [0.2500], [0.3340], [0.3340], [0.6016], [0.6016], [0.4668], [0.6016], [0.5000], [0.3340], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034027099609375 loss: 0.0027923583984375 loss: 0.00104522705078125 loss: 0.00262451171875 predicted value: tensor([[0.3926], [0.9258], [0.7461], [0.9727], [0.6602], [0.9609], [0.9570], [0.9531], [0.9375], [0.7148], [0.3496], [0.9531], [0.1504], [0.1914], [0.2012], [0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.6680], [1.0000], [0.6680], [1.0000], [1.0000], [1.0000], [1.0000], [0.6680], [0.3750], [1.0000], [0.2002], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029296875 loss: 0.000881195068359375 loss: 0.0020751953125 loss: 0.000568389892578125 36%|███▌ | 176/492 [1:33:39<2:46:47, 31.67s/it] {'loss': 0.0076, 'learning_rate': 1e-05, 'epoch': 0.36} 36%|███▌ | 176/492 [1:33:39<2:46:47, 31.67s/it]predicted value: tensor([[0.5547], [0.8281], [0.7461], [0.8945], [0.7539], [0.4902], [0.5391], [0.6914], [1.0781], [0.3945], [0.5117], [0.3555], 
[0.3242], [0.4336], [0.2930], [0.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.6016], [0.8008], [0.7500], [0.3750], [0.2500], [0.2500], [1.0000], [0.2500], [0.4668], [0.2500], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00141143798828125 loss: 0.006072998046875loss: 0.00171661376953125 loss: 0.00372314453125 predicted value: tensor([[0.3457], [0.5586], [0.5312], [0.4805], [0.5352], [0.3535], [0.5742], [1.0547], [0.5195], [0.7461], [0.8945], [0.7930], [0.5781], [0.2734], [0.2812], [0.4453]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.4668], [0.4668], [0.4668], [0.3750], [0.2500], [0.6680], [1.0000], [0.4668], [0.6680], [0.8008], [0.6680], [0.2500], [0.2500], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001861572265625 loss: 0.0034332275390625loss: 0.00154876708984375 loss: 0.0032501220703125 predicted value: tensor([[0.6055], [0.3027], [1.0703], [0.3516], [0.8945], [0.6523], [0.4922], [0.5078], [0.7266], [0.5234], [0.3516], [0.4023], [0.4062], [0.4297], [0.3750], [0.2432]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.2500], [1.0000], [0.2002], [0.8008], [0.8008], [0.4668], [0.4668], [0.6680], [0.5000], [0.3340], [0.4004], [0.4004], [0.5000], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021820068359375 loss: 0.0023040771484375 loss: 0.00122833251953125 loss: 0.002593994140625 predicted value: tensor([[0.4258], [0.4941], [0.8008], [0.3281], [0.8125], [0.8008], [0.5938], [0.3496], [0.6133], [0.7070], [0.5469], [0.5469], [0.4863], [0.3066], [0.4922], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.3750], [0.6680], [0.2500], [0.7500], [0.7500], [0.6016], [0.2002], [0.3750], [0.7500], [0.6016], [0.4668], [0.4668], [0.1670], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.00113677978515625 loss: 0.0050048828125 loss: 0.002593994140625loss: 0.00286865234375 36%|███▌ | 177/492 [1:34:10<2:44:57, 31.42s/it] {'loss': 0.0107, 'learning_rate': 1e-05, 'epoch': 0.36} 36%|███▌ | 177/492 [1:34:10<2:44:57, 31.42s/it]predicted value: tensor([[0.6758], [0.5312], [0.8164], [0.7852], [0.3242], [1.0312], [1.0391], [1.0391], [0.5898], [1.0234], [0.7148], [0.3340], [0.4355], [0.2500], [0.2236], [0.2949]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5312], [0.3750], [0.8320], [0.8008], [0.2500], [1.0000], [1.0000], [1.0000], [0.4668], [1.0000], [0.6016], [0.3340], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004241943359375 loss: 0.002044677734375 loss: 0.00165557861328125 loss: 0.00110626220703125 predicted value: tensor([[1.0234], [1.0156], [0.5430], [0.5078], [0.4004], [0.6250], [1.0547], [1.0469], [1.0234], [0.5234], [0.6680], [0.7422], [0.4980], [0.4531], [0.2773], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.4668], [0.2002], [0.8008], [1.0000], [1.0000], [1.0000], [0.3750], [0.6016], [0.6016], [0.5000], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001861572265625 loss: 0.0021820068359375loss: 0.0015106201171875 loss: 0.0031585693359375 predicted value: tensor([[1.0234], [0.3359], [1.0781], [0.3027], [1.0000], [0.5156], [0.7500], [1.0312], [0.5820], [0.7031], [0.6953], [0.2695], [0.6523], [0.2656], [0.2715], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [1.0000], [0.2002], [1.0000], [0.4668], [0.7500], [1.0000], [0.5000], [0.7500], [0.6016], [0.0625], [0.8008], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002899169921875 loss: 0.0019683837890625 loss: 0.0013580322265625 loss: 0.00179290771484375 predicted value: tensor([[0.5234], [1.0234], [0.4941], [0.4629], [0.7773], [1.0625], [0.8242], 
[0.3516], [1.0234], [0.6836], [0.7656], [0.6836], [0.4375], [1.0234], [0.2676], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [0.3145], [0.7500], [1.0000], [0.8008], [0.3340], [1.0000], [0.6016], [0.6680], [0.5703], [0.5000], [1.0000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003570556640625 loss: 0.005157470703125 loss: 0.0016021728515625 loss: 0.003173828125 36%|███▌ | 178/492 [1:34:42<2:46:08, 31.75s/it] {'loss': 0.0098, 'learning_rate': 1e-05, 'epoch': 0.36} 36%|███▌ | 178/492 [1:34:42<2:46:08, 31.75s/it]predicted value: tensor([[0.6289], [0.3828], [0.3828], [0.1523], [0.4844], [0.4180], [0.2930], [0.3613], [0.2520], [0.6172], [0.3301], [0.3926], [0.1719], [0.4121], [0.1758], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.2500], [0.4668], [0.2500], [0.4668], [0.3750], [0.3340], [0.4668], [0.3340], [0.6016], [0.4004], [0.2002], [0.2002], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00091552734375 loss: 0.00168609619140625loss: 0.00140380859375 loss: 0.0013885498046875 predicted value: tensor([[0.4785], [0.2930], [0.7812], [0.4629], [0.4688], [0.9297], [0.4316], [0.6094], [0.6484], [0.6445], [0.4180], [0.4688], [0.5273], [0.1514], [0.2080], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.1670], [0.8008], [0.4668], [0.4668], [1.0000], [0.8008], [0.6016], [0.2500], [0.6016], [0.4668], [0.5000], [0.6016], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004486083984375 loss: 0.0052490234375 loss: 0.00139617919921875 loss: 0.003509521484375 predicted value: tensor([[0.4512], [0.3750], [0.1797], [0.7070], [0.9062], [0.4746], [0.4043], [0.5547], [0.6406], [0.6406], [0.4883], [0.4941], [0.1846], [0.1768], [0.1602], [0.1768]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.4668], [0.2002], 
[0.8008], [1.0000], [0.6016], [0.4668], [0.6016], [0.6016], [0.8008], [0.5000], [0.5000], [0.2002], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000972747802734375 loss: 0.0018157958984375loss: 0.00125885009765625 loss: 0.00060272216796875 predicted value: tensor([[0.9375], [0.7695], [0.2168], [0.2236], [0.5352], [0.9297], [0.5742], [0.1865], [0.5195], [0.5039], [0.3379], [0.6094], [0.1924], [0.1602], [0.1719], [0.1572]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.2002], [0.2002], [0.5000], [1.0000], [0.6016], [0.2002], [0.5000], [0.6016], [0.4004], [0.7500], [0.2002], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002410888671875 loss: 0.00131988525390625 loss: 0.000812530517578125loss: 0.00110626220703125 36%|███▋ | 179/492 [1:35:14<2:46:21, 31.89s/it] {'loss': 0.0076, 'learning_rate': 1e-05, 'epoch': 0.36} 36%|███▋ | 179/492 [1:35:14<2:46:21, 31.89s/it]predicted value: tensor([[0.4785], [0.4355], [0.7344], [0.5938], [0.3887], [0.6250], [0.3906], [0.4004], [0.5859], [0.4297], [0.3008], [0.3730], [0.5742], [0.2275], [0.1963], [0.1914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.8008], [0.7500], [0.2002], [0.4668], [0.3750], [0.3750], [0.7500], [0.4668], [0.3340], [0.4668], [0.6016], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00162506103515625 loss: 0.002655029296875 loss: 0.002166748046875 loss: 0.00131988525390625 predicted value: tensor([[0.9453], [0.2061], [0.7344], [0.7891], [0.9570], [0.9102], [0.3672], [0.5742], [0.9570], [0.3652], [0.6133], [0.5508], [0.4082], [0.2373], [0.2129], [0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [0.5547], [0.8320], [1.0000], [1.0000], [0.4668], [0.6016], [1.0000], [0.5000], [0.7500], [0.5000], [0.5000], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.00067901611328125 loss: 0.0019683837890625loss: 0.0018463134765625 loss: 0.0017852783203125 predicted value: tensor([[0.9141], [0.9883], [0.1719], [0.3770], [0.8008], [0.1177], [0.3867], [0.9414], [0.9297], [0.7578], [0.3418], [0.5508], [0.4336], [0.4590], [0.1836], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.2002], [0.4668], [0.8008], [0.0625], [0.4668], [1.0000], [1.0000], [0.8008], [0.4004], [0.6016], [0.4004], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003326416015625 loss: 0.0008087158203125loss: 0.00138092041015625 loss: 0.002593994140625 predicted value: tensor([[0.5078], [0.9727], [0.3301], [0.7031], [0.9336], [0.9062], [0.5469], [0.4180], [0.9219], [0.6641], [0.3477], [0.3809], [0.0688], [0.3379], [0.1982], [0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3750], [0.6680], [1.0000], [1.0000], [0.5000], [0.5000], [1.0000], [0.7500], [0.4004], [0.4004], [0.0278], [0.3340], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0022735595703125 loss: 0.00185394287109375 loss: 0.00080108642578125 loss: 0.0034027099609375 37%|███▋ | 180/492 [1:35:45<2:44:32, 31.64s/it] {'loss': 0.0076, 'learning_rate': 1e-05, 'epoch': 0.37} 37%|███▋ | 180/492 [1:35:45<2:44:32, 31.64s/it]predicted value: tensor([[0.6250], [0.5391], [0.4805], [0.8633], [0.2441], [0.7539], [0.3320], [0.5039], [0.5469], [0.5234], [0.7656], [0.5234], [0.2490], [0.3047], [0.2715], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.4668], [0.4668], [0.8008], [0.2500], [0.6016], [0.1670], [0.4668], [0.4668], [0.5000], [0.7500], [0.5000], [0.1670], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010986328125 loss: 0.001373291015625 loss: 0.003875732421875 loss: 0.001129150390625 predicted value: tensor([[0.4707], [0.4707], [0.4297], [0.3223], [1.0625], [0.4062], 
[0.7148], [1.0547], [0.6719], [0.4766], [0.7031], [0.4395], [0.7422], [0.5703], [0.2451], [0.2432]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.2002], [1.0000], [0.3340], [0.6016], [1.0000], [0.7500], [0.4004], [0.6016], [0.5000], [0.6016], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023956298828125 loss: 0.003021240234375loss: 0.00165557861328125 loss: 0.0033721923828125 predicted value: tensor([[0.6523], [0.5078], [0.8750], [0.9141], [0.4297], [0.8945], [1.1016], [0.2539], [0.3984], [0.2871], [0.6562], [0.1177], [0.5039], [0.2695], [0.5195], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8320], [0.8008], [0.3145], [0.8008], [1.0000], [0.2002], [0.2500], [0.3340], [0.6016], [0.0400], [0.4004], [0.2002], [0.5000], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001983642578125 loss: 0.00177001953125 loss: 0.00165557861328125 loss: 0.00046539306640625 predicted value: tensor([[0.5312], [0.3340], [1.0703], [0.9062], [1.0938], [1.0625], [1.0703], [0.8594], [0.5625], [1.0938], [0.3828], [0.4980], [0.4961], [0.2734], [0.2695], [0.3242]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [1.0000], [0.8008], [1.0000], [1.0000], [1.0000], [0.8008], [0.5000], [1.0000], [0.2500], [0.5000], [0.3750], [0.0625], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001678466796875 loss: 0.0030059814453125 loss: 0.002471923828125 loss: 0.003997802734375 37%|███▋ | 181/492 [1:36:17<2:43:45, 31.59s/it] {'loss': 0.0087, 'learning_rate': 1e-05, 'epoch': 0.37} 37%|███▋ | 181/492 [1:36:17<2:43:45, 31.59s/it]predicted value: tensor([[0.4688], [0.4805], [0.4746], [0.4727], [0.4727], [0.3438], [0.7930], [0.4844], [0.5664], [1.0703], [0.8203], [0.4453], [0.2910], [0.4414], [0.2520], [0.2949]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], 
[0.4668], [0.4668], [0.3750], [0.4668], [0.3340], [0.8008], [0.2500], [0.4004], [1.0000], [0.3750], [0.3340], [0.2002], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023040771484375 loss: 0.001739501953125 loss: 0.005279541015625 loss: 0.001220703125 predicted value: tensor([[0.8203], [1.0781], [1.0703], [0.5039], [1.0703], [0.6367], [0.4980], [0.5312], [1.0781], [0.5977], [0.7578], [1.0859], [0.6914], [0.2832], [0.2773], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [1.0000], [1.0000], [0.4668], [1.0000], [0.6016], [0.4668], [0.4668], [1.0000], [0.6016], [0.8008], [1.0000], [0.7500], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00116729736328125 loss: 0.00110626220703125 loss: 0.003387451171875 loss: 0.00127410888671875 predicted value: tensor([[0.4941], [0.7695], [0.3066], [0.3652], [0.5156], [0.5195], [1.0781], [0.4941], [1.0703], [0.7617], [0.2188], [0.4141], [0.7227], [0.2373], [0.2656], [0.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5703], [0.2500], [0.3340], [0.4668], [0.4668], [1.0000], [0.4668], [1.0000], [0.7500], [0.5000], [0.3340], [0.7500], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007781982421875 loss: 0.0025787353515625 loss: 0.00262451171875 loss: 0.0031890869140625 predicted value: tensor([[0.4785], [0.6094], [0.8594], [1.0625], [0.2852], [0.3789], [0.7227], [0.6172], [1.0469], [0.8438], [0.5117], [0.7227], [0.4648], [0.4141], [0.3027], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2734], [0.8008], [1.0000], [0.2500], [0.3340], [0.7500], [0.5000], [1.0000], [0.7500], [0.4668], [0.6680], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002655029296875 loss: 0.000835418701171875 loss: 0.0023040771484375 loss: 0.0028076171875 37%|███▋ | 182/492 [1:36:48<2:42:59, 31.55s/it] 
{'loss': 0.0106, 'learning_rate': 1e-05, 'epoch': 0.37} 37%|███▋ | 182/492 [1:36:48<2:42:59, 31.55s/it]predicted value: tensor([[0.9219], [0.6328], [0.3730], [0.2158], [0.4395], [0.4160], [0.3633], [0.9844], [0.4473], [0.8008], [0.3281], [0.3145], [0.3398], [0.9844], [0.2041], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.4668], [0.3340], [0.4668], [0.4668], [0.3750], [1.0000], [0.6016], [0.8008], [0.3340], [0.4004], [0.4004], [1.0000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000492095947265625 loss: 0.001190185546875loss: 0.0010986328125 loss: 0.000762939453125 predicted value: tensor([[0.6992], [0.4824], [0.3887], [0.4082], [0.4004], [0.4258], [0.4766], [0.9688], [0.6992], [0.3145], [0.3535], [0.1797], [0.2754], [0.1855], [0.1855], [0.1855]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.3750], [0.4668], [0.3750], [0.3750], [0.4668], [1.0000], [0.6680], [0.4004], [0.4004], [0.2500], [0.4004], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00311279296875 loss: 0.0009613037109375 loss: 0.00148773193359375 loss: 0.000949859619140625 predicted value: tensor([[1.0078], [0.3750], [0.2695], [0.3887], [1.0234], [0.5547], [0.5898], [0.7070], [0.5391], [0.0500], [0.5117], [0.4219], [0.7461], [0.1738], [0.4062], [0.1865]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3750], [0.4668], [1.0000], [0.2500], [0.6016], [0.8008], [0.6016], [0.0625], [0.5000], [0.5000], [0.8008], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00102996826171875 loss: 0.002227783203125loss: 0.00162506103515625 loss: 0.0023651123046875 predicted value: tensor([[0.4004], [0.7969], [0.4531], [0.6719], [0.3340], [0.5039], [0.5000], [1.0000], [0.5586], [0.2334], [0.3125], [0.3398], [0.3965], [0.1885], [0.1914], [0.1885]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) value label: tensor([[0.6172], [0.8008], [0.5547], [0.6680], [0.3750], [0.4668], [0.5000], [1.0000], [0.6016], [0.2500], [0.4004], [0.4004], [0.3340], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010833740234375 loss: 0.001983642578125 loss: 0.00145721435546875 loss: 0.00128173828125 37%|███▋ | 183/492 [1:37:20<2:42:15, 31.51s/it] {'loss': 0.0058, 'learning_rate': 1e-05, 'epoch': 0.37} 37%|███▋ | 183/492 [1:37:20<2:42:15, 31.51s/it]predicted value: tensor([[0.9883], [0.5234], [0.1973], [0.9688], [0.5117], [0.2119], [0.5156], [0.2217], [0.4453], [0.9961], [0.3887], [0.4375], [0.3574], [0.2041], [0.1631], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.2002], [1.0000], [0.4668], [0.2500], [0.6016], [0.2500], [0.6016], [1.0000], [0.8008], [0.4004], [0.4004], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010833740234375 loss: 0.00335693359375loss: 0.002593994140625 loss: 0.0012969970703125 predicted value: tensor([[0.9844], [0.3691], [0.4160], [0.6875], [0.9766], [0.2578], [0.6289], [0.7109], [0.4082], [1.0000], [0.4102], [0.3770], [0.3965], [0.2480], [0.1982], [0.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.4668], [0.8008], [1.0000], [0.3340], [0.8008], [0.7500], [0.4668], [1.0000], [0.3145], [0.4668], [0.4668], [0.3340], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00311279296875 loss: 0.001434326171875 loss: 0.00128936767578125 loss: 0.0012359619140625 predicted value: tensor([[0.4492], [0.9766], [0.1963], [0.6875], [0.3965], [1.0234], [0.7109], [0.9727], [0.5117], [0.5938], [0.1777], [0.5547], [0.5977], [0.1963], [0.2041], [0.2168]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.2002], [0.8008], [0.4668], [1.0000], [0.8008], [1.0000], [0.5000], [0.6016], [0.2500], [0.7500], [0.6016], [0.2002], [0.2002], 
[0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0048828125 loss: 0.00112152099609375 loss: 0.005096435546875 loss: 0.00177001953125 predicted value: tensor([[0.9922], [0.6406], [0.9727], [0.3633], [0.9648], [0.3789], [0.6836], [0.2061], [0.6250], [0.2441], [0.1680], [0.1191], [1.0234], [0.1973], [0.0840], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.4668], [1.0000], [0.4668], [0.8008], [0.2500], [0.7500], [0.2002], [0.2500], [0.0625], [1.0000], [0.2002], [0.0278], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00078582763671875 loss: 0.002410888671875loss: 0.004669189453125 loss: 0.00153350830078125 37%|███▋ | 184/492 [1:37:52<2:42:23, 31.64s/it] {'loss': 0.0094, 'learning_rate': 1e-05, 'epoch': 0.37} 37%|███▋ | 184/492 [1:37:52<2:42:23, 31.64s/it]predicted value: tensor([[0.6094], [0.8828], [0.6016], [0.4688], [0.5039], [0.3594], [0.5938], [0.3770], [0.6211], [0.6289], [0.5430], [0.6250], [0.5664], [1.0703], [0.3105], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.8008], [0.3750], [0.3750], [0.3340], [0.5000], [0.2500], [0.5000], [0.5000], [0.5000], [0.6016], [0.6016], [1.0000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003662109375 loss: 0.0025634765625loss: 0.002777099609375 loss: 0.00244140625 predicted value: tensor([[0.6484], [0.5273], [0.3066], [1.0703], [0.8906], [0.3340], [0.3125], [0.5977], [0.2930], [0.5898], [1.0547], [0.3594], [0.2461], [0.6328], [0.2793], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.2002], [1.0000], [0.8008], [0.2500], [0.2500], [0.5000], [0.2500], [0.5547], [1.0000], [0.2500], [0.1670], [0.6016], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023956298828125 loss: 0.001312255859375 loss: 0.0042724609375 loss: 0.0029296875 predicted value: tensor([[0.9648], [0.4453], [0.9531], 
[0.4297], [1.0625], [0.4277], [0.7539], [0.3457], [0.3145], [0.1914], [0.6641], [0.4863], [0.4980], [0.5039], [0.2734], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.3340], [0.8008], [0.3340], [1.0000], [0.2500], [0.8555], [0.2500], [0.2002], [0.0625], [0.6016], [0.5000], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024566650390625 loss: 0.00262451171875loss: 0.00142669677734375 loss: 0.00173187255859375 predicted value: tensor([[0.5898], [1.0703], [0.8438], [1.0703], [0.3184], [0.8633], [0.2715], [0.5117], [0.6523], [1.0625], [0.5938], [0.5000], [0.4648], [0.4531], [0.2910], [0.4883]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8008], [1.0000], [0.2002], [0.8008], [0.2500], [0.4668], [0.6016], [1.0000], [0.5000], [0.5000], [0.4004], [0.7500], [0.2500], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00457763671875 loss: 0.002227783203125loss: 0.0030059814453125 loss: 0.0020751953125 38%|███▊ | 185/492 [1:38:24<2:42:33, 31.77s/it] {'loss': 0.0106, 'learning_rate': 1e-05, 'epoch': 0.38} 38%|███▊ | 185/492 [1:38:24<2:42:33, 31.77s/it]predicted value: tensor([[0.5195], [1.0391], [1.0078], [1.0547], [0.6914], [0.3848], [1.0156], [0.3223], [0.8398], [1.0000], [0.3574], [0.4707], [0.4336], [0.4043], [0.2500], [0.2314]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [1.0000], [0.8320], [0.2500], [1.0000], [0.2500], [0.8008], [1.0000], [0.2002], [0.5000], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029144287109375 loss: 0.0013427734375 loss: 0.0019073486328125 loss: 0.002288818359375 predicted value: tensor([[0.5352], [0.6367], [0.8516], [0.4980], [0.3535], [0.4199], [0.5391], [0.7734], [0.3516], [1.0469], [1.0234], [0.5977], [0.4707], [0.5117], [0.2734], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.4668], [0.6016], [0.8008], [0.4668], [0.2500], [0.2500], [0.4668], [0.4668], [0.2500], [1.0000], [1.0000], [0.5000], [0.3340], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00421142578125 loss: 0.0030975341796875 loss: 0.00183868408203125 loss: 0.0033111572265625 predicted value: tensor([[1.0547], [0.8164], [0.8320], [0.7578], [1.0469], [0.5352], [0.6641], [0.7383], [0.2559], [0.7188], [1.0547], [0.3613], [0.3574], [0.4004], [0.2305], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.8008], [0.6680], [1.0000], [0.4668], [0.6016], [0.7500], [0.2002], [0.6680], [1.0000], [0.3340], [0.2500], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001373291015625 loss: 0.0007171630859375loss: 0.0018463134765625 loss: 0.004150390625 predicted value: tensor([[0.3555], [0.9062], [0.4570], [0.8750], [1.0391], [0.8555], [0.8750], [0.6953], [1.0703], [0.7734], [0.7227], [0.6562], [0.5391], [0.4004], [0.4102], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.8320], [0.3145], [0.8320], [1.0000], [0.8555], [0.8008], [0.6016], [1.0000], [0.7500], [0.7500], [0.5000], [0.6680], [0.3340], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00183868408203125 loss: 0.000751495361328125 loss: 0.00098419189453125 loss: 0.0017242431640625 38%|███▊ | 186/492 [1:38:55<2:41:36, 31.69s/it] {'loss': 0.0086, 'learning_rate': 1e-05, 'epoch': 0.38} 38%|███▊ | 186/492 [1:38:55<2:41:36, 31.69s/it]predicted value: tensor([[0.9102], [0.4531], [0.6758], [0.2275], [0.8984], [0.4141], [0.3984], [0.9258], [0.3945], [0.4727], [0.6523], [0.3887], [0.3418], [0.3633], [0.3125], [0.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [0.2500], [1.0000], [0.3145], [0.3145], [1.0000], [0.4004], [0.4668], [0.7500], [0.4004], [0.3340], [0.4004], [0.5000], [0.2002]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.00177764892578125 loss: 0.001617431640625loss: 0.00099945068359375 loss: 0.000926971435546875 predicted value: tensor([[0.9219], [0.9297], [0.4336], [0.4902], [0.2969], [0.3262], [0.7461], [0.2559], [0.6289], [0.6055], [0.2383], [0.5430], [0.2695], [0.3633], [0.3105], [0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.3750], [0.5547], [0.3340], [0.3340], [0.8008], [0.2500], [0.7500], [0.6016], [0.3340], [0.5547], [0.2002], [0.4004], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00081634521484375 loss: 0.004150390625 loss: 0.0009002685546875 loss: 0.0025482177734375 predicted value: tensor([[0.4883], [0.4590], [0.5117], [0.4297], [0.2852], [0.4375], [0.6211], [0.7148], [0.4395], [0.2559], [0.7344], [0.4355], [0.4277], [0.1973], [0.3555], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.5547], [0.4668], [0.3340], [0.4668], [0.4668], [0.8008], [0.4668], [0.7500], [0.8008], [0.4668], [0.5000], [0.2500], [0.3340], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000713348388671875 loss: 0.0019378662109375 loss: 0.00482177734375 loss: 0.00148773193359375 predicted value: tensor([[0.4414], [0.4082], [0.9414], [0.4082], [0.4141], [0.6758], [0.4219], [0.6836], [0.4395], [0.2412], [0.5078], [0.3984], [0.9141], [0.5703], [0.3809], [0.1992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [1.0000], [0.4668], [0.4668], [0.6680], [0.4668], [0.4668], [0.3750], [0.3340], [0.6016], [0.4004], [1.0000], [0.6016], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000820159912109375 loss: 0.0012664794921875 loss: 0.00173187255859375 loss: 0.0015716552734375 38%|███▊ | 187/492 [1:39:27<2:40:48, 31.63s/it] {'loss': 0.007, 'learning_rate': 1e-05, 'epoch': 0.38} 38%|███▊ | 187/492 [1:39:27<2:40:48, 31.63s/it]predicted value: tensor([[0.4375], 
[0.7070], [0.7031], [0.9609], [0.2012], [0.6797], [0.2285], [0.3301], [0.9414], [0.5781], [0.5430], [0.5898], [0.5859], [0.2217], [0.1982], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.8008], [1.0000], [0.2002], [0.6680], [0.1670], [0.3340], [1.0000], [0.5547], [0.6016], [0.6016], [0.7500], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025634765625 loss: 0.00099945068359375loss: 0.0011138916015625 loss: 0.00150299072265625 predicted value: tensor([[0.5742], [0.3730], [0.4219], [0.4199], [0.9336], [0.9492], [0.6680], [0.9102], [0.3262], [0.6055], [0.5898], [0.3027], [0.4238], [0.3574], [0.1992], [0.1680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3145], [0.4668], [0.4668], [1.0000], [1.0000], [0.7500], [1.0000], [0.4004], [0.6016], [0.7500], [0.2500], [0.5000], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00162506103515625 loss: 0.001220703125 loss: 0.002288818359375 loss: 0.0022430419921875 predicted value: tensor([[0.9414], [0.7617], [0.5078], [0.6953], [0.2373], [0.6797], [0.2119], [0.4531], [0.5938], [0.7148], [0.6094], [0.3164], [0.4121], [0.4316], [0.1699], [0.1982]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.5547], [0.6680], [0.3340], [0.8008], [0.2500], [0.4668], [0.8008], [0.6680], [0.6016], [0.2002], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009918212890625 loss: 0.0015411376953125loss: 0.004180908203125 loss: 0.00106048583984375 predicted value: tensor([[0.7812], [0.5039], [0.9180], [0.7422], [0.4219], [0.5977], [0.7188], [0.4473], [0.7617], [0.6367], [0.2891], [0.4375], [0.1699], [0.3926], [0.2012], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [1.0000], [0.8008], [0.3750], [0.8008], [0.8008], [0.5000], [0.8008], [0.6680], [0.3340], 
[0.3340], [0.2500], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001312255859375 loss: 0.00112152099609375 loss: 0.0014495849609375 loss: 0.000690460205078125 38%|███▊ | 188/492 [1:39:58<2:39:55, 31.57s/it] {'loss': 0.0065, 'learning_rate': 1e-05, 'epoch': 0.38} 38%|███▊ | 188/492 [1:39:58<2:39:55, 31.57s/it]predicted value: tensor([[0.4980], [1.0625], [0.5352], [1.0391], [0.5156], [0.3262], [0.5664], [0.7773], [0.7695], [0.5547], [0.7656], [0.4219], [0.4512], [0.2432], [0.2334], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [1.0000], [0.4668], [0.2500], [0.4668], [0.7500], [0.7500], [0.5000], [0.7500], [0.4004], [0.4004], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024871826171875 loss: 0.0034942626953125 loss: 0.00093841552734375 loss: 0.0029449462890625 predicted value: tensor([[0.6250], [0.5195], [0.3906], [1.0781], [1.0703], [0.7500], [0.5703], [0.8398], [0.4121], [0.6836], [0.6758], [0.5078], [0.4785], [0.2412], [0.2949], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.2500], [1.0000], [1.0000], [0.6680], [0.6016], [0.8008], [0.7500], [0.6016], [0.6016], [0.3340], [0.5000], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00186920166015625 loss: 0.0035858154296875loss: 0.002197265625 loss: 0.0045166015625 predicted value: tensor([[0.6250], [0.4844], [0.8164], [0.5391], [0.5430], [0.7031], [0.5000], [0.8008], [1.0781], [0.7305], [0.6250], [1.0547], [0.4238], [0.3809], [0.2295], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.6680], [0.3750], [0.4668], [0.7500], [0.4668], [0.8008], [1.0000], [0.6016], [0.5000], [1.0000], [0.5000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023345947265625 loss: 0.002044677734375 loss: 0.0009918212890625 loss: 0.00439453125 
predicted value: tensor([[0.5430], [0.3418], [0.4883], [1.0469], [0.5312], [0.5859], [1.0312], [0.8359], [1.0625], [0.5117], [0.3711], [0.4922], [0.4258], [0.2334], [0.2441], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3340], [0.3750], [1.0000], [0.3750], [0.4668], [1.0000], [0.8008], [1.0000], [0.4004], [0.2500], [0.5000], [0.4004], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002410888671875 loss: 0.00154876708984375 loss: 0.004486083984375 loss: 0.00162506103515625 38%|███▊ | 189/492 [1:40:30<2:39:41, 31.62s/it] {'loss': 0.0105, 'learning_rate': 1e-05, 'epoch': 0.38} 38%|███▊ | 189/492 [1:40:30<2:39:41, 31.62s/it]predicted value: tensor([[0.6328], [0.6250], [0.8906], [0.4570], [0.5664], [1.0625], [0.6523], [0.7891], [0.6953], [0.8477], [0.7383], [0.1729], [0.4902], [0.4238], [0.2402], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.8008], [0.3750], [0.4668], [1.0000], [0.6016], [0.6680], [0.8008], [0.8008], [0.7500], [0.0400], [0.4004], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00164031982421875 loss: 0.00191497802734375loss: 0.002410888671875 loss: 0.00107574462890625 predicted value: tensor([[0.6094], [0.4785], [0.5430], [0.8242], [1.0625], [0.5781], [0.8945], [0.4844], [0.8477], [0.5273], [0.4727], [0.5469], [0.4492], [0.4668], [0.2598], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], [0.8008], [1.0000], [0.4668], [0.8008], [0.4668], [0.8008], [0.3750], [0.4004], [0.4668], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00182342529296875 loss: 0.00156402587890625loss: 0.00238037109375 loss: 0.0011444091796875 predicted value: tensor([[0.5859], [0.9062], [0.4844], [1.0859], [0.9258], [0.3535], [0.8008], [1.0938], [0.3359], [0.4980], [0.5117], [0.4883], [0.2988], [0.2578], [0.4590], 
[0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.4668], [1.0000], [0.8320], [0.2500], [0.8008], [1.0000], [0.2500], [0.5000], [0.3750], [0.3340], [0.2002], [0.2002], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00157928466796875 loss: 0.0019073486328125 loss: 0.0022430419921875 loss: 0.00225830078125 predicted value: tensor([[0.8750], [1.0703], [0.2969], [0.8164], [0.8164], [0.5117], [0.2715], [0.4668], [0.6875], [0.4941], [0.1572], [0.4492], [0.6562], [0.2285], [0.2715], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.2002], [0.8008], [0.8008], [0.3750], [0.2500], [0.2002], [0.6016], [0.2852], [0.0625], [0.4004], [0.5000], [0.0625], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034942626953125 loss: 0.00142669677734375 loss: 0.00213623046875 loss: 0.0030364990234375 39%|███▊ | 190/492 [1:41:01<2:38:55, 31.57s/it] {'loss': 0.008, 'learning_rate': 1e-05, 'epoch': 0.39} 39%|███▊ | 190/492 [1:41:01<2:38:55, 31.57s/it]predicted value: tensor([[0.3926], [0.3418], [0.1406], [0.9766], [0.5195], [0.2832], [0.6562], [1.0078], [0.9961], [0.4629], [1.0391], [0.4004], [0.0315], [0.4023], [0.1562], [0.1611]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.2500], [1.0000], [0.7500], [0.4668], [0.8008], [1.0000], [1.0000], [0.2500], [1.0000], [0.5000], [0.0204], [0.5000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021820068359375 loss: 0.003387451171875 loss: 0.00201416015625 loss: 0.0012054443359375 predicted value: tensor([[0.5977], [0.9844], [0.1641], [1.0078], [0.4121], [0.0566], [0.6289], [0.4922], [0.6992], [0.3594], [0.3301], [0.5977], [0.6523], [0.1328], [0.1416], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [1.0000], [0.2500], [1.0000], [0.4668], [0.2500], [0.5000], [0.6680], [0.8008], [0.4668], 
[0.4004], [0.7500], [0.6016], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026397705078125 loss: 0.0029144287109375 loss: 0.00144195556640625 loss: 0.000453948974609375 predicted value: tensor([[0.4746], [0.9805], [0.7578], [0.7539], [0.9727], [0.9961], [1.0312], [0.7656], [0.2129], [0.1660], [1.0312], [0.2539], [0.3848], [0.4473], [0.1621], [0.1182]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8008], [0.8555], [1.0000], [1.0000], [1.0000], [0.8008], [0.2500], [0.2500], [1.0000], [0.2500], [0.3340], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005889892578125 loss: 0.000789642333984375loss: 0.0008392333984375 loss: 0.00182342529296875 predicted value: tensor([[0.3828], [0.1777], [0.3242], [0.6836], [0.4785], [0.3223], [0.5664], [1.0234], [0.2715], [1.0312], [0.2363], [0.3613], [0.3438], [0.3652], [0.1455], [0.1836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3340], [0.3145], [0.6680], [0.5547], [0.4668], [0.4668], [1.0000], [0.6016], [1.0000], [0.3340], [0.5000], [0.5000], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00213623046875 loss: 0.0021209716796875 loss: 0.0016937255859375 loss: 0.003692626953125 39%|███▉ | 191/492 [1:41:33<2:37:43, 31.44s/it] {'loss': 0.0088, 'learning_rate': 1e-05, 'epoch': 0.39} 39%|███▉ | 191/492 [1:41:33<2:37:43, 31.44s/it]predicted value: tensor([[0.3770], [0.4062], [0.3633], [0.2930], [0.9922], [0.9922], [1.0156], [0.7266], [0.0464], [0.2852], [0.4473], [0.3203], [0.3281], [0.5430], [0.1328], [0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.3750], [0.2500], [1.0000], [1.0000], [1.0000], [0.8008], [0.0400], [0.3340], [0.4668], [0.3340], [0.3340], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0004215240478515625 loss: 0.0004367828369140625loss: 
0.000492095947265625 loss: 0.00176239013671875 predicted value: tensor([[0.3848], [0.1885], [0.4648], [0.2314], [0.1924], [0.5781], [0.4746], [0.6328], [0.5938], [0.9609], [0.4219], [0.4102], [0.3457], [0.4258], [0.1582], [0.1973]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.5547], [0.2500], [0.3340], [0.6016], [0.5547], [0.6680], [0.6172], [1.0000], [0.5000], [0.4004], [0.3340], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0006103515625 loss: 0.001068115234375 loss: 0.00091552734375 loss: 0.00238037109375 predicted value: tensor([[0.5664], [0.2754], [0.3574], [0.7695], [0.6875], [0.6719], [0.3809], [0.6133], [0.7188], [0.4746], [0.9883], [0.3086], [0.3477], [0.6055], [0.1196], [0.1318]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4004], [0.4668], [0.4668], [0.8008], [0.4668], [0.6680], [0.4668], [0.7500], [0.8008], [0.3750], [1.0000], [0.5000], [0.5000], [0.6016], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000827789306640625 loss: 0.003631591796875 loss: 0.0020294189453125 loss: 0.000904083251953125 predicted value: tensor([[0.4160], [0.7773], [0.2031], [0.5820], [0.6836], [0.9883], [0.3145], [1.0000], [0.9922], [0.5898], [0.1650], [0.3477], [0.4062], [0.3828], [0.1787], [0.1836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.2500], [0.5547], [0.6680], [1.0000], [0.7500], [1.0000], [1.0000], [0.7500], [0.2500], [0.3340], [0.5000], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0031890869140625 loss: 0.00384521484375loss: 0.004547119140625 loss: 0.0027618408203125 39%|███▉ | 192/492 [1:42:04<2:37:07, 31.43s/it] {'loss': 0.0075, 'learning_rate': 1e-05, 'epoch': 0.39} 39%|███▉ | 192/492 [1:42:04<2:37:07, 31.43s/it]predicted value: tensor([[0.6289], [0.5820], [0.8672], [0.3027], [0.5820], [0.2832], [0.5312], [0.3027], [0.3965], [0.4746], [0.3691], 
[0.4922], [0.1748], [0.6523], [0.4863], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8320], [0.2002], [0.5547], [0.1426], [0.3750], [0.2002], [0.2500], [0.2500], [0.3340], [0.5000], [0.0400], [0.6016], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00138092041015625 loss: 0.0028533935546875loss: 0.001312255859375 loss: 0.0019989013671875 predicted value: tensor([[0.5977], [1.0469], [1.0547], [0.8672], [0.7305], [0.8086], [0.4941], [1.0391], [0.7578], [0.7305], [0.7734], [0.3184], [0.4297], [0.1455], [0.2598], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [1.0000], [1.0000], [0.8008], [0.6680], [0.7500], [0.4668], [1.0000], [0.6680], [0.7500], [0.6680], [0.3340], [0.3340], [0.0278], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019683837890625 loss: 0.00130462646484375loss: 0.002593994140625 loss: 0.0023651123046875 predicted value: tensor([[1.0547], [1.0469], [0.5859], [0.5273], [0.7734], [0.2734], [0.8359], [0.3125], [0.7695], [0.8086], [0.6523], [0.5352], [0.4746], [0.6875], [0.2695], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.4668], [0.6250], [0.1670], [0.8008], [0.2500], [0.6016], [0.4668], [0.5000], [0.4668], [0.4004], [0.7500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115203857421875 loss: 0.0038604736328125 loss: 0.00115203857421875 loss: 0.0019989013671875 predicted value: tensor([[0.4766], [0.5508], [0.5781], [1.0547], [0.3223], [1.0469], [1.0859], [0.7266], [0.7656], [0.5859], [0.7109], [0.1641], [0.5391], [0.4844], [0.2715], [0.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.4648], [1.0000], [0.2500], [1.0000], [1.0000], [0.6016], [0.6680], [0.2500], [0.6016], [0.0400], [0.4004], [0.4004], [0.2500], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.0029144287109375 loss: 0.0018157958984375 loss: 0.004180908203125loss: 0.003082275390625 39%|███▉ | 193/492 [1:42:36<2:36:56, 31.49s/it] {'loss': 0.009, 'learning_rate': 1e-05, 'epoch': 0.39} 39%|███▉ | 193/492 [1:42:36<2:36:56, 31.49s/it]predicted value: tensor([[0.5000], [0.6719], [0.4258], [0.6016], [0.4941], [0.4297], [0.5508], [0.7617], [0.5469], [0.4961], [0.7188], [0.5469], [0.5625], [0.2500], [0.3008], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.2500], [0.5547], [0.3750], [0.3340], [0.3750], [0.7500], [0.6016], [0.4668], [0.6016], [0.4668], [0.5000], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019989013671875 loss: 0.0024566650390625 loss: 0.004241943359375 loss: 0.0015716552734375 predicted value: tensor([[0.8125], [0.5039], [0.5195], [0.3867], [1.0547], [1.0312], [1.0078], [0.4219], [0.7578], [0.4160], [0.4707], [0.3867], [0.2617], [0.5352], [0.2949], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.4668], [0.2500], [1.0000], [1.0000], [1.0000], [0.2500], [0.6250], [0.2500], [0.4004], [0.2500], [0.1670], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0017242431640625 loss: 0.0023040771484375loss: 0.00139617919921875 loss: 0.001007080078125 predicted value: tensor([[1.0234], [0.5742], [0.8633], [0.4844], [0.6406], [0.7383], [0.3438], [0.6367], [0.1768], [0.4805], [0.5547], [0.4297], [0.4902], [0.2793], [0.4961], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.8320], [0.4668], [0.6016], [0.6680], [0.3340], [0.6016], [0.0625], [0.5000], [0.5000], [0.4004], [0.4004], [0.2500], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00384521484375 loss: 0.00110626220703125 loss: 0.002716064453125 loss: 0.0019989013671875 predicted value: tensor([[0.5039], [0.5156], [1.0312], [0.4980], [0.6055], [1.0547], [0.5586], 
[0.3457], [0.7031], [0.3477], [0.6797], [0.6602], [0.3926], [0.2832], [0.2695], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3340], [1.0000], [0.3750], [0.5547], [1.0000], [0.3750], [0.2500], [0.4668], [0.2500], [0.5000], [0.6016], [0.2500], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018310546875 loss: 0.004364013671875 loss: 0.0034637451171875loss: 0.002655029296875 39%|███▉ | 194/492 [1:43:07<2:36:22, 31.49s/it] {'loss': 0.0097, 'learning_rate': 1e-05, 'epoch': 0.39} 39%|███▉ | 194/492 [1:43:07<2:36:22, 31.49s/it]predicted value: tensor([[0.9258], [0.9297], [0.9023], [0.7461], [0.2656], [0.4277], [0.5586], [0.8867], [0.4160], [0.2988], [0.2256], [0.6172], [0.1992], [0.3613], [0.1865], [0.2080]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [1.0000], [0.8320], [0.2500], [0.3750], [0.6016], [1.0000], [0.4668], [0.2500], [0.1426], [0.6680], [0.2002], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00104522705078125 loss: 0.0024261474609375 loss: 0.000957489013671875 loss: 0.00174713134765625 predicted value: tensor([[0.9023], [0.4160], [0.6289], [0.2773], [0.9141], [0.6562], [0.6914], [0.2988], [0.4316], [0.2559], [0.4180], [0.3398], [0.3223], [0.2490], [0.2227], [0.1592]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.7500], [0.2500], [1.0000], [0.8008], [0.8008], [0.3340], [0.4668], [0.2500], [0.3750], [0.2500], [0.4004], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029754638671875 loss: 0.0020599365234375loss: 0.0021514892578125 loss: 0.00168609619140625 predicted value: tensor([[0.1943], [0.5742], [0.6523], [0.6875], [0.3965], [0.3887], [0.6250], [0.3340], [0.9062], [0.5039], [0.4395], [0.3223], [0.5234], [0.3125], [0.2158], [0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.8320], 
[0.5547], [0.6680], [0.4668], [0.4668], [0.8320], [0.2500], [1.0000], [0.5000], [0.5000], [0.3340], [0.6016], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024261474609375 loss: 0.0026092529296875 loss: 0.0010986328125 loss: 0.00116729736328125 predicted value: tensor([[0.5078], [0.3184], [0.4492], [0.6836], [0.7109], [0.8906], [0.2969], [0.6641], [0.3047], [0.9180], [0.4023], [0.3809], [0.3594], [0.5352], [0.1758], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.2500], [0.4668], [0.8008], [0.8008], [1.0000], [0.2500], [0.8008], [0.3340], [1.0000], [0.4004], [0.4004], [0.4004], [0.6016], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0014801025390625 loss: 0.0009765625 loss: 0.0021209716796875 loss: 0.00147247314453125 40%|███▉ | 195/492 [1:43:39<2:35:52, 31.49s/it] {'loss': 0.0071, 'learning_rate': 1e-05, 'epoch': 0.4} 40%|███▉ | 195/492 [1:43:39<2:35:52, 31.49s/it]predicted value: tensor([[0.7461], [0.4043], [0.7188], [0.7070], [0.1924], [0.7422], [0.5742], [0.3652], [0.4492], [0.3848], [0.3047], [0.3867], [0.3926], [0.4492], [0.1680], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.8008], [0.8008], [0.1670], [0.8008], [0.8008], [0.3750], [0.5000], [0.4004], [0.3340], [0.5000], [0.4004], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003265380859375 loss: 0.004730224609375 loss: 0.0016021728515625 loss: 0.000911712646484375 predicted value: tensor([[0.4336], [0.9336], [0.4219], [0.9375], [0.5117], [0.3633], [0.9453], [0.6523], [0.5039], [0.3301], [0.6562], [0.5039], [0.4316], [0.5859], [0.2197], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [1.0000], [0.5547], [0.4668], [1.0000], [0.6016], [0.6016], [0.2500], [0.8008], [0.6016], [0.5000], [0.8008], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.00189971923828125 loss: 0.00201416015625loss: 0.00125885009765625 loss: 0.0037994384765625 predicted value: tensor([[0.6953], [0.4043], [0.9102], [0.6680], [0.4961], [0.6562], [0.6367], [0.9023], [0.5234], [0.4570], [0.5625], [0.9375], [0.2227], [0.4375], [0.2061], [0.3477]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [1.0000], [0.6680], [0.5547], [0.8008], [0.4668], [1.0000], [0.5000], [0.4668], [0.6016], [1.0000], [0.2002], [0.5000], [0.1670], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000804901123046875 loss: 0.0015716552734375loss: 0.0014190673828125 loss: 0.000949859619140625 predicted value: tensor([[0.6367], [0.5195], [0.4199], [0.5000], [0.9375], [0.8867], [0.5664], [0.9336], [0.9062], [0.9258], [0.2285], [0.3789], [0.5195], [0.1206], [0.1787], [0.2168]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.5547], [0.3145], [0.5000], [1.0000], [1.0000], [0.7500], [1.0000], [1.0000], [1.0000], [0.2500], [0.4004], [0.7500], [0.0625], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010528564453125 loss: 0.0026092529296875 loss: 0.000675201416015625 loss: 0.0012054443359375 40%|███▉ | 196/492 [1:44:10<2:35:53, 31.60s/it] {'loss': 0.0074, 'learning_rate': 1e-05, 'epoch': 0.4} 40%|███▉ | 196/492 [1:44:10<2:35:53, 31.60s/it]predicted value: tensor([[0.8828], [1.0469], [0.5000], [0.5938], [0.2949], [0.6953], [0.4590], [0.4863], [0.3320], [0.5117], [0.7227], [1.0391], [0.4414], [0.5078], [0.2891], [0.3086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [0.4668], [0.5547], [0.2002], [0.8008], [0.5000], [0.4668], [0.2002], [0.4668], [0.7500], [1.0000], [0.4004], [0.6680], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021514892578125 loss: 0.005035400390625loss: 0.00146484375 loss: 0.004241943359375 predicted value: tensor([[0.3574], [0.5977], [0.5391], [0.5938], [0.5195], [0.3867], 
[0.4824], [0.5781], [0.7461], [0.7500], [0.7695], [1.0781], [0.6953], [0.4883], [0.2812], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.5547], [0.4668], [0.4668], [0.3750], [0.2500], [0.3750], [0.6680], [0.8008], [0.8008], [0.6016], [1.0000], [0.6016], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002227783203125 loss: 0.0020599365234375 loss: 0.00244140625 loss: 0.0024871826171875 predicted value: tensor([[0.8516], [1.0234], [1.0469], [1.0391], [0.3750], [0.4629], [0.7930], [0.7383], [0.8203], [0.5898], [0.4453], [0.5156], [0.4766], [0.5312], [0.4961], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [1.0000], [1.0000], [0.2500], [0.3750], [0.6680], [0.8008], [0.8008], [0.5547], [0.4668], [0.4004], [0.5000], [0.5000], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00136566162109375 loss: 0.001251220703125loss: 0.00127410888671875 loss: 0.0029296875 predicted value: tensor([[0.5234], [1.0547], [0.8086], [0.7422], [0.8281], [0.3809], [0.3945], [1.0000], [0.6094], [0.3574], [0.5820], [0.5820], [0.3516], [0.7031], [0.3125], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8320], [0.3750], [0.8008], [0.2500], [0.2500], [1.0000], [0.4277], [0.3340], [0.6016], [0.4668], [0.2500], [0.7500], [0.2500], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00130462646484375 loss: 0.001983642578125 loss: 0.003997802734375loss: 0.0018768310546875 40%|████ | 197/492 [1:44:42<2:35:04, 31.54s/it] {'loss': 0.0095, 'learning_rate': 1e-05, 'epoch': 0.4} 40%|████ | 197/492 [1:44:42<2:35:04, 31.54s/it]predicted value: tensor([[0.5078], [0.5430], [0.7656], [0.7891], [0.4551], [0.8047], [0.4082], [0.6484], [0.6875], [0.6484], [1.0781], [0.4668], [0.5195], [0.3145], [0.2559], [0.1104]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], 
[0.8008], [0.8008], [0.4668], [0.8008], [0.5000], [0.5000], [0.6016], [0.3750], [1.0000], [0.4004], [0.4004], [0.2002], [0.1670], [0.0400]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003509521484375 loss: 0.002685546875loss: 0.00299072265625 loss: 0.00156402587890625 predicted value: tensor([[0.4805], [0.8984], [0.8242], [1.0781], [0.8242], [1.0859], [0.4141], [0.6523], [0.3027], [0.4746], [0.5000], [0.4883], [0.5078], [0.4707], [0.3105], [0.3105]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.8008], [1.0000], [0.7148], [1.0000], [0.3340], [0.5000], [0.2500], [0.4004], [0.3340], [0.4004], [0.4004], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015716552734375 loss: 0.0025634765625 loss: 0.0015411376953125 loss: 0.0023040771484375 predicted value: tensor([[0.7969], [0.3711], [0.8633], [0.4590], [0.5117], [0.5195], [0.7578], [0.2930], [1.0469], [0.3535], [0.6758], [0.4648], [0.3887], [0.1299], [0.2930], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3340], [0.8320], [0.3145], [0.4668], [0.6680], [0.7500], [0.2002], [1.0000], [0.3340], [0.6016], [0.4004], [0.3340], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002044677734375 loss: 0.0018310546875 loss: 0.0015106201171875 loss: 0.0022735595703125 predicted value: tensor([[0.5000], [0.4746], [1.0703], [0.4805], [1.0781], [0.8281], [0.3418], [0.3652], [1.0625], [0.6055], [0.5859], [0.4258], [0.4395], [0.2949], [0.2520], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [1.0000], [0.3750], [1.0000], [0.6680], [0.2002], [0.2500], [1.0000], [0.6016], [0.5000], [0.4004], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00138092041015625 loss: 0.0022430419921875 loss: 0.0020294189453125loss: 0.00164794921875 40%|████ | 198/492 [1:45:13<2:34:30, 31.53s/it] {'loss': 0.0084, 
'learning_rate': 1e-05, 'epoch': 0.4} 40%|████ | 198/492 [1:45:13<2:34:30, 31.53s/it]predicted value: tensor([[0.4902], [0.5039], [0.5547], [0.3711], [0.6016], [1.0234], [0.5117], [0.6328], [0.4902], [0.2520], [0.4277], [0.4258], [0.3867], [0.1738], [0.1641], [0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.5547], [0.8008], [0.4668], [0.7500], [1.0000], [0.6016], [0.6680], [0.6016], [0.2500], [0.3750], [0.4004], [0.4004], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000576019287109375 loss: 0.0021514892578125loss: 0.00146484375 loss: 0.004058837890625 predicted value: tensor([[0.5117], [0.3652], [0.5586], [0.6992], [0.2383], [0.3672], [0.9844], [0.3340], [0.4805], [0.4238], [0.5820], [0.2412], [0.1875], [0.4199], [0.1748], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.5547], [0.8008], [0.3340], [0.2500], [1.0000], [0.2500], [0.5000], [0.4668], [0.6016], [0.5000], [0.5000], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00159454345703125 loss: 0.0035400390625loss: 0.001190185546875 loss: 0.002685546875 predicted value: tensor([[0.3457], [1.0156], [0.4863], [0.6172], [0.9766], [0.5938], [0.7500], [0.2412], [0.3398], [0.5273], [0.5391], [0.5117], [0.5117], [0.3711], [0.1631], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.5547], [0.5547], [1.0000], [0.6016], [0.8008], [0.2500], [0.3750], [0.6016], [0.7500], [0.6016], [0.6016], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002716064453125 loss: 0.0016021728515625loss: 0.00162506103515625 loss: 0.0025177001953125 predicted value: tensor([[0.7070], [1.0000], [0.8047], [0.6289], [1.0078], [0.6523], [0.9922], [0.5352], [1.0000], [0.3945], [0.6445], [0.3145], [0.3691], [0.3340], [0.2598], [0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.8008], [1.0000], [0.8320], [0.8008], [1.0000], [0.8320], [1.0000], [0.8320], [1.0000], [0.4668], [0.7500], [0.3340], [0.4004], [0.3340], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00408935546875 loss: 0.000637054443359375 loss: 0.003631591796875 loss: 0.003082275390625 40%|████ | 199/492 [1:45:45<2:33:51, 31.51s/it] {'loss': 0.0093, 'learning_rate': 1e-05, 'epoch': 0.4} 40%|████ | 199/492 [1:45:45<2:33:51, 31.51s/it]predicted value: tensor([[0.6016], [0.7734], [0.6836], [1.0156], [0.3652], [0.5781], [0.6641], [0.2246], [0.5898], [0.5625], [0.4004], [0.4316], [0.6406], [0.1504], [0.1758], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.8320], [0.8008], [1.0000], [0.4668], [0.3750], [0.6680], [0.2500], [0.7500], [0.5000], [0.3340], [0.4004], [0.6016], [0.2002], [0.1426], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00164031982421875 loss: 0.002777099609375loss: 0.0010223388671875 loss: 0.00390625 predicted value: tensor([[0.4102], [0.4043], [0.3691], [0.1787], [0.4980], [0.2139], [0.2197], [0.2197], [0.1963], [1.0000], [0.6055], [0.4492], [0.3086], [0.0320], [0.3711], [0.2070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.2500], [0.5547], [0.2002], [0.2500], [0.2500], [0.2500], [1.0000], [0.7500], [0.6016], [0.3340], [0.0625], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018463134765625 loss: 0.001190185546875loss: 0.00152587890625 loss: 0.00102996826171875 predicted value: tensor([[0.5430], [1.0234], [0.2363], [0.7070], [0.6992], [0.3770], [0.3262], [1.0000], [0.4082], [0.6094], [0.2930], [0.3887], [0.3359], [0.1865], [0.1758], [0.1660]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3340], [0.8008], [0.8008], [0.4668], [0.4668], [1.0000], [0.4004], [0.8008], [0.7500], [0.5000], [0.5000], [0.2500], [0.2002], [0.2002]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.00089263916015625 loss: 0.00543212890625loss: 0.0023345947265625 loss: 0.001312255859375 predicted value: tensor([[0.3750], [0.5273], [0.4199], [1.0156], [0.2012], [1.0000], [0.7383], [1.0000], [0.2715], [0.3828], [0.2676], [0.3145], [0.3281], [0.2256], [0.2295], [0.1807]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.4668], [1.0000], [0.3340], [1.0000], [0.8008], [1.0000], [0.3340], [0.4668], [0.2500], [0.4004], [0.2852], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030517578125 loss: 0.003021240234375 loss: 0.00086212158203125loss: 0.002166748046875 41%|████ | 200/492 [1:46:16<2:33:23, 31.52s/it] {'loss': 0.0085, 'learning_rate': 1e-05, 'epoch': 0.41} 41%|████ | 200/492 [1:46:16<2:33:23, 31.52s/it]Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41. Non-default generation parameters: {'max_length': 4096} /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn( /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. 
warnings.warn( /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn( predicted value: tensor([[0.9297], [0.8086], [1.0938], [0.5195], [0.8516], [0.8945], [0.4727], [0.8086], [0.3750], [0.6445], [0.5078], [0.7070], [0.4844], [0.2539], [0.2617], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.8008], [1.0000], [0.4668], [0.8008], [0.8008], [0.4668], [0.8008], [0.2002], [0.6016], [0.4004], [0.6016], [0.4004], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002960205078125 loss: 0.00174713134765625loss: 0.0017242431640625 loss: 0.00138092041015625 predicted value: tensor([[1.1094], [0.4668], [0.8047], [0.7383], [0.4609], [0.3574], [0.6758], [0.7969], [0.4414], [0.8008], [0.2930], [0.6484], [0.4844], [0.2275], [0.2930], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [0.7500], [0.4668], [0.3340], [0.8008], [0.8008], [0.1670], [0.8008], [0.3340], [0.7500], [0.3340], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020751953125 loss: 0.00225830078125 loss: 0.00274658203125 loss: 0.001312255859375 predicted value: tensor([[0.4609], [0.4883], [0.5977], [0.7031], [0.8906], [0.7305], [1.0859], [1.0938], [0.2617], [1.0781], [0.6875], [0.4648], [0.5625], [0.4512], [0.2422], [0.2393]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4004], [0.4668], [0.8008], [0.6016], [1.0000], [1.0000], [0.2002], [1.0000], [0.7500], [0.4004], [0.5000], [0.3340], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027618408203125loss: 0.00439453125 loss: 0.0012969970703125 loss: 0.001800537109375 predicted value: tensor([[0.2773], [0.8125], [0.3105], [0.3379], [0.6133], [0.4941], [0.6211], [0.3047], [0.4785], [0.6797], [0.6797], [0.5430], 
[0.5312], [0.5000], [0.2656], [0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.5547], [0.2500], [0.2500], [0.5547], [0.3750], [0.4668], [0.3340], [0.3340], [0.4668], [0.6016], [0.5000], [0.6680], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005126953125 loss: 0.00191497802734375 loss: 0.0034942626953125 loss: 0.0020599365234375 41%|████ | 201/492 [1:48:50<5:30:13, 68.09s/it] {'loss': 0.0098, 'learning_rate': 1e-05, 'epoch': 0.41} 41%|████ | 201/492 [1:48:50<5:30:13, 68.09s/it]predicted value: tensor([[1.0547], [0.8164], [0.8320], [0.7812], [0.5078], [1.0469], [1.0859], [0.5234], [0.5117], [0.8242], [0.4551], [0.5898], [0.1885], [0.5625], [0.2256], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.8320], [0.6680], [0.3750], [1.0000], [1.0000], [0.3750], [0.5000], [0.7500], [0.3340], [0.6016], [0.0400], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024261474609375 loss: 0.0017547607421875 loss: 0.0037384033203125 loss: 0.001739501953125 predicted value: tensor([[0.3535], [0.7852], [0.5352], [0.3066], [0.3262], [1.0469], [1.0391], [0.6211], [0.6523], [0.5820], [0.4668], [0.4668], [0.4395], [0.2227], [0.2559], [0.2441]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.8008], [0.4668], [0.3340], [0.3340], [1.0000], [1.0000], [0.7500], [0.7500], [0.4668], [0.4668], [0.4004], [0.4004], [0.1670], [0.2500], [0.1250]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00171661376953125 loss: 0.00113677978515625 loss: 0.00153350830078125 loss: 0.0031585693359375 predicted value: tensor([[0.6562], [1.0703], [1.0781], [0.6992], [1.0859], [0.7656], [0.5078], [0.2412], [0.7344], [0.7852], [0.4258], [0.4453], [0.4277], [0.6055], [0.2695], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.8008], [1.0000], [0.6680], [0.3750], 
[0.2500], [0.7500], [0.6680], [0.4004], [0.3340], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003204345703125 loss: 0.00180816650390625loss: 0.00174713134765625 loss: 0.000701904296875 predicted value: tensor([[0.3066], [0.5312], [0.4258], [0.2734], [0.8438], [0.5938], [0.7812], [1.0781], [0.3281], [1.0391], [0.6328], [0.3672], [0.3301], [0.4004], [0.2393], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.3750], [0.3750], [0.2002], [0.8008], [0.5000], [0.6680], [1.0000], [0.2002], [1.0000], [0.6016], [0.2002], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00122833251953125 loss: 0.00118255615234375 loss: 0.002593994140625 loss: 0.0020904541015625 41%|████ | 202/492 [1:49:22<4:37:44, 57.46s/it] {'loss': 0.0079, 'learning_rate': 1e-05, 'epoch': 0.41} 41%|████ | 202/492 [1:49:22<4:37:44, 57.46s/it]predicted value: tensor([[0.5391], [0.7852], [0.5273], [0.9492], [0.4844], [0.9531], [0.6523], [0.9648], [0.3867], [0.5195], [0.2891], [0.9688], [0.4727], [0.3574], [0.1611], [0.1543]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.5547], [1.0000], [0.3340], [1.0000], [0.6680], [1.0000], [0.2500], [0.6016], [0.2500], [1.0000], [0.5000], [0.3340], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004180908203125 loss: 0.00095367431640625 loss: 0.00110626220703125 loss: 0.00156402587890625 predicted value: tensor([[0.5781], [0.7656], [0.2578], [0.4258], [0.3496], [0.9727], [0.5664], [0.4414], [0.3672], [0.7539], [0.5586], [0.3398], [0.3750], [0.1475], [0.1748], [0.1787]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.2500], [0.4668], [0.4668], [1.0000], [0.6016], [0.6016], [0.3750], [0.8008], [0.6016], [0.3340], [0.4004], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000957489013671875loss: 
0.000728607177734375 loss: 0.0013580322265625 loss: 0.00125885009765625 predicted value: tensor([[0.5391], [0.3848], [0.2578], [0.4199], [0.6602], [0.9688], [0.6328], [0.4043], [0.3535], [0.7539], [0.6875], [0.3789], [0.3672], [0.3438], [0.1846], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3340], [0.4668], [0.8008], [1.0000], [0.6016], [0.4668], [0.3340], [0.3750], [0.7500], [0.3340], [0.4004], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000835418701171875 loss: 0.00156402587890625 loss: 0.0031585693359375 loss: 0.00225830078125 predicted value: tensor([[0.9609], [0.3672], [0.5312], [0.2158], [0.7891], [0.6211], [1.0000], [0.6641], [0.5273], [0.3867], [0.3848], [0.2217], [0.3691], [0.2891], [0.1680], [0.2070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.5547], [0.2500], [0.8320], [0.6680], [1.0000], [0.8008], [0.6016], [0.4668], [0.4004], [0.2500], [0.4004], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00110626220703125 loss: 0.00125885009765625 loss: 0.00177764892578125 loss: 0.00074005126953125 41%|████▏ | 203/492 [1:49:56<4:01:48, 50.20s/it] {'loss': 0.0062, 'learning_rate': 1e-05, 'epoch': 0.41} 41%|████▏ | 203/492 [1:49:56<4:01:48, 50.20s/it]predicted value: tensor([[0.3867], [0.3926], [0.7656], [0.3379], [0.3691], [0.5234], [0.6406], [0.4062], [0.9805], [0.9609], [0.5977], [0.4141], [0.3555], [0.4180], [0.1729], [0.1338]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.8008], [0.3750], [0.4668], [0.6016], [0.7500], [0.4668], [1.0000], [1.0000], [0.7500], [0.5000], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002349853515625 loss: 0.00075531005859375 loss: 0.0013427734375 loss: 0.00167083740234375 predicted value: tensor([[0.3984], [0.3867], [0.2041], [0.8242], [0.9609], [0.9492], [0.5938], 
[0.6602], [0.9609], [0.9648], [0.6875], [0.1177], [0.4277], [0.1719], [0.1846], [0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.2002], [0.8320], [1.0000], [1.0000], [0.7148], [0.7500], [1.0000], [1.0000], [0.8008], [0.0625], [0.5000], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00159454345703125 loss: 0.00102996826171875 loss: 0.0023193359375 loss: 0.00167083740234375 predicted value: tensor([[0.8086], [0.9492], [0.3789], [0.4922], [0.2295], [0.3457], [0.4863], [0.7383], [0.4648], [0.2432], [0.4941], [0.1030], [0.4375], [0.1621], [0.3047], [0.1533]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.4668], [0.7500], [0.2500], [0.4668], [0.4668], [0.6680], [0.3750], [0.2500], [0.5000], [0.0625], [0.5000], [0.2002], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002044677734375loss: 0.00146484375 loss: 0.00151824951171875 loss: 0.000507354736328125 predicted value: tensor([[0.9453], [0.3574], [0.4238], [0.2715], [0.9844], [0.9648], [0.1992], [0.2656], [0.3418], [1.0078], [0.6992], [0.6797], [0.4570], [0.3125], [0.1514], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.2500], [1.0000], [1.0000], [0.2002], [0.3340], [0.3340], [1.0000], [0.8008], [0.7500], [0.4004], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00075531005859375loss: 0.00150299072265625 loss: 0.000446319580078125 loss: 0.0015411376953125 41%|████▏ | 204/492 [1:50:29<3:36:14, 45.05s/it] {'loss': 0.0056, 'learning_rate': 1e-05, 'epoch': 0.41} 41%|████▏ | 204/492 [1:50:29<3:36:14, 45.05s/it]predicted value: tensor([[1.0312], [0.8945], [0.5508], [1.0625], [0.5273], [1.0781], [0.4805], [0.4883], [0.6719], [0.7422], [0.7344], [0.4883], [0.4336], [0.2910], [0.4453], [0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], 
[0.8320], [0.4668], [1.0000], [0.6680], [1.0000], [0.4004], [0.4668], [0.7500], [0.6016], [0.7500], [0.4004], [0.2852], [0.2500], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00146484375 loss: 0.0019989013671875loss: 0.001983642578125 loss: 0.00159454345703125 predicted value: tensor([[1.0859], [0.4707], [0.6367], [0.4844], [1.0469], [0.6602], [0.8164], [0.3301], [0.3066], [1.0234], [0.6055], [0.5000], [0.1514], [0.4883], [0.2295], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.4668], [0.4668], [1.0000], [0.6016], [0.8008], [0.2500], [0.2500], [1.0000], [0.4668], [0.3340], [0.0400], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000812530517578125 loss: 0.0008544921875 loss: 0.00213623046875 loss: 0.0016021728515625 predicted value: tensor([[1.0859], [0.5586], [0.3574], [0.5156], [1.0625], [1.0625], [0.6797], [1.0781], [0.8242], [1.0547], [0.3145], [0.8203], [0.6250], [0.5508], [0.2422], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3340], [0.4668], [1.0000], [1.0000], [0.8008], [1.0000], [0.8008], [1.0000], [0.2500], [0.8008], [0.5000], [0.8008], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021820068359375loss: 0.0021820068359375 loss: 0.0035858154296875 loss: 0.00148773193359375 predicted value: tensor([[0.5898], [0.5156], [0.4844], [0.3535], [0.5547], [0.8594], [0.5078], [0.2578], [0.8516], [1.0625], [0.6289], [0.5781], [0.3262], [0.5469], [0.2559], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.3750], [0.3750], [0.2002], [0.5547], [0.8320], [0.4668], [0.2500], [0.8008], [1.0000], [0.5547], [0.5000], [0.2500], [0.5000], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003173828125 loss: 0.00154876708984375 loss: 0.0027923583984375 loss: 0.001800537109375 42%|████▏ | 205/492 [1:51:02<3:18:25, 41.48s/it] 
{'loss': 0.0078, 'learning_rate': 1e-05, 'epoch': 0.42} 42%|████▏ | 205/492 [1:51:02<3:18:25, 41.48s/it]predicted value: tensor([[0.7070], [1.0547], [0.2695], [1.0547], [0.4570], [1.0312], [0.5195], [0.4727], [0.7109], [0.5391], [0.7305], [0.5898], [0.4570], [0.2715], [0.2637], [0.2451]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [1.0000], [0.2002], [1.0000], [0.3750], [1.0000], [0.3750], [0.4668], [0.6016], [0.3750], [0.7500], [0.5000], [0.3340], [0.1670], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020599365234375 loss: 0.0019683837890625loss: 0.00148773193359375 loss: 0.000698089599609375 predicted value: tensor([[0.8633], [0.4941], [1.0625], [0.7070], [0.8203], [0.5703], [0.4746], [0.3145], [0.2793], [0.5703], [0.2656], [0.5391], [0.5898], [0.2734], [0.3184], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3750], [1.0000], [0.7500], [0.8008], [0.4668], [0.3750], [0.2500], [0.2002], [0.5000], [0.2002], [0.5000], [0.6680], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001312255859375 loss: 0.00140380859375 loss: 0.0009307861328125 loss: 0.0020294189453125 predicted value: tensor([[0.6367], [0.7227], [0.7578], [0.2812], [1.0391], [0.5195], [0.7852], [0.2773], [1.0703], [0.7812], [0.8281], [0.4570], [0.5039], [0.6211], [0.2578], [0.2598]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.6680], [0.2002], [1.0000], [0.4668], [0.7500], [0.2500], [1.0000], [0.8008], [0.6680], [0.5000], [0.4004], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003997802734375 loss: 0.0023651123046875 loss: 0.00164794921875 loss: 0.00109100341796875 predicted value: tensor([[0.3438], [0.4727], [1.0859], [0.4492], [0.4766], [0.3730], [0.8203], [0.6172], [0.6172], [0.6406], [0.7383], [0.4336], [0.2314], [0.2432], [0.2754], [0.2656]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) value label: tensor([[0.2500], [0.3750], [1.0000], [0.3750], [0.3145], [0.3340], [0.6680], [0.6680], [0.6016], [0.6016], [0.7500], [0.4004], [0.1670], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016326904296875 loss: 0.0029296875 loss: 0.001556396484375 loss: 0.00128173828125 42%|████▏ | 206/492 [1:51:35<3:05:59, 39.02s/it] {'loss': 0.0071, 'learning_rate': 1e-05, 'epoch': 0.42} 42%|████▏ | 206/492 [1:51:35<3:05:59, 39.02s/it]predicted value: tensor([[0.5625], [0.4375], [0.9648], [0.1484], [0.4062], [0.6172], [0.5430], [0.9492], [0.4082], [0.6641], [0.2500], [0.4785], [0.9375], [0.1230], [0.1982], [0.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.3750], [1.0000], [0.2002], [0.4668], [0.6680], [0.5547], [1.0000], [0.4668], [0.8008], [0.2500], [0.4004], [1.0000], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000888824462890625 loss: 0.00113677978515625 loss: 0.000598907470703125 loss: 0.00055694580078125 predicted value: tensor([[0.4863], [0.2061], [0.1904], [0.4766], [0.9688], [0.4727], [0.6719], [0.6836], [0.7422], [0.9570], [0.3730], [0.4141], [0.3730], [0.1709], [0.1582], [0.1348]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [0.2500], [0.3750], [1.0000], [0.8008], [0.7500], [0.5703], [0.8008], [1.0000], [0.7500], [0.4004], [0.4004], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009307861328125 loss: 0.00131988525390625 loss: 0.005035400390625 loss: 0.0004062652587890625 predicted value: tensor([[0.9766], [0.6797], [0.4180], [0.8203], [0.9414], [0.3945], [0.6055], [0.3594], [0.5977], [0.7188], [0.1689], [0.3770], [1.0000], [0.1621], [0.1484], [0.1836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.3750], [0.8320], [1.0000], [0.4668], [0.6016], [0.3750], [0.6016], [0.6680], [0.2500], [0.4004], [1.0000], [0.2002], 
[0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00069427490234375 loss: 0.001129150390625 loss: 0.0021820068359375 loss: 0.00061798095703125 predicted value: tensor([[0.9531], [0.3984], [0.2070], [0.3613], [0.9648], [0.9648], [0.5547], [0.9648], [0.5117], [0.6055], [0.5430], [0.4902], [0.4492], [0.1611], [0.1572], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.2500], [0.3145], [1.0000], [1.0000], [0.5547], [1.0000], [0.7500], [0.7500], [0.6016], [0.4004], [0.5000], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000560760498046875 loss: 0.0008087158203125 loss: 0.002716064453125 loss: 0.0017852783203125 42%|████▏ | 207/492 [1:52:08<2:56:18, 37.12s/it] {'loss': 0.0053, 'learning_rate': 1e-05, 'epoch': 0.42} 42%|████▏ | 207/492 [1:52:08<2:56:18, 37.12s/it]predicted value: tensor([[0.4746], [0.3945], [0.9727], [0.5195], [1.0078], [0.6445], [0.1934], [0.7383], [0.9648], [0.9492], [0.5430], [0.3145], [0.1768], [0.1484], [0.1572], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6797], [0.4668], [1.0000], [0.5547], [1.0000], [0.7148], [0.3340], [0.8008], [1.0000], [1.0000], [0.5000], [0.3340], [0.2002], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00144195556640625 loss: 0.00145721435546875loss: 0.00145721435546875 loss: 0.00070953369140625 predicted value: tensor([[ 0.3926], [ 0.8203], [ 0.8398], [ 0.4043], [ 0.7461], [ 0.3691], [ 0.5078], [ 0.9688], [ 0.4395], [-0.0488], [ 0.9727], [ 0.9453], [ 0.4746], [ 0.4531], [ 0.1592], [ 0.1621]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.8555], [0.4668], [0.8320], [0.4668], [0.8008], [1.0000], [0.8008], [0.0278], [1.0000], [1.0000], [0.5000], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000583648681640625 loss: 0.00147247314453125 loss: 0.004058837890625 loss: 
0.0062255859375 predicted value: tensor([[0.7617], [0.7891], [0.7930], [0.7695], [0.2344], [0.6680], [0.6914], [0.9570], [0.9688], [0.7656], [0.7812], [0.5508], [0.2012], [0.3223], [0.4688], [0.1299]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.8555], [0.8320], [0.8008], [0.2500], [0.6016], [0.8008], [1.0000], [1.0000], [0.8008], [0.8008], [0.7500], [0.2002], [0.4004], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020904541015625 loss: 0.00122833251953125loss: 0.00421142578125 loss: 0.000499725341796875 predicted value: tensor([[0.3711], [0.6211], [0.3574], [0.2002], [0.9492], [0.3555], [0.5703], [0.4062], [0.5977], [0.9883], [0.5703], [0.4199], [0.2090], [0.4062], [0.4082], [0.1748]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.3750], [0.2500], [1.0000], [0.3340], [0.6016], [0.4668], [0.6016], [1.0000], [0.6016], [0.4668], [0.2500], [0.3340], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00122833251953125 loss: 0.0035247802734375 loss: 0.000537872314453125 loss: 0.0034942626953125 42%|████▏ | 208/492 [1:52:41<2:50:39, 36.05s/it] {'loss': 0.0086, 'learning_rate': 1e-05, 'epoch': 0.42} 42%|████▏ | 208/492 [1:52:41<2:50:39, 36.05s/it]predicted value: tensor([[0.9141], [0.3633], [0.7031], [0.1226], [0.7188], [0.2812], [0.6484], [0.6328], [0.6289], [0.6992], [0.5625], [0.4297], [0.3086], [0.4180], [0.5859], [0.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3340], [0.6680], [0.0625], [0.7500], [0.2500], [0.6016], [0.5000], [0.5000], [0.7500], [0.5000], [0.4004], [0.2500], [0.2852], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00075531005859375 loss: 0.00130462646484375 loss: 0.004547119140625 loss: 0.0014495849609375 predicted value: tensor([[0.6133], [0.5039], [0.4961], [0.3496], [0.7070], [0.8320], [0.2598], [0.5430], [0.3867], [0.3691], [0.4336], [0.5664], [0.7852], 
[0.2852], [0.2676], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.3340], [0.5000], [0.6680], [0.3340], [0.7500], [0.2002], [0.4004], [0.4004], [0.4277], [0.7500], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030059814453125loss: 0.00170135498046875 loss: 0.002349853515625 loss: 0.00341796875 predicted value: tensor([[0.5273], [0.5195], [0.3359], [0.4922], [0.2559], [0.3711], [0.8711], [0.7344], [0.4629], [1.0547], [1.0391], [0.4824], [0.2793], [0.2598], [0.4902], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.2500], [0.4668], [0.2002], [0.2500], [0.8320], [0.6680], [0.3750], [1.0000], [1.0000], [0.4004], [0.2500], [0.2002], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004547119140625 loss: 0.0015716552734375 loss: 0.00115966796875 loss: 0.00186920166015625 predicted value: tensor([[0.8359], [0.6055], [0.5195], [1.0547], [0.3535], [0.5898], [0.5156], [0.4941], [0.6953], [0.5703], [1.0156], [0.5039], [0.4668], [0.5000], [0.2539], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.5547], [0.4668], [1.0000], [0.3340], [0.5547], [0.4668], [0.3145], [0.6016], [0.6680], [1.0000], [0.4004], [0.4004], [0.0625], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00138092041015625 loss: 0.0042724609375 loss: 0.00189208984375 loss: 0.002044677734375 42%|████▏ | 209/492 [1:53:14<2:45:46, 35.15s/it] {'loss': 0.0093, 'learning_rate': 1e-05, 'epoch': 0.42} 42%|████▏ | 209/492 [1:53:14<2:45:46, 35.15s/it]predicted value: tensor([[0.4863], [1.0703], [0.5781], [0.4531], [0.8164], [0.3789], [0.6523], [0.6367], [1.0703], [0.5195], [0.5430], [0.6602], [0.5703], [0.2773], [0.4434], [0.3223]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.7500], [0.3750], [0.8008], [0.2002], [0.6016], [0.6016], [1.0000], 
[0.3750], [0.5000], [0.6016], [0.2500], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00154876708984375 loss: 0.0023956298828125 loss: 0.004302978515625 loss: 0.0016326904296875 predicted value: tensor([[1.0781], [1.0781], [0.8203], [1.0781], [0.5820], [0.8438], [0.5469], [0.5234], [0.4121], [0.5430], [0.4395], [0.4707], [0.4453], [0.7070], [0.5273], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.7148], [1.0000], [0.4668], [0.8008], [0.4668], [0.4004], [0.3340], [0.4004], [0.4004], [0.3340], [0.3340], [0.6016], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021514892578125loss: 0.000762939453125 loss: 0.00182342529296875 loss: 0.0015411376953125 predicted value: tensor([[0.7812], [1.0156], [0.6797], [0.8203], [1.0391], [0.4629], [0.8398], [1.0391], [0.4355], [0.6367], [0.2988], [0.3809], [0.8086], [0.5117], [0.2422], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [1.0000], [0.5547], [0.8008], [1.0000], [0.4668], [0.8008], [1.0000], [0.4668], [0.7500], [0.2500], [0.2500], [0.6680], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00151824951171875 loss: 0.001708984375 loss: 0.001068115234375 loss: 0.00102996826171875 predicted value: tensor([[0.5312], [0.5547], [0.3652], [0.6367], [0.9219], [1.0391], [0.3320], [0.5312], [0.5469], [0.3086], [0.5117], [0.7344], [0.4590], [0.5312], [0.2451], [0.2480]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.2500], [0.5547], [0.8008], [1.0000], [0.3340], [0.3750], [0.3750], [0.2002], [0.5000], [0.5000], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00153350830078125 loss: 0.00141143798828125 loss: 0.004791259765625 loss: 0.00140380859375 43%|████▎ | 210/492 [1:53:47<2:41:38, 34.39s/it] {'loss': 0.0077, 'learning_rate': 1e-05, 'epoch': 0.43} 43%|████▎ | 210/492 
[1:53:47<2:41:38, 34.39s/it]predicted value: tensor([[0.7383], [0.7617], [0.4004], [0.4395], [0.5938], [0.3848], [0.9297], [0.0242], [0.4551], [0.3457], [0.5234], [0.3184], [0.1689], [0.4121], [0.1846], [0.1299]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.8320], [0.3750], [0.4668], [0.6016], [0.4668], [1.0000], [0.0400], [0.5000], [0.3340], [0.6016], [0.4004], [0.2002], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000576019287109375 loss: 0.000732421875loss: 0.001983642578125 loss: 0.0013580322265625 predicted value: tensor([[0.4219], [0.4277], [0.7305], [0.4785], [0.1855], [0.5781], [0.2217], [0.6328], [0.9727], [0.4688], [0.2695], [0.3223], [0.4141], [0.4160], [0.1982], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.6680], [0.4668], [0.2500], [0.6016], [0.2500], [0.6016], [1.0000], [0.5000], [0.3340], [0.3340], [0.5000], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00048828125loss: 0.0036773681640625 loss: 0.000667572021484375 loss: 0.00118255615234375 predicted value: tensor([[0.5234], [0.2266], [0.6680], [0.7188], [0.1621], [0.5547], [0.3613], [0.5820], [0.6211], [0.3340], [0.5547], [0.4609], [0.3555], [0.3730], [0.1553], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.6680], [0.8008], [0.1670], [0.6016], [0.4668], [0.6016], [0.6016], [0.4004], [0.5000], [0.4004], [0.4004], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000579833984375 loss: 0.0013885498046875 loss: 0.000759124755859375 loss: 0.00140380859375 predicted value: tensor([[0.4785], [0.3789], [0.5078], [0.2412], [0.9883], [0.4668], [0.5977], [1.0078], [0.9609], [0.6367], [0.3945], [0.4863], [0.3379], [0.4062], [0.3496], [0.1611]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.3750], [0.4668], [0.2500], [1.0000], 
[0.5000], [0.5000], [1.0000], [1.0000], [0.7500], [0.5000], [0.4668], [0.4004], [0.5000], [0.4004], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000537872314453125 loss: 0.000965118408203125 loss: 0.0010833740234375 loss: 0.00084686279296875 43%|████▎ | 211/492 [1:54:20<2:39:10, 33.99s/it] {'loss': 0.0046, 'learning_rate': 1e-05, 'epoch': 0.43} 43%|████▎ | 211/492 [1:54:20<2:39:10, 33.99s/it]predicted value: tensor([[0.5234], [0.9844], [0.9727], [0.3105], [0.7109], [0.3906], [0.6133], [0.9883], [0.4141], [0.5508], [0.5469], [0.9766], [0.2109], [0.2715], [0.4355], [0.2021]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.3340], [0.6680], [0.3750], [0.6016], [1.0000], [0.4668], [0.7500], [0.5000], [1.0000], [0.2002], [0.3340], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019683837890625 loss: 0.00347900390625 loss: 0.001007080078125 loss: 0.000911712646484375 predicted value: tensor([[0.9570], [0.5234], [0.6875], [0.6211], [0.4043], [0.4258], [0.7227], [0.4805], [0.7305], [0.5273], [0.0376], [0.3477], [0.5820], [0.9844], [0.1582], [0.1836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.8008], [0.5000], [0.4668], [0.3750], [0.8320], [0.6016], [0.8008], [0.5547], [0.0625], [0.4004], [0.5000], [1.0000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000576019287109375 loss: 0.001373291015625loss: 0.00121307373046875 loss: 0.000713348388671875 predicted value: tensor([[0.3965], [0.6992], [0.5117], [0.4199], [0.4922], [0.2432], [0.2988], [0.5938], [0.9766], [0.6133], [0.6523], [0.5430], [0.3379], [0.1973], [0.1533], [0.1680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.8008], [0.4668], [0.3750], [0.2500], [0.3340], [0.5000], [1.0000], [0.5000], [0.7500], [0.6016], [0.3340], [0.2500], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.0023345947265625 loss: 0.00106048583984375 loss: 0.00093841552734375 loss: 0.000644683837890625 predicted value: tensor([[0.6875], [0.3613], [0.9688], [0.4297], [0.3418], [0.6758], [0.3008], [0.5508], [0.3789], [0.5898], [0.5586], [0.5977], [0.6562], [0.6211], [0.3730], [0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5000], [1.0000], [0.4668], [0.2500], [0.8008], [0.2500], [0.6016], [0.3750], [0.6016], [0.6016], [0.5000], [0.6016], [0.4277], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019073486328125loss: 0.001068115234375 loss: 0.000823974609375 loss: 0.005126953125 43%|████▎ | 212/492 [1:54:53<2:36:42, 33.58s/it] {'loss': 0.0063, 'learning_rate': 1e-05, 'epoch': 0.43} 43%|████▎ | 212/492 [1:54:53<2:36:42, 33.58s/it]predicted value: tensor([[0.5742], [1.0781], [0.6914], [1.0938], [0.7812], [0.7891], [1.0547], [0.3164], [0.4238], [0.7656], [0.4512], [0.4824], [0.4629], [0.2734], [0.2637], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.7500], [1.0000], [0.7500], [0.6680], [1.0000], [0.2500], [0.3340], [0.7500], [0.5000], [0.4004], [0.4004], [0.2002], [0.1426], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021209716796875 loss: 0.00183868408203125 loss: 0.00138092041015625 loss: 0.00133514404296875 predicted value: tensor([[0.8789], [0.6055], [1.0703], [0.5625], [0.5508], [0.3496], [0.4844], [1.0625], [0.6914], [1.0781], [0.5820], [0.1270], [0.2598], [0.7422], [0.2930], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [1.0000], [0.4668], [0.4668], [0.2500], [0.4668], [1.0000], [0.3340], [1.0000], [0.5000], [0.0400], [0.1670], [0.7500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00183868408203125 loss: 0.00506591796875 loss: 0.0031890869140625 loss: 0.002593994140625 predicted value: tensor([[0.4355], [0.6094], [0.4590], [1.0781], [0.8008], [0.2637], 
[0.3535], [1.0547], [0.7578], [0.4766], [0.7266], [0.4102], [0.5039], [0.2871], [0.2578], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.4668], [1.0000], [0.8008], [0.2002], [0.3340], [1.0000], [0.3750], [0.4668], [0.6016], [0.4004], [0.4004], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003387451171875loss: 0.002899169921875 loss: 0.001373291015625 loss: 0.004547119140625 predicted value: tensor([[0.5039], [0.6172], [0.7070], [1.0781], [0.5469], [0.3066], [0.5156], [0.5469], [0.7031], [0.6367], [0.4512], [0.4922], [0.4844], [0.3008], [0.2852], [0.2891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.5547], [1.0000], [0.4668], [0.2500], [0.3750], [0.3340], [0.7500], [0.6016], [0.3340], [0.5000], [0.5000], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0038909912109375 loss: 0.00180816650390625 loss: 0.00113677978515625 loss: 0.002288818359375 43%|████▎ | 213/492 [1:55:26<2:36:01, 33.55s/it] {'loss': 0.0102, 'learning_rate': 1e-05, 'epoch': 0.43} 43%|████▎ | 213/492 [1:55:26<2:36:01, 33.55s/it]predicted value: tensor([[0.6094], [0.8320], [1.0703], [0.5547], [0.7539], [1.0625], [1.0625], [0.6992], [0.6250], [0.4160], [0.1748], [0.4180], [0.6523], [0.4316], [0.2432], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [1.0000], [0.4004], [0.3750], [1.0000], [1.0000], [0.6016], [0.4668], [0.3340], [0.0400], [0.5000], [0.6016], [0.2852], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003448486328125 loss: 0.00433349609375loss: 0.0016632080078125 loss: 0.0032501220703125 predicted value: tensor([[1.0859], [0.8320], [1.0703], [0.8477], [0.4922], [1.0703], [0.2715], [0.6367], [0.7578], [0.5000], [0.4570], [0.3730], [0.6797], [0.2812], [0.4902], [0.3105]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], 
[0.8320], [1.0000], [0.8008], [0.4668], [1.0000], [0.2002], [0.7500], [0.6680], [0.5000], [0.4004], [0.6016], [0.6016], [0.0400], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002777099609375loss: 0.00133514404296875 loss: 0.002105712890625 loss: 0.004638671875 predicted value: tensor([[0.4668], [0.5742], [0.4961], [0.7422], [0.6055], [1.0547], [0.3828], [1.0703], [1.0391], [0.6094], [0.4785], [0.5938], [0.6992], [0.7031], [0.5078], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.4668], [0.8008], [0.7500], [1.0000], [0.3340], [1.0000], [1.0000], [0.2500], [0.3340], [0.5000], [0.7500], [0.6016], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037689208984375 loss: 0.0032958984375 loss: 0.00150299072265625 loss: 0.001495361328125 predicted value: tensor([[0.5938], [0.6992], [0.4785], [1.0469], [1.0859], [0.6641], [0.5078], [0.7383], [0.3594], [0.3516], [0.2207], [0.3809], [0.6680], [0.3066], [0.2637], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.4668], [1.0000], [1.0000], [0.5547], [0.4668], [0.8008], [0.3340], [0.2500], [0.2002], [0.5000], [0.4277], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00311279296875 loss: 0.0029449462890625 loss: 0.0022735595703125 loss: 0.0022125244140625 43%|████▎ | 214/492 [1:55:59<2:33:39, 33.16s/it] {'loss': 0.011, 'learning_rate': 1e-05, 'epoch': 0.43} 43%|████▎ | 214/492 [1:55:59<2:33:39, 33.16s/it]predicted value: tensor([[0.7656], [0.2988], [0.7266], [0.9648], [0.7188], [0.5117], [0.4336], [0.9688], [0.2754], [0.4668], [0.9844], [0.5195], [0.3301], [0.4238], [0.2227], [0.1777]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.3340], [0.4668], [1.0000], [0.8008], [0.7500], [0.4668], [1.0000], [0.2500], [0.5000], [1.0000], [0.5000], [0.2852], [0.5000], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) 
loss: 0.0013885498046875 loss: 0.0029449462890625 loss: 0.0025177001953125 loss: 0.0015106201171875 predicted value: tensor([[0.9883], [0.3965], [0.6680], [0.4062], [0.2715], [0.4297], [0.6211], [0.5000], [0.2871], [0.6914], [0.6328], [0.4883], [0.3867], [0.3750], [0.3613], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [0.4668], [0.2500], [0.4668], [0.7500], [0.6016], [0.2500], [0.8008], [0.7500], [0.6016], [0.4004], [0.5000], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000614166259765625 loss: 0.0023040771484375 loss: 0.00176239013671875 loss: 0.0012664794921875 predicted value: tensor([[0.4102], [0.7852], [0.9805], [0.7266], [0.3535], [0.9805], [0.5469], [0.1836], [0.9609], [0.5039], [0.5039], [0.5352], [0.3945], [0.3672], [0.2988], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [1.0000], [0.8008], [0.3750], [1.0000], [0.5547], [0.2002], [1.0000], [0.6016], [0.5000], [0.2500], [0.4004], [0.4004], [0.5000], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002899169921875 loss: 0.00213623046875 loss: 0.0023040771484375 loss: 0.001983642578125 predicted value: tensor([[0.4023], [0.3594], [0.9609], [0.2969], [0.9883], [0.2246], [0.9883], [0.6250], [0.6133], [0.5938], [0.9688], [0.2930], [0.4668], [0.4336], [0.2246], [0.3105]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [1.0000], [0.2500], [1.0000], [0.2500], [1.0000], [0.8008], [0.8008], [0.6016], [1.0000], [0.2500], [0.5000], [0.5000], [0.2500], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00122833251953125 loss: 0.000545501708984375 loss: 0.00144195556640625loss: 0.0020751953125 44%|████▎ | 215/492 [1:56:31<2:32:30, 33.03s/it] {'loss': 0.0072, 'learning_rate': 1e-05, 'epoch': 0.44} 44%|████▎ | 215/492 [1:56:31<2:32:30, 33.03s/it]predicted value: tensor([[0.3984], [0.2734], [0.2676], [0.6016], [0.9688], 
[0.4316], [0.9844], [0.2158], [0.6289], [0.5547], [0.3887], [0.9844], [0.4082], [0.3672], [0.1904], [0.1777]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.2002], [0.8008], [1.0000], [0.4668], [1.0000], [0.2500], [0.7500], [0.7500], [0.4004], [1.0000], [0.5000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00103759765625 loss: 0.00182342529296875loss: 0.002227783203125 loss: 0.0007171630859375 predicted value: tensor([[0.4219], [0.7422], [0.2363], [0.4824], [0.6641], [0.4258], [0.4141], [0.6133], [0.4355], [0.4863], [0.9727], [0.6875], [0.2891], [0.3906], [0.3516], [0.2295]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.3340], [0.5547], [0.8008], [0.4668], [0.3145], [0.7500], [0.6016], [0.6016], [1.0000], [0.8008], [0.2500], [0.5000], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025482177734375loss: 0.003173828125 loss: 0.00180816650390625 loss: 0.00179290771484375 predicted value: tensor([[0.5117], [0.3887], [0.9766], [0.9688], [0.4238], [0.2373], [0.9883], [0.6367], [0.3457], [0.6875], [0.5703], [0.5352], [0.3887], [0.1836], [0.1934], [0.2383]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [1.0000], [1.0000], [0.3750], [0.2500], [1.0000], [0.5000], [0.4004], [0.8008], [0.6016], [0.6016], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016326904296875 loss: 0.000843048095703125loss: 0.0030364990234375 loss: 0.002716064453125 predicted value: tensor([[0.5312], [0.4258], [0.2314], [0.5391], [0.7109], [0.9766], [0.6992], [0.3828], [0.4453], [0.4316], [0.3496], [0.5039], [0.1875], [0.2061], [0.1846], [0.2090]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.2500], [0.5547], [0.8008], [1.0000], [0.8008], [0.4668], [0.6016], [0.3750], [0.4004], [0.6016], [0.2002], [0.1670], [0.2500], 
[0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021209716796875 loss: 0.00046539306640625 loss: 0.00116729736328125 loss: 0.00408935546875 44%|████▍ | 216/492 [1:57:04<2:31:16, 32.88s/it] {'loss': 0.0078, 'learning_rate': 1e-05, 'epoch': 0.44} 44%|████▍ | 216/492 [1:57:04<2:31:16, 32.88s/it]predicted value: tensor([[1.1484], [1.0859], [0.5156], [1.0781], [0.5195], [1.0078], [0.6719], [1.0781], [0.8359], [0.3809], [0.5039], [0.4258], [0.4551], [0.2734], [0.2852], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [1.0000], [0.4668], [1.0000], [0.6016], [1.0000], [0.8008], [0.3340], [0.2500], [0.3340], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001861572265625 loss: 0.002410888671875loss: 0.00127410888671875 loss: 0.00286865234375 predicted value: tensor([[0.8359], [0.5938], [0.4805], [0.3301], [0.2832], [0.4766], [1.0625], [0.4980], [0.5859], [1.0703], [0.4395], [0.3066], [0.2734], [0.5000], [0.3125], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.5547], [0.3750], [0.2500], [0.2002], [0.7500], [1.0000], [0.4668], [0.6680], [1.0000], [0.4004], [0.3340], [0.2500], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00135040283203125 loss: 0.00177764892578125 loss: 0.0020751953125 loss: 0.002532958984375 predicted value: tensor([[0.8984], [0.5039], [0.9180], [0.6523], [0.3906], [0.5195], [0.3418], [0.6602], [0.6172], [0.3477], [0.7773], [0.4141], [0.3105], [0.4316], [0.2891], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.8320], [0.6016], [0.2500], [0.5000], [0.2500], [0.5000], [0.6016], [0.2500], [0.8008], [0.3340], [0.2002], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00244140625 loss: 0.00157928466796875 loss: 0.00177764892578125 loss: 0.002410888671875 predicted value: tensor([[1.0938], 
[0.4707], [1.0859], [0.8281], [1.0859], [1.0703], [0.6797], [0.5820], [0.5859], [0.6406], [0.5039], [0.5312], [0.3555], [0.3008], [0.5859], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.8008], [1.0000], [1.0000], [0.6016], [0.6016], [0.8008], [0.6016], [0.5000], [0.5000], [0.2500], [0.2500], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00244140625 loss: 0.0023651123046875 loss: 0.00144195556640625 loss: 0.00180816650390625 44%|████▍ | 217/492 [1:57:38<2:32:57, 33.37s/it] {'loss': 0.0081, 'learning_rate': 1e-05, 'epoch': 0.44} 44%|████▍ | 217/492 [1:57:38<2:32:57, 33.37s/it]predicted value: tensor([[0.6211], [0.8203], [0.5195], [0.5234], [0.4785], [1.0859], [0.8242], [1.0781], [0.7812], [0.4883], [1.0625], [0.4395], [0.4922], [0.5000], [0.2734], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.6680], [0.4668], [0.4668], [0.4668], [1.0000], [0.8008], [1.0000], [0.6680], [0.4668], [1.0000], [0.4004], [0.4004], [0.4668], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00110626220703125 loss: 0.0013427734375loss: 0.000942230224609375 loss: 0.0022735595703125 predicted value: tensor([[0.5273], [0.4980], [0.2969], [1.0234], [0.8438], [0.8398], [0.5859], [0.5039], [0.6680], [0.3438], [0.3340], [0.5039], [0.4102], [0.1221], [0.2891], [0.3027]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.2500], [1.0000], [0.8008], [0.8008], [0.5000], [0.7500], [0.5000], [0.2002], [0.2500], [0.5000], [0.2852], [0.0400], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00262451171875 loss: 0.00153350830078125 loss: 0.00139617919921875 loss: 0.003265380859375 predicted value: tensor([[0.9062], [0.3672], [0.5352], [0.2617], [0.5234], [1.0703], [0.3262], [0.6992], [0.5078], [0.4688], [0.6289], [0.6641], [0.1396], [0.4844], [0.2275], [0.2637]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [0.4668], [0.2500], [0.3750], [1.0000], [0.2002], [0.6016], [0.3750], [0.4004], [0.6016], [0.6016], [0.0278], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001251220703125 loss: 0.00193023681640625 loss: 0.0015411376953125 loss: 0.00194549560546875 predicted value: tensor([[0.4727], [1.0469], [1.0781], [0.3828], [1.0703], [1.0859], [1.0391], [0.8281], [0.5000], [0.5664], [0.3984], [0.6562], [0.2949], [0.2812], [0.2949], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [0.2500], [1.0000], [1.0000], [1.0000], [0.7500], [0.4668], [0.4668], [0.4004], [0.5000], [0.2500], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00238037109375 loss: 0.0015106201171875 loss: 0.0014801025390625loss: 0.0020294189453125 44%|████▍ | 218/492 [1:58:11<2:31:14, 33.12s/it] {'loss': 0.0071, 'learning_rate': 1e-05, 'epoch': 0.44} 44%|████▍ | 218/492 [1:58:11<2:31:14, 33.12s/it]predicted value: tensor([[0.3926], [0.6016], [0.4473], [0.3926], [0.3809], [0.9648], [0.6211], [0.7148], [0.9844], [0.5508], [0.9766], [0.0077], [0.3223], [0.2559], [0.1494], [0.1572]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5000], [0.4668], [0.4668], [0.4668], [1.0000], [0.6016], [0.8008], [1.0000], [0.7500], [1.0000], [0.0625], [0.3340], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012969970703125 loss: 0.00075531005859375 loss: 0.00167083740234375 loss: 0.001556396484375 predicted value: tensor([[0.4746], [0.9883], [0.2793], [0.9727], [0.7773], [0.4336], [0.9492], [0.4316], [0.1729], [0.3789], [0.9922], [0.4141], [0.1855], [0.3105], [0.1553], [0.2080]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.2500], [1.0000], [0.6680], [0.3750], [1.0000], [0.5000], [0.2500], [0.4004], [1.0000], [0.4277], 
[0.5000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000659942626953125 loss: 0.00201416015625 loss: 0.00225830078125 loss: 0.00152587890625 predicted value: tensor([[0.7422], [1.0156], [0.7773], [0.4023], [0.7227], [0.3770], [0.7734], [0.2314], [0.9844], [0.5859], [0.3828], [0.5781], [1.0078], [0.1553], [0.1699], [0.1631]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.8008], [0.3750], [0.8008], [0.4668], [0.8320], [0.2500], [1.0000], [0.5000], [0.4004], [0.6680], [1.0000], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002716064453125 loss: 0.0025787353515625 loss: 0.0009307861328125 loss: 0.000713348388671875 predicted value: tensor([[0.6016], [0.8203], [0.9883], [0.4395], [0.7695], [0.8086], [0.7070], [0.9531], [0.5156], [0.6328], [0.3809], [0.9453], [0.2363], [0.1299], [0.1338], [0.1807]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5312], [0.8320], [1.0000], [0.4668], [0.8008], [0.8008], [0.7500], [1.0000], [0.6016], [0.5703], [0.4004], [1.0000], [0.2500], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00173187255859375 loss: 0.00125885009765625 loss: 0.00055694580078125loss: 0.00049591064453125 45%|████▍ | 219/492 [1:58:44<2:30:10, 33.01s/it] {'loss': 0.0057, 'learning_rate': 1e-05, 'epoch': 0.45} 45%|████▍ | 219/492 [1:58:44<2:30:10, 33.01s/it]predicted value: tensor([[0.9766], [0.7891], [0.4551], [0.4785], [0.9609], [0.7734], [0.7500], [0.7852], [0.6445], [0.9844], [0.1660], [0.5820], [0.3984], [0.3203], [0.1318], [0.1729]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.4668], [0.4668], [1.0000], [0.8008], [0.8008], [0.8008], [0.6016], [1.0000], [0.2500], [0.4668], [0.5000], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000553131103515625 loss: 0.000782012939453125loss: 0.000518798828125 loss: 
0.0003528594970703125 predicted value: tensor([[0.3789], [0.7188], [0.3418], [0.6172], [0.9688], [0.2988], [0.7930], [0.5625], [0.9922], [0.6211], [0.6836], [0.3086], [0.3750], [0.3867], [0.3691], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.7148], [0.3750], [0.5000], [1.0000], [0.3340], [0.8320], [0.5000], [1.0000], [0.6016], [0.6680], [0.3340], [0.4668], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001678466796875 loss: 0.00098419189453125 loss: 0.000743865966796875 loss: 0.000904083251953125 predicted value: tensor([[0.5195], [0.2715], [0.3926], [0.7930], [0.4121], [0.2070], [0.5586], [0.4785], [0.3809], [0.4102], [0.7188], [0.4941], [0.1836], [0.4375], [0.3594], [0.1680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.3340], [0.4668], [0.8008], [0.4668], [0.2002], [0.8008], [0.4668], [0.3750], [0.4668], [0.6680], [0.6016], [0.2500], [0.4668], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001556396484375loss: 0.001068115234375 loss: 0.001922607421875 loss: 0.000274658203125 predicted value: tensor([[0.5117], [0.2334], [0.7891], [0.6211], [0.9766], [0.9922], [1.0156], [0.6797], [0.9961], [0.5938], [0.5508], [0.1592], [0.3164], [0.1670], [0.1621], [0.1748]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.2002], [0.8008], [0.8008], [1.0000], [1.0000], [1.0000], [0.6016], [1.0000], [0.5000], [0.6016], [0.4004], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0003795623779296875 loss: 0.0021514892578125 loss: 0.00146484375 loss: 0.00099945068359375 45%|████▍ | 220/492 [1:59:17<2:29:50, 33.05s/it] {'loss': 0.0041, 'learning_rate': 1e-05, 'epoch': 0.45} 45%|████▍ | 220/492 [1:59:17<2:29:50, 33.05s/it]predicted value: tensor([[0.5430], [0.4844], [1.0469], [0.6758], [0.6094], [1.0703], [0.5391], [0.5898], [0.7812], [0.6484], [0.6953], [0.5273], [0.4727], 
[0.4609], [0.2275], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6797], [0.4668], [1.0000], [0.3750], [0.3750], [1.0000], [0.8008], [0.3750], [0.7500], [0.5000], [0.6016], [0.5000], [0.3340], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021209716796875 loss: 0.00543212890625loss: 0.0023193359375 loss: 0.00159454345703125 predicted value: tensor([[0.9883], [0.5898], [1.0703], [1.0625], [0.7852], [0.5352], [0.7891], [1.0781], [1.0625], [0.3320], [0.5312], [0.6445], [0.4941], [0.2539], [0.2910], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4648], [1.0000], [1.0000], [0.7500], [0.4668], [0.4668], [1.0000], [1.0000], [0.3340], [0.4668], [0.7500], [0.4004], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001617431640625 loss: 0.00164794921875 loss: 0.003265380859375 loss: 0.0023956298828125 predicted value: tensor([[0.7070], [0.5000], [0.9062], [1.0469], [0.6953], [0.7344], [0.5703], [0.5625], [0.4688], [0.7383], [0.5898], [1.0547], [0.2559], [0.4434], [0.2832], [0.2402]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8320], [1.0000], [0.8320], [0.6016], [0.4668], [0.4668], [0.3750], [0.7500], [0.5000], [1.0000], [0.2002], [0.4004], [0.2500], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00070953369140625 loss: 0.00177001953125 loss: 0.00193023681640625 loss: 0.0019073486328125 predicted value: tensor([[1.0625], [0.5781], [1.0859], [0.9766], [0.5430], [0.3066], [1.0781], [1.0391], [0.3379], [1.1016], [0.7266], [0.8750], [0.6289], [0.4512], [0.4805], [0.4492]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3145], [1.0000], [0.8320], [0.3750], [0.2500], [1.0000], [1.0000], [0.2500], [1.0000], [0.7500], [0.8008], [0.6016], [0.4004], [0.5000], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032806396484375 loss: 
0.0026702880859375loss: 0.0031280517578125 loss: 0.002105712890625 45%|████▍ | 221/492 [1:59:50<2:29:46, 33.16s/it] {'loss': 0.0095, 'learning_rate': 1e-05, 'epoch': 0.45} 45%|████▍ | 221/492 [1:59:50<2:29:46, 33.16s/it]predicted value: tensor([[0.6445], [0.3379], [1.0547], [0.8789], [0.4688], [0.6680], [0.8594], [0.4023], [0.2656], [0.3633], [0.5391], [0.5156], [0.4004], [0.2656], [0.2715], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [1.0000], [0.8008], [0.4668], [0.6016], [0.8008], [0.3340], [0.2002], [0.3340], [0.4004], [0.4004], [0.3340], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000858306884765625 loss: 0.00116729736328125 loss: 0.0013885498046875 loss: 0.00133514404296875 predicted value: tensor([[0.5938], [1.0391], [0.3340], [0.8477], [0.4766], [0.5898], [0.5195], [1.0234], [0.4746], [1.0469], [0.3652], [0.4688], [0.6562], [0.2422], [0.4824], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3340], [0.8008], [0.2500], [0.7500], [0.3750], [1.0000], [0.3750], [1.0000], [0.6016], [0.5000], [0.7500], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021820068359375 loss: 0.002960205078125 loss: 0.000598907470703125 loss: 0.0019683837890625 predicted value: tensor([[0.6523], [0.8984], [0.3418], [0.8242], [0.7344], [0.7617], [1.0469], [0.3047], [0.6133], [0.7344], [0.6289], [0.6016], [0.4512], [0.4082], [0.2871], [0.2373]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.3340], [0.6680], [0.6016], [0.4668], [1.0000], [0.2500], [0.8008], [0.6680], [0.5000], [0.5000], [0.4004], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013885498046875 loss: 0.003509521484375 loss: 0.0027618408203125 loss: 0.001434326171875 predicted value: tensor([[0.3027], [0.9258], [1.0547], [1.0391], [0.6680], [0.8477], [0.3145], [0.7930], 
[1.0391], [1.0547], [0.6719], [1.0547], [0.5820], [0.5391], [0.4668], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.1670], [0.8008], [1.0000], [1.0000], [0.5000], [0.8320], [0.3340], [0.6016], [1.0000], [1.0000], [0.5000], [1.0000], [0.6016], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026397705078125 loss: 0.0026702880859375 loss: 0.00299072265625 loss: 0.000736236572265625 45%|████▌ | 222/492 [2:00:23<2:28:45, 33.06s/it] {'loss': 0.0076, 'learning_rate': 1e-05, 'epoch': 0.45} 45%|████▌ | 222/492 [2:00:23<2:28:45, 33.06s/it]predicted value: tensor([[0.4980], [0.9492], [0.7695], [0.5508], [0.1865], [0.5469], [0.6562], [0.6523], [0.3496], [0.5391], [0.5508], [0.2129], [0.2441], [0.3555], [0.1602], [0.3457]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8008], [0.4648], [0.2500], [0.3750], [0.7500], [0.3750], [0.4004], [0.6016], [0.6016], [0.2002], [0.7500], [0.4004], [0.2500], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033416748046875 loss: 0.001953125 loss: 0.00677490234375 loss: 0.00151824951171875 predicted value: tensor([[0.4297], [0.4238], [0.5039], [0.9531], [0.9688], [0.6562], [0.1045], [0.9492], [0.6406], [0.9727], [0.6680], [0.5391], [0.1621], [0.4004], [0.1641], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.5547], [1.0000], [1.0000], [0.7500], [0.1670], [1.0000], [0.6016], [1.0000], [0.7500], [0.5000], [0.1426], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000885009765625 loss: 0.0014495849609375 loss: 0.000560760498046875 loss: 0.00115203857421875 predicted value: tensor([[0.7930], [0.9258], [0.4141], [0.9336], [0.4512], [0.2314], [0.1816], [0.6016], [0.4648], [0.3379], [0.3965], [0.2100], [0.3945], [0.1914], [0.2559], [0.2148]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.4668], 
[1.0000], [0.4668], [0.2500], [0.3340], [0.7500], [0.5000], [0.4668], [0.5000], [0.2500], [0.5000], [0.2002], [0.0625], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00072479248046875 loss: 0.00188446044921875 loss: 0.0027008056640625 loss: 0.0020599365234375 predicted value: tensor([[0.8477], [0.9375], [0.4062], [0.9688], [0.5078], [0.7969], [0.5352], [0.5859], [0.4004], [0.5938], [0.9336], [0.3027], [0.4121], [0.3926], [0.2080], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.4668], [1.0000], [0.5547], [0.8008], [0.6016], [0.6016], [0.4668], [0.7500], [1.0000], [0.3340], [0.5000], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002349853515625 loss: 0.001739501953125 loss: 0.000957489013671875 45%|████▌ | 223/492 [2:00:56<2:27:46, 32.96s/it]loss: 0.0031585693359375 {'loss': 0.0083, 'learning_rate': 1e-05, 'epoch': 0.45} 45%|████▌ | 223/492 [2:00:56<2:27:46, 32.96s/it]predicted value: tensor([[0.9531], [0.4590], [0.9570], [0.3867], [0.5039], [0.4004], [0.9336], [0.4746], [0.5781], [0.5273], [0.3574], [0.5703], [0.3789], [0.2266], [0.4082], [0.1855]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.4668], [0.5547], [0.3750], [1.0000], [0.7500], [0.6016], [0.6016], [0.2500], [0.7500], [0.4004], [0.2500], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023651123046875 loss: 0.0012664794921875 loss: 0.0024566650390625 loss: 0.0009918212890625 predicted value: tensor([[0.4004], [0.5156], [0.5820], [0.5039], [0.6055], [0.7422], [0.3809], [0.6133], [0.2314], [0.5352], [0.4766], [0.4414], [0.2656], [0.9414], [0.1650], [0.2178]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.6680], [0.5547], [0.6680], [0.8008], [0.3750], [0.6016], [0.2500], [0.6016], [0.6016], [0.4004], [0.0400], [1.0000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.002166748046875 loss: 0.00159454345703125 loss: 0.0028076171875 loss: 0.000789642333984375 predicted value: tensor([[0.9414], [0.4141], [0.5078], [0.4238], [0.2217], [0.4668], [0.3574], [0.2080], [0.9492], [0.2236], [0.5117], [0.9453], [0.3340], [0.3867], [0.4902], [0.1982]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.5547], [0.4668], [0.1670], [0.4668], [0.4668], [0.2002], [1.0000], [0.2500], [0.6016], [1.0000], [0.6016], [0.3340], [0.6016], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019683837890625 loss: 0.00201416015625 loss: 0.00125885009765625 loss: 0.0019683837890625 predicted value: tensor([[0.2480], [0.6875], [0.9414], [0.2129], [0.5664], [0.4023], [0.7617], [0.5469], [0.3730], [0.9297], [0.1758], [0.3809], [0.3477], [0.5273], [0.4082], [0.1748]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.6680], [1.0000], [0.2500], [0.5547], [0.4668], [0.7500], [0.4668], [0.5000], [1.0000], [0.3340], [0.4004], [0.3340], [0.5000], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0035247802734375 loss: 0.00142669677734375 loss: 0.0013275146484375 loss: 0.00151824951171875 46%|████▌ | 224/492 [2:01:29<2:27:38, 33.06s/it] {'loss': 0.0074, 'learning_rate': 1e-05, 'epoch': 0.46} 46%|████▌ | 224/492 [2:01:29<2:27:38, 33.06s/it]predicted value: tensor([[0.8477], [0.3418], [0.7617], [1.0469], [0.5820], [0.8008], [1.0469], [0.5117], [1.0391], [0.6562], [0.3789], [0.4551], [0.6172], [0.3047], [0.2256], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.2500], [0.6680], [1.0000], [0.4648], [0.8008], [1.0000], [0.4668], [1.0000], [0.6016], [0.3340], [0.4004], [0.6016], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00148773193359375 loss: 0.001678466796875 loss: 0.00098419189453125 loss: 0.0011444091796875 predicted value: tensor([[0.6055], [0.5000], [0.5508], [1.0312], [0.8242], 
[0.7422], [0.5742], [1.0469], [1.0312], [0.6133], [0.4238], [0.3203], [0.4648], [0.3340], [0.2520], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.4668], [1.0000], [0.8320], [0.8008], [0.5547], [1.0000], [1.0000], [0.5000], [0.4004], [0.3340], [0.4004], [0.0400], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011444091796875 loss: 0.00262451171875loss: 0.00250244140625 loss: 0.00130462646484375 predicted value: tensor([[1.0391], [0.9102], [0.5078], [1.0234], [0.4902], [0.8203], [0.8203], [0.7383], [0.7109], [0.6133], [1.0391], [0.5000], [0.7422], [0.6680], [0.2812], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.4668], [1.0000], [0.4668], [0.8008], [0.8008], [0.5547], [0.5000], [0.7500], [1.0000], [0.4004], [0.8008], [0.7500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002349853515625loss: 0.0016632080078125 loss: 0.002166748046875 loss: 0.0028076171875 predicted value: tensor([[0.4941], [0.6406], [1.0391], [0.5352], [0.5039], [0.5977], [0.7383], [0.7812], [0.4570], [0.7656], [0.1641], [0.5586], [0.5000], [0.3223], [0.2578], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [1.0000], [0.4668], [0.4668], [0.4668], [0.6680], [0.8008], [0.4004], [0.8008], [0.0400], [0.6016], [0.4004], [0.2500], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002777099609375 loss: 0.0019683837890625loss: 0.00150299072265625 loss: 0.0009002685546875 46%|████▌ | 225/492 [2:02:03<2:28:41, 33.41s/it] {'loss': 0.0073, 'learning_rate': 1e-05, 'epoch': 0.46} 46%|████▌ | 225/492 [2:02:03<2:28:41, 33.41s/it]predicted value: tensor([[0.8906], [1.0547], [0.5977], [1.0469], [0.7617], [0.8047], [0.7148], [1.0391], [0.6289], [0.4336], [0.1045], [0.4453], [1.0547], [0.2891], [0.4453], [0.2598]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.8750], [1.0000], [0.5547], [1.0000], [0.8008], [0.8008], [0.7500], [1.0000], [0.5000], [0.4004], [0.0400], [0.4004], [1.0000], [0.2002], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00084686279296875 loss: 0.0024261474609375loss: 0.000904083251953125 loss: 0.0029754638671875 predicted value: tensor([[0.4512], [1.0859], [0.2949], [1.0703], [0.5430], [0.6914], [0.7930], [0.3945], [0.5508], [0.7031], [0.6797], [0.6133], [0.4570], [0.5938], [0.2715], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.2002], [1.0000], [0.5547], [0.6016], [0.8320], [0.2500], [0.8008], [0.5000], [0.7500], [0.4277], [0.5000], [0.6016], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002044677734375 loss: 0.001251220703125 loss: 0.003173828125 loss: 0.001953125 predicted value: tensor([[0.5117], [0.9023], [0.3809], [0.5977], [0.7109], [0.2969], [0.4844], [0.6367], [0.4688], [0.3594], [0.5820], [0.2910], [0.4648], [0.4219], [0.2852], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.2500], [0.5547], [0.6680], [0.2002], [0.4004], [0.5000], [0.3750], [0.2500], [0.6016], [0.2002], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00131988525390625 loss: 0.00164794921875loss: 0.00191497802734375 loss: 0.0019989013671875 predicted value: tensor([[0.6914], [0.8672], [0.5078], [0.8750], [0.3555], [0.3633], [0.7148], [1.0234], [0.3535], [0.4336], [0.7148], [0.4668], [0.2520], [0.7070], [0.2656], [0.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8555], [0.4668], [0.8320], [0.2500], [0.3340], [0.7500], [1.0000], [0.2500], [0.3340], [0.8008], [0.5000], [0.2002], [0.7500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001190185546875 loss: 0.00177001953125 loss: 0.00130462646484375 loss: 0.0019683837890625 46%|████▌ | 226/492 [2:02:42<2:35:14, 
35.02s/it] {'loss': 0.0072, 'learning_rate': 1e-05, 'epoch': 0.46} 46%|████▌ | 226/492 [2:02:42<2:35:14, 35.02s/it]predicted value: tensor([[0.5039], [0.4355], [0.3828], [0.2891], [0.4961], [0.4785], [0.2080], [0.4473], [0.6562], [0.4766], [0.4531], [0.5430], [0.3359], [0.1885], [0.3672], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.2500], [0.6016], [0.5000], [0.2002], [0.4668], [0.8008], [0.5000], [0.4668], [0.5000], [0.3340], [0.2002], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00118255615234375 loss: 0.00102996826171875 loss: 0.00096893310546875 loss: 0.001922607421875 predicted value: tensor([[0.3516], [0.7422], [0.7031], [0.4043], [0.6211], [0.6289], [0.9883], [0.1787], [0.5117], [0.4629], [0.3867], [0.4824], [0.9570], [0.2832], [0.2188], [0.1533]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.7500], [0.4668], [0.6016], [0.7500], [1.0000], [0.3340], [0.6016], [0.6016], [0.4668], [0.5000], [1.0000], [0.2500], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00109100341796875 loss: 0.001556396484375loss: 0.00118255615234375 loss: 0.0034637451171875 predicted value: tensor([[0.5039], [0.9414], [0.3574], [0.9570], [0.4922], [0.1748], [0.2275], [0.2305], [0.5547], [0.1992], [0.5859], [0.2070], [0.5352], [0.3535], [0.2266], [0.1621]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [1.0000], [0.5547], [0.2500], [0.3340], [0.3340], [0.6016], [0.2500], [0.6016], [0.2500], [0.6016], [0.3340], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00152587890625 loss: 0.0006561279296875 loss: 0.0009918212890625 loss: 0.00274658203125 predicted value: tensor([[0.5586], [0.7812], [0.4102], [0.7344], [0.3516], [0.2559], [0.4707], [0.9883], [0.9492], [0.6250], [0.5508], [0.6211], [0.4121], [0.3965], [0.1924], [0.1719]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.4668], [0.8008], [0.4668], [0.3340], [0.5547], [1.0000], [1.0000], [0.5703], [0.6016], [0.7500], [0.4004], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002593994140625 loss: 0.002410888671875 loss: 0.00116729736328125 loss: 0.001708984375 46%|████▌ | 227/492 [2:03:18<2:35:44, 35.26s/it] {'loss': 0.0065, 'learning_rate': 1e-05, 'epoch': 0.46} 46%|████▌ | 227/492 [2:03:18<2:35:44, 35.26s/it]predicted value: tensor([[0.6211], [0.9727], [0.7148], [0.4824], [0.3164], [0.9648], [0.1904], [0.6367], [0.5352], [0.4297], [0.3223], [0.3574], [0.2100], [0.3594], [0.1836], [0.1992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.8008], [0.1670], [0.2500], [1.0000], [0.3340], [0.7500], [0.6016], [0.5000], [0.4004], [0.5000], [0.2500], [0.3340], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0031585693359375 loss: 0.000942230224609375 loss: 0.0036773681640625 loss: 0.0011444091796875 predicted value: tensor([[0.3887], [0.4434], [0.3672], [0.6211], [0.2578], [0.9922], [0.6602], [0.5703], [0.4902], [0.5352], [0.6211], [0.3652], [0.4141], [0.3105], [0.2012], [0.1826]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3145], [0.6680], [0.3340], [1.0000], [0.8008], [0.6016], [0.5000], [0.6016], [0.6016], [0.4004], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00089263916015625loss: 0.0025482177734375 loss: 0.0017242431640625 loss: 0.00113677978515625 predicted value: tensor([[0.6797], [0.2773], [0.4199], [0.5430], [0.3926], [0.5781], [1.0156], [0.6172], [0.3828], [0.6016], [0.4785], [0.4844], [0.3301], [0.2129], [0.3223], [0.1553]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5703], [0.2500], [0.4668], [0.4668], [0.3750], [0.6016], [1.0000], [0.6016], [0.4668], [0.6016], [0.0625], [0.7500], 
[0.4004], [0.6016], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00141143798828125 loss: 0.0072021484375loss: 0.005157470703125 loss: 0.002105712890625 predicted value: tensor([[0.7617], [0.9844], [0.4180], [0.5703], [0.2246], [0.2832], [0.2539], [0.2119], [0.3867], [0.2734], [0.5781], [0.3906], [0.3535], [0.1553], [0.1777], [0.2021]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.4668], [0.8008], [0.2500], [0.2500], [0.2500], [0.2002], [0.3750], [0.2500], [0.7500], [0.5000], [0.4668], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000850677490234375 loss: 0.00188446044921875 loss: 0.00186920166015625 loss: 0.00138092041015625 46%|████▋ | 228/492 [2:03:54<2:35:55, 35.44s/it] {'loss': 0.0093, 'learning_rate': 1e-05, 'epoch': 0.46} 46%|████▋ | 228/492 [2:03:54<2:35:55, 35.44s/it]predicted value: tensor([[0.8672], [0.4707], [0.3535], [0.4668], [1.0547], [1.0781], [1.0469], [1.1016], [0.7148], [0.2969], [0.5312], [0.4160], [0.4609], [0.2793], [0.4824], [0.3105]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3145], [0.2500], [0.4668], [1.0000], [1.0000], [1.0000], [1.0000], [0.4668], [0.2500], [0.4668], [0.3340], [0.4004], [0.2500], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028533935546875 loss: 0.00165557861328125 loss: 0.002471923828125 loss: 0.0024871826171875 predicted value: tensor([[1.0859], [1.0859], [0.4746], [0.4043], [0.5898], [0.7109], [1.0781], [0.5820], [0.4707], [1.0312], [1.0625], [0.5898], [0.4453], [0.4023], [0.2695], [0.2598]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.2500], [0.6016], [0.8008], [1.0000], [0.4648], [0.3340], [1.0000], [1.0000], [0.5000], [0.6016], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032958984375 loss: 0.002288818359375 loss: 0.00170135498046875 loss: 0.0019989013671875 
predicted value: tensor([[0.4141], [1.0859], [1.0781], [0.6875], [1.0859], [0.8164], [0.6367], [0.6328], [0.5586], [0.5508], [0.4570], [0.6953], [0.3867], [0.5977], [0.2676], [0.4316]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [1.0000], [1.0000], [0.4668], [1.0000], [0.8008], [0.6016], [0.6016], [0.6016], [0.5000], [0.4004], [0.6016], [0.4004], [0.6016], [0.2002], [0.6680]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00194549560546875 loss: 0.00274658203125 loss: 0.00244140625 loss: 0.002044677734375 predicted value: tensor([[0.8789], [0.5352], [1.0547], [0.4707], [1.1016], [0.7227], [0.4844], [0.5195], [0.3398], [0.7070], [0.4277], [0.4180], [0.5312], [0.2500], [0.2598], [0.3066]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [1.0000], [0.3750], [1.0000], [0.6680], [0.3750], [0.3750], [0.3340], [0.6016], [0.3340], [0.4004], [0.5000], [0.1670], [0.1670], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00153350830078125 loss: 0.0016021728515625loss: 0.00183868408203125 loss: 0.0027923583984375 47%|████▋ | 229/492 [2:04:27<2:32:26, 34.78s/it] {'loss': 0.0089, 'learning_rate': 1e-05, 'epoch': 0.47} 47%|████▋ | 229/492 [2:04:27<2:32:26, 34.78s/it]predicted value: tensor([[0.7461], [0.8477], [1.0469], [0.8438], [0.7617], [0.6680], [1.0625], [0.3691], [0.4922], [0.6367], [0.5664], [0.5039], [0.4453], [0.2754], [0.2441], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.8320], [1.0000], [0.8320], [0.7500], [0.8008], [1.0000], [0.2500], [0.3750], [0.6016], [0.6016], [0.4668], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00116729736328125 loss: 0.002288818359375 loss: 0.00115966796875 loss: 0.00146484375 predicted value: tensor([[1.0625], [0.8164], [1.0625], [0.7148], [1.0469], [0.8047], [0.3711], [0.8047], [0.5508], [0.4707], [0.6172], [0.4746], [0.4395], [0.4121], [0.2158], [0.2520]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [1.0000], [0.6016], [1.0000], [0.8008], [0.2500], [0.8008], [0.6016], [0.4668], [0.6016], [0.4004], [0.4004], [0.2852], [0.2500], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000965118408203125 loss: 0.00119781494140625loss: 0.001190185546875 loss: 0.0032806396484375 predicted value: tensor([[0.5078], [0.3320], [0.7031], [0.7734], [0.4727], [1.0391], [0.4766], [0.3066], [0.4512], [1.0547], [0.6914], [0.5430], [0.4551], [0.6406], [0.2031], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.6680], [0.6680], [0.3750], [1.0000], [0.4668], [0.2500], [0.3750], [1.0000], [0.7500], [0.4668], [0.4004], [0.6016], [0.0400], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00299072265625 loss: 0.0013275146484375 loss: 0.0019378662109375 loss: 0.002349853515625 predicted value: tensor([[0.4492], [0.5117], [0.7305], [1.0625], [0.6367], [0.6680], [0.3340], [0.7188], [0.7812], [0.3359], [0.4746], [0.6406], [0.4590], [0.4258], [0.2578], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [1.0000], [0.8008], [0.7500], [0.3340], [0.8008], [0.8008], [0.2002], [0.4004], [0.6016], [0.3340], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025482177734375 loss: 0.00124359130859375 loss: 0.00179290771484375 loss: 0.0007171630859375 47%|████▋ | 230/492 [2:05:00<2:30:04, 34.37s/it] {'loss': 0.0069, 'learning_rate': 1e-05, 'epoch': 0.47} 47%|████▋ | 230/492 [2:05:00<2:30:04, 34.37s/it]predicted value: tensor([[0.2754], [0.7578], [0.2695], [0.4414], [0.7383], [0.4609], [0.3223], [0.9453], [0.9375], [0.5664], [0.4277], [0.6406], [0.5430], [0.3555], [0.2354], [0.1543]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.8320], [0.3340], [0.4668], [0.8320], [0.4668], [0.3145], [1.0000], [1.0000], [0.7500], 
[0.4004], [0.7500], [0.4277], [0.3340], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000598907470703125 loss: 0.00151824951171875 loss: 0.0015411376953125 loss: 0.00061798095703125 predicted value: tensor([[0.2910], [0.3672], [0.6133], [0.7227], [0.4805], [0.3809], [0.9414], [0.7227], [0.9414], [0.0400], [0.3574], [0.3145], [0.4141], [0.9453], [0.4238], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.3750], [0.5000], [0.6680], [0.6680], [0.3750], [1.0000], [0.8008], [1.0000], [0.0400], [0.4004], [0.4004], [0.5000], [1.0000], [0.5000], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00124359130859375 loss: 0.00157928466796875 loss: 0.0017547607421875 loss: 0.0015869140625 predicted value: tensor([[0.4160], [0.4336], [0.3848], [0.6445], [0.5156], [0.7227], [0.9766], [0.4258], [0.4180], [0.4648], [0.5781], [0.3613], [0.3359], [0.3672], [0.2432], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.4668], [0.8008], [0.5547], [0.8008], [1.0000], [0.3750], [0.4668], [0.5000], [0.6680], [0.4004], [0.3340], [0.4004], [0.2500], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004730224609375 loss: 0.00118255615234375loss: 0.0013427734375 loss: 0.0032196044921875 predicted value: tensor([[0.4141], [0.9375], [0.4375], [0.1885], [0.6836], [0.5977], [0.4512], [0.4922], [0.5039], [0.2949], [0.9688], [0.5078], [0.3535], [0.3945], [0.1709], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.2002], [0.8008], [0.6680], [0.3750], [0.6016], [0.6016], [0.2002], [1.0000], [0.5000], [0.2002], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00180816650390625 loss: 0.00136566162109375 loss: 0.001068115234375 loss: 0.00138092041015625 47%|████▋ | 231/492 [2:05:33<2:27:45, 33.97s/it] {'loss': 0.0066, 'learning_rate': 1e-05, 'epoch': 0.47} 47%|████▋ | 231/492 
[2:05:33<2:27:45, 33.97s/it]predicted value: tensor([[0.9453], [0.3867], [0.2871], [0.2559], [0.9336], [0.7227], [0.8008], [0.2227], [0.5352], [0.5703], [0.2295], [0.3750], [0.5859], [0.1543], [0.0087], [0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.2002], [0.2500], [1.0000], [0.8008], [0.8320], [0.2500], [0.5000], [0.7500], [0.2002], [0.4004], [0.6016], [0.2002], [0.0625], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001922607421875 loss: 0.001007080078125 loss: 0.00079345703125 loss: 0.00091552734375 predicted value: tensor([[0.3750], [0.5234], [0.4375], [0.4219], [0.4023], [0.5586], [0.3711], [0.6133], [0.4043], [0.2402], [0.9531], [0.3770], [0.6133], [0.2715], [0.1973], [0.1553]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5547], [0.4668], [0.4668], [0.4668], [0.5000], [0.2715], [0.7500], [0.4668], [0.2002], [1.0000], [0.4004], [0.6016], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0008697509765625 loss: 0.0008392333984375loss: 0.00077056884765625 loss: 0.004119873046875 predicted value: tensor([[0.4355], [0.4395], [0.4121], [0.7227], [0.9492], [0.3770], [0.9258], [0.6758], [0.6016], [0.2734], [0.5312], [0.1904], [0.2715], [0.1709], [0.1709], [0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.3750], [0.8008], [1.0000], [0.4668], [1.0000], [0.7500], [0.6016], [0.3340], [0.6016], [0.2002], [0.0400], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018768310546875 loss: 0.00095367431640625 loss: 0.00159454345703125 loss: 0.0011749267578125 predicted value: tensor([[0.7734], [0.3965], [0.6914], [0.7695], [0.2812], [0.3984], [0.9297], [0.9453], [0.2432], [0.5898], [0.4277], [0.2061], [0.0986], [0.1592], [0.1826], [0.1543]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.8008], [0.8008], [0.2002], 
[0.4668], [1.0000], [1.0000], [0.3340], [0.6016], [0.5000], [0.0625], [0.0400], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00121307373046875 loss: 0.00090789794921875 loss: 0.000743865966796875 loss: 0.000640869140625 47%|████▋ | 232/492 [2:06:06<2:25:05, 33.48s/it] {'loss': 0.0051, 'learning_rate': 1e-05, 'epoch': 0.47} 47%|████▋ | 232/492 [2:06:06<2:25:05, 33.48s/it]predicted value: tensor([[0.5430], [0.8281], [0.5430], [1.0234], [1.0469], [0.8281], [0.5820], [1.0312], [1.0469], [0.4785], [0.8320], [0.4590], [0.5820], [0.2988], [0.2422], [0.2363]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4004], [0.8008], [0.4668], [1.0000], [1.0000], [0.8008], [0.6680], [1.0000], [1.0000], [0.4004], [0.8008], [0.3340], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001861572265625 loss: 0.002410888671875 loss: 0.001312255859375 loss: 0.00244140625 predicted value: tensor([[0.5156], [0.5430], [0.3320], [0.3594], [0.6133], [0.7617], [0.3496], [0.3184], [1.0312], [0.5859], [0.6055], [0.6484], [1.0469], [0.5078], [0.5273], [0.2334]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.2500], [0.3340], [0.6172], [0.7500], [0.2500], [0.2002], [1.0000], [0.5000], [0.2500], [0.6016], [1.0000], [0.5000], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003387451171875loss: 0.00106048583984375 loss: 0.001190185546875 loss: 0.0022735595703125 predicted value: tensor([[0.8945], [0.5547], [0.8047], [0.6914], [0.9062], [1.0234], [0.3301], [0.6875], [0.8906], [0.6836], [0.3594], [0.4668], [0.4668], [0.6602], [0.2480], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.7148], [0.5547], [0.8320], [1.0000], [0.3340], [0.5703], [0.8320], [0.5000], [0.2500], [0.4004], [0.3340], [0.6016], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015716552734375 loss: 
0.002105712890625loss: 0.0021820068359375 loss: 0.002532958984375 predicted value: tensor([[1.0391], [1.0391], [0.8125], [0.5625], [0.7656], [0.5547], [0.6758], [1.0703], [0.3691], [1.0469], [0.3574], [1.0703], [0.4941], [0.6406], [0.2490], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.8008], [0.4668], [0.8008], [0.4668], [0.6680], [1.0000], [0.2500], [1.0000], [0.2500], [1.0000], [0.4004], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000942230224609375 loss: 0.0011749267578125 loss: 0.00101470947265625 loss: 0.00115966796875 47%|████▋ | 233/492 [2:06:38<2:23:36, 33.27s/it] {'loss': 0.0072, 'learning_rate': 1e-05, 'epoch': 0.47} 47%|████▋ | 233/492 [2:06:38<2:23:36, 33.27s/it]predicted value: tensor([[0.5312], [0.8828], [0.4492], [1.0469], [0.5000], [0.5078], [0.3711], [1.0391], [0.5469], [0.5312], [0.6484], [0.7578], [0.4355], [0.2500], [0.2500], [0.2246]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.4668], [1.0000], [0.4668], [0.4668], [0.2500], [1.0000], [0.3750], [0.4004], [0.6016], [0.8008], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00101470947265625 loss: 0.001617431640625 loss: 0.00133514404296875 loss: 0.00145721435546875 predicted value: tensor([[0.5938], [0.4863], [0.4512], [0.8672], [1.0703], [0.3242], [0.6914], [0.6211], [0.7578], [0.2812], [0.4355], [0.4395], [0.4414], [0.4844], [0.2441], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5312], [0.4668], [0.4668], [0.8320], [1.0000], [0.3340], [0.7500], [0.5000], [0.8008], [0.2002], [0.3340], [0.3340], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00104522705078125 loss: 0.0015716552734375 loss: 0.001190185546875 loss: 0.000701904296875 predicted value: tensor([[0.4375], [0.4688], [1.0312], [0.4746], [0.8125], [0.7031], [0.3477], [0.3594], 
[0.8242], [1.0391], [0.4395], [0.5586], [0.4453], [0.4980], [0.2061], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.3750], [0.8008], [0.7500], [0.3340], [0.3340], [0.7500], [1.0000], [0.3340], [0.4668], [0.2852], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000759124755859375 loss: 0.0010528564453125loss: 0.00262451171875 loss: 0.0018463134765625 predicted value: tensor([[0.6211], [0.4473], [1.0312], [0.5859], [0.4551], [1.0391], [0.5156], [0.6758], [0.6680], [0.7109], [1.0469], [0.1689], [1.0547], [0.4180], [0.2500], [0.2207]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.3145], [1.0000], [0.6680], [0.4668], [1.0000], [0.3750], [0.6016], [0.6016], [0.8008], [1.0000], [0.0400], [1.0000], [0.3340], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003631591796875 loss: 0.00148773193359375 loss: 0.001953125 loss: 0.0016326904296875 48%|████▊ | 234/492 [2:07:11<2:22:06, 33.05s/it] {'loss': 0.0062, 'learning_rate': 1e-05, 'epoch': 0.48} 48%|████▊ | 234/492 [2:07:11<2:22:06, 33.05s/it]predicted value: tensor([[0.4844], [0.6680], [0.7148], [0.3691], [0.9492], [0.7461], [0.4746], [0.7695], [0.4473], [0.2598], [0.5508], [0.6406], [0.4004], [0.1406], [0.1445], [0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.6680], [0.4668], [1.0000], [0.8008], [0.2500], [0.6680], [0.6016], [0.3340], [0.6016], [0.7500], [0.5000], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000843048095703125 loss: 0.00115966796875 loss: 0.0028533935546875 loss: 0.0007781982421875 predicted value: tensor([[0.3867], [0.9688], [0.7656], [0.6992], [0.9609], [0.4141], [0.4121], [0.6523], [0.5938], [0.2334], [0.5195], [0.2334], [0.1963], [0.3457], [0.1963], [0.1484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8320], 
[0.8008], [1.0000], [0.7500], [0.4668], [0.6016], [0.6016], [0.2002], [0.6016], [0.2002], [0.0400], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0006256103515625 loss: 0.0032806396484375loss: 0.00159454345703125 loss: 0.002532958984375 predicted value: tensor([[0.3984], [0.3906], [0.7578], [0.2295], [0.3984], [0.2295], [0.6289], [0.6211], [0.6914], [0.4004], [0.9688], [0.1011], [0.4688], [0.1836], [0.1367], [0.1484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8008], [0.2500], [0.4668], [0.2500], [0.7500], [0.6016], [0.8008], [0.4004], [1.0000], [0.0625], [0.4004], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00101470947265625 loss: 0.00151824951171875 loss: 0.00103759765625 loss: 0.00174713134765625 predicted value: tensor([[0.3945], [0.6602], [0.6953], [0.4082], [0.9727], [0.9180], [0.6523], [0.7188], [0.9727], [0.6172], [0.3066], [0.2734], [0.6133], [0.3613], [0.1377], [0.1367]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [0.8320], [0.4668], [1.0000], [1.0000], [0.6016], [0.7500], [1.0000], [0.6016], [0.2500], [0.2002], [0.6016], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001373291015625 loss: 0.000850677490234375 loss: 0.00164794921875 loss: 0.001068115234375 48%|████▊ | 235/492 [2:07:44<2:21:56, 33.14s/it] {'loss': 0.006, 'learning_rate': 1e-05, 'epoch': 0.48} 48%|████▊ | 235/492 [2:07:44<2:21:56, 33.14s/it]predicted value: tensor([[0.5430], [0.9648], [0.5859], [0.7500], [0.7852], [0.9688], [0.3203], [0.5977], [0.2695], [0.9570], [0.6406], [0.5586], [0.0664], [0.0029], [0.3516], [0.1650]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [1.0000], [0.5000], [0.8008], [0.8008], [1.0000], [0.3340], [0.6016], [0.2500], [1.0000], [0.6016], [0.6016], [0.0400], [0.0400], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.000820159912109375 loss: 0.00107574462890625 loss: 0.0005645751953125 loss: 0.00060272216796875 predicted value: tensor([[0.7461], [0.9648], [0.5195], [0.4609], [0.1572], [0.9609], [0.5977], [0.6367], [0.9922], [0.9961], [1.0156], [0.3711], [0.3613], [0.1426], [0.1387], [0.1680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [0.5547], [0.5547], [0.2500], [1.0000], [0.6016], [0.6016], [1.0000], [1.0000], [1.0000], [0.0278], [0.4004], [0.1670], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012359619140625loss: 0.0025787353515625 loss: 0.002410888671875 loss: 0.0021514892578125 predicted value: tensor([[0.3496], [0.9453], [0.4160], [0.9336], [0.6836], [0.3711], [0.7656], [0.3086], [0.4238], [1.0078], [0.7773], [0.4238], [0.4043], [0.1758], [0.1670], [0.1396]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [1.0000], [0.6680], [0.4668], [0.8008], [0.3340], [0.5000], [1.0000], [0.8008], [0.4004], [0.3340], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00087738037109375 loss: 0.00089263916015625 loss: 0.0018310546875 loss: 0.00107574462890625 predicted value: tensor([[0.7500], [0.5039], [0.3887], [0.3613], [0.3008], [0.7539], [0.2217], [0.5664], [0.5352], [0.0378], [0.4297], [0.4023], [0.3652], [0.1650], [0.1670], [0.1436]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6016], [0.3750], [0.3750], [0.4668], [0.8008], [0.3340], [0.6016], [0.5000], [0.0278], [0.5000], [0.5000], [0.4004], [0.1670], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004608154296875 loss: 0.0019683837890625 loss: 0.0020904541015625 loss: 0.00164794921875 48%|████▊ | 236/492 [2:08:17<2:21:05, 33.07s/it] {'loss': 0.0066, 'learning_rate': 1e-05, 'epoch': 0.48} 48%|████▊ | 236/492 [2:08:17<2:21:05, 33.07s/it]predicted value: tensor([[0.4883], [0.8125], [0.4492], [0.5078], [0.7812], [0.4277], 
[0.6602], [0.7969], [0.3008], [0.4824], [1.0547], [0.6562], [0.2773], [0.5000], [0.2480], [0.2402]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.3750], [0.4668], [0.8008], [0.4668], [0.6016], [0.8008], [0.2500], [0.3750], [1.0000], [0.6016], [0.2002], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002838134765625 loss: 0.0008087158203125 loss: 0.000614166259765625 loss: 0.00125885009765625 predicted value: tensor([[0.5703], [0.4434], [0.3555], [1.0625], [1.0625], [0.8711], [0.6367], [0.7734], [0.2793], [1.0625], [0.5703], [0.6094], [0.4570], [0.3770], [0.2695], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.3340], [1.0000], [1.0000], [0.8320], [0.5000], [0.8008], [0.2002], [1.0000], [0.6016], [0.5000], [0.5000], [0.3340], [0.2500], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002105712890625 loss: 0.0009613037109375loss: 0.00125885009765625 loss: 0.0034637451171875 predicted value: tensor([[0.7539], [1.0391], [0.5078], [0.5156], [0.7695], [0.8516], [0.6914], [0.7148], [0.3008], [0.7656], [0.8359], [0.6328], [0.6406], [0.5078], [0.4453], [0.0771]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [1.0000], [0.4668], [0.3750], [0.7500], [0.8008], [0.7500], [0.6016], [0.3340], [0.7500], [0.8008], [0.4668], [0.2500], [0.4004], [0.4004], [0.0625]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001251220703125 loss: 0.0029754638671875 loss: 0.0038909912109375 loss: 0.004119873046875 predicted value: tensor([[0.5625], [0.4590], [1.0547], [0.8398], [0.7891], [0.6758], [0.3555], [0.6445], [0.3281], [0.6680], [0.4609], [0.4883], [0.6680], [0.4570], [0.2520], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [1.0000], [0.8008], [0.7500], [0.8320], [0.3340], [0.7500], [0.3340], [0.6016], [0.3340], [0.5000], [0.5000], [0.4004], [0.2500], [0.2002]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.0021514892578125 loss: 0.00180816650390625 loss: 0.00154876708984375 loss: 0.001556396484375 48%|████▊ | 237/492 [2:08:49<2:19:15, 32.77s/it] {'loss': 0.0082, 'learning_rate': 1e-05, 'epoch': 0.48} 48%|████▊ | 237/492 [2:08:49<2:19:15, 32.77s/it]predicted value: tensor([[0.4258], [1.0391], [0.8672], [0.2734], [0.7617], [0.3672], [0.7031], [0.5117], [0.3438], [0.7461], [0.4199], [0.6992], [0.4688], [0.3887], [0.2432], [0.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.8320], [0.2500], [0.8008], [0.2002], [0.6016], [0.4668], [0.2500], [0.7500], [0.3340], [0.7500], [0.5000], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00139617919921875 loss: 0.00121307373046875 loss: 0.0016326904296875 loss: 0.001312255859375 predicted value: tensor([[0.4121], [1.0469], [0.4980], [0.5977], [0.4902], [0.5195], [0.7109], [0.4297], [0.6328], [0.8047], [0.4219], [0.4863], [0.4863], [0.2754], [0.2500], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [0.5547], [0.4668], [0.3750], [0.7500], [0.4668], [0.5000], [0.8008], [0.3340], [0.5000], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00125885009765625 loss: 0.00109100341796875loss: 0.001220703125 loss: 0.001983642578125 predicted value: tensor([[0.6211], [0.5078], [0.7734], [0.5078], [0.4922], [0.7188], [0.4785], [0.8828], [0.3457], [1.0547], [0.3203], [0.3594], [0.6094], [1.0547], [0.5195], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8008], [0.4668], [0.4668], [0.6680], [0.3750], [0.8008], [0.2500], [1.0000], [0.2500], [0.3340], [0.6016], [1.0000], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003326416015625 loss: 0.0013885498046875 loss: 0.0017547607421875 loss: 0.000804901123046875 predicted value: tensor([[0.2432], 
[1.0469], [0.4785], [1.0625], [0.4707], [0.6172], [0.4375], [0.5977], [0.4121], [0.7148], [0.6641], [0.4824], [0.6953], [0.3418], [0.6211], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.1670], [1.0000], [0.4668], [1.0000], [0.4668], [0.5547], [0.3750], [0.5547], [0.3340], [0.7500], [0.7500], [0.4004], [0.6016], [0.0278], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00168609619140625 loss: 0.00469970703125 loss: 0.00150299072265625 loss: 0.002105712890625 48%|████▊ | 238/492 [2:09:21<2:17:11, 32.41s/it] {'loss': 0.0071, 'learning_rate': 1e-05, 'epoch': 0.48} 48%|████▊ | 238/492 [2:09:21<2:17:11, 32.41s/it]predicted value: tensor([[0.9922], [0.9766], [0.3809], [0.4102], [0.3848], [0.1582], [0.4688], [0.1689], [0.9570], [0.3594], [0.4590], [0.6094], [0.1768], [0.1904], [0.1592], [0.1992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.3750], [0.4668], [0.4668], [0.2500], [0.4668], [0.2002], [1.0000], [0.4668], [0.3340], [0.7500], [0.2002], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000812530517578125 loss: 0.002227783203125 loss: 0.001251220703125 loss: 0.0019989013671875 predicted value: tensor([[0.9727], [0.1963], [0.9922], [0.3359], [0.4121], [0.4570], [0.4023], [0.5898], [0.1982], [0.6055], [0.3008], [0.3770], [0.1895], [0.1621], [0.2695], [0.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [1.0000], [0.4668], [0.4668], [0.3340], [0.4668], [0.6680], [0.3340], [0.7500], [0.4004], [0.5000], [0.2500], [0.2002], [0.2500], [0.6680]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027618408203125 loss: 0.0017547607421875 loss: 0.0026397705078125 loss: 0.0019683837890625 predicted value: tensor([[0.3652], [0.5195], [0.6680], [0.3711], [0.1147], [0.8047], [0.9531], [0.5039], [0.6094], [0.9648], [0.5391], [0.4062], [0.7344], [0.2559], [0.3691], [0.1885]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8008], [0.4668], [0.2002], [0.8008], [1.0000], [0.4668], [0.4668], [1.0000], [0.5000], [0.5000], [0.8008], [0.3340], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020751953125 loss: 0.00182342529296875 loss: 0.0014495849609375 loss: 0.0010223388671875 predicted value: tensor([[0.4863], [0.3828], [0.5000], [0.4902], [0.4668], [0.9570], [0.7383], [0.5742], [0.5430], [0.1187], [0.5273], [0.3770], [0.3438], [0.3516], [0.1953], [0.1992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.4668], [0.5547], [0.5547], [0.5547], [1.0000], [0.8008], [0.7500], [0.6016], [0.1670], [0.5000], [0.5000], [0.4004], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00101470947265625 loss: 0.000804901123046875 loss: 0.0015869140625 loss: 0.001739501953125 49%|████▊ | 239/492 [2:09:53<2:16:18, 32.32s/it] {'loss': 0.0067, 'learning_rate': 1e-05, 'epoch': 0.49} 49%|████▊ | 239/492 [2:09:53<2:16:18, 32.32s/it]predicted value: tensor([[0.3945], [0.7305], [0.9727], [0.1963], [0.3750], [0.2227], [0.5117], [0.7109], [0.5898], [0.2324], [0.1719], [0.9609], [0.4238], [0.3848], [0.1982], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [1.0000], [0.2500], [0.3145], [0.3340], [0.5000], [0.8008], [0.6016], [0.2002], [0.2500], [1.0000], [0.5000], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0022125244140625 loss: 0.0035247802734375 loss: 0.000873565673828125 loss: 0.0019683837890625 predicted value: tensor([[0.4785], [0.4297], [0.2275], [0.4805], [0.3398], [0.4766], [0.5938], [0.4473], [0.2217], [0.4434], [0.4004], [0.9688], [0.6016], [0.9805], [0.1826], [0.1748]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.2500], [0.5547], [0.4668], [0.5000], [0.7500], [0.2500], [0.2002], [0.5000], [0.5000], [1.0000], 
[0.7500], [1.0000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018157958984375 loss: 0.0020751953125 loss: 0.00262451171875 loss: 0.00592041015625 predicted value: tensor([[0.3867], [0.1895], [0.2344], [0.7539], [0.3945], [0.9727], [0.6133], [0.5117], [0.6250], [0.4883], [0.6016], [0.3809], [0.3496], [0.4102], [0.2080], [0.2080]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.2500], [0.8008], [0.4668], [1.0000], [0.4668], [0.5000], [0.7500], [0.4668], [0.7500], [0.5000], [0.4004], [0.6680], [0.0908], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00167083740234375 loss: 0.0027618408203125 loss: 0.0012664794921875 loss: 0.00140380859375 predicted value: tensor([[0.4121], [0.4160], [0.9375], [0.4082], [0.2314], [0.5742], [0.3906], [0.5117], [0.6797], [0.5273], [0.0126], [0.3770], [0.5664], [0.2090], [0.1787], [0.2217]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.4668], [0.2500], [0.8008], [0.4668], [0.6016], [0.6680], [0.5000], [0.0400], [0.4004], [0.6016], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00177001953125 loss: 0.0012359619140625 loss: 0.00131988525390625loss: 0.0016326904296875 49%|████▉ | 240/492 [2:10:25<2:15:13, 32.19s/it] {'loss': 0.0085, 'learning_rate': 1e-05, 'epoch': 0.49} 49%|████▉ | 240/492 [2:10:25<2:15:13, 32.19s/it]predicted value: tensor([[0.6289], [1.0703], [0.7305], [0.4883], [0.3828], [0.7188], [0.6914], [0.8594], [1.0469], [0.2441], [1.0391], [0.4629], [0.4707], [0.2754], [0.4570], [0.3027]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.7500], [0.4668], [0.3340], [0.4668], [0.6016], [0.8008], [1.0000], [0.2002], [1.0000], [0.4004], [0.3340], [0.2002], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0008392333984375 loss: 0.0030059814453125 loss: 0.00201416015625 loss: 0.00225830078125 predicted 
value: tensor([[0.7734], [0.3828], [0.3203], [0.5352], [0.6680], [1.0703], [0.3301], [0.5078], [1.0156], [0.6094], [0.4434], [0.4863], [0.4219], [0.3223], [0.2930], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.3340], [0.2500], [0.4668], [0.6016], [1.0000], [0.2500], [0.4668], [1.0000], [0.5000], [0.5000], [0.5000], [0.5000], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00162506103515625 loss: 0.00191497802734375 loss: 0.0011749267578125 loss: 0.0012664794921875 predicted value: tensor([[0.4473], [0.8711], [0.4941], [0.8906], [0.4961], [1.0625], [1.0312], [0.2988], [0.6055], [1.0391], [1.0312], [0.6328], [0.4375], [0.4297], [0.2754], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.3750], [0.8320], [0.3750], [1.0000], [1.0000], [0.2002], [0.5000], [1.0000], [1.0000], [0.6016], [0.5000], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00469970703125 loss: 0.0016021728515625 loss: 0.00116729736328125 loss: 0.001922607421875 predicted value: tensor([[1.0312], [0.9062], [0.7695], [0.6094], [0.2471], [0.4941], [0.8594], [0.4688], [0.5117], [0.5664], [0.4551], [0.6562], [0.4590], [0.3047], [0.4648], [0.2949]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.8320], [0.6016], [0.2002], [0.4668], [0.8008], [0.3750], [0.4668], [0.3145], [0.5000], [0.6016], [0.3340], [0.2500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00113677978515625 loss: 0.0012359619140625 loss: 0.00146484375 loss: 0.0019989013671875 49%|████▉ | 241/492 [2:10:57<2:14:00, 32.03s/it] {'loss': 0.0073, 'learning_rate': 1e-05, 'epoch': 0.49} 49%|████▉ | 241/492 [2:10:57<2:14:00, 32.03s/it]predicted value: tensor([[0.6367], [1.0547], [0.5117], [0.4980], [0.4902], [0.7656], [0.6172], [0.6367], [1.0234], [0.4395], [0.3418], [0.4258], [0.4121], [0.4746], [0.2754], [0.2617]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.4668], [0.3750], [0.6680], [0.6016], [0.6016], [1.0000], [0.3340], [0.6016], [0.5000], [0.6016], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0005950927734375 loss: 0.002532958984375loss: 0.004150390625 loss: 0.000873565673828125 predicted value: tensor([[0.6211], [0.5703], [0.5078], [0.8008], [0.2910], [0.5156], [0.3379], [0.6211], [0.8320], [0.4668], [1.0312], [0.5508], [0.4258], [0.2871], [0.2969], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [0.8320], [0.2500], [0.3750], [0.2500], [0.6016], [0.8320], [0.4004], [1.0000], [0.7500], [0.4004], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00102996826171875 loss: 0.0014801025390625 loss: 0.00127410888671875 loss: 0.00150299072265625 predicted value: tensor([[0.3164], [0.5195], [0.8242], [0.8438], [1.0234], [1.0547], [0.3418], [0.3164], [0.4844], [1.0469], [0.5430], [0.4121], [0.4727], [0.2695], [0.2559], [0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.4668], [0.8320], [0.8008], [1.0000], [1.0000], [0.2500], [0.2500], [0.4004], [1.0000], [0.6016], [0.4004], [0.1670], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00152587890625 loss: 0.00238037109375 loss: 0.0013580322265625 loss: 0.005157470703125 predicted value: tensor([[0.4844], [0.5273], [0.7969], [0.5078], [1.0469], [0.4277], [0.5000], [0.3984], [0.3652], [0.5430], [0.3672], [0.4043], [0.4785], [0.2969], [0.2891], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8008], [0.4668], [1.0000], [0.3145], [0.4668], [0.3340], [0.3340], [0.6016], [0.3340], [0.5000], [0.5000], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000965118408203125 loss: 0.00160980224609375 loss: 
0.00110626220703125 loss: 0.002105712890625 49%|████▉ | 242/492 [2:11:28<2:13:16, 31.99s/it] {'loss': 0.0074, 'learning_rate': 1e-05, 'epoch': 0.49} 49%|████▉ | 242/492 [2:11:28<2:13:16, 31.99s/it]predicted value: tensor([[0.9805], [0.2227], [0.7148], [0.6094], [0.2754], [0.6562], [0.3730], [0.1836], [0.7227], [0.5078], [0.9531], [0.3789], [0.4844], [0.3770], [0.1631], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.8008], [0.6680], [0.2500], [0.7500], [0.4668], [0.2002], [0.8008], [0.6016], [1.0000], [0.5000], [0.6016], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001678466796875 loss: 0.00144195556640625loss: 0.003631591796875 loss: 0.001861572265625 predicted value: tensor([[0.4570], [0.4453], [0.4082], [0.7070], [0.9375], [0.7148], [0.2832], [0.5586], [0.6680], [0.5234], [0.2197], [0.4688], [0.3125], [0.6055], [0.3574], [0.1836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.6250], [1.0000], [0.6680], [0.2500], [0.7500], [0.8008], [0.5000], [0.2500], [0.5000], [0.6016], [0.7500], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004364013671875 loss: 0.0020904541015625 loss: 0.0027923583984375 loss: 0.00115203857421875 predicted value: tensor([[0.3887], [0.7227], [0.2139], [0.4883], [0.7695], [0.9688], [0.2080], [0.3945], [0.9141], [0.4629], [0.9570], [0.4316], [0.4141], [0.3730], [0.3145], [0.0806]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.1670], [0.5000], [0.8008], [1.0000], [0.2500], [0.3750], [1.0000], [0.4668], [1.0000], [0.5000], [0.4004], [0.5000], [0.5000], [0.0278]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00139617919921875 loss: 0.00286865234375 loss: 0.001251220703125 loss: 0.004608154296875 predicted value: tensor([[0.9727], [0.7500], [0.4375], [0.9531], [0.2617], [0.2246], [0.6641], [0.2422], [0.4004], [0.5000], [0.4062], 
[0.3984], [0.0249], [0.4316], [0.1729], [0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.3750], [1.0000], [0.3340], [0.2500], [0.6680], [0.3340], [0.5703], [0.6016], [0.8008], [0.5000], [0.0625], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002105712890625 loss: 0.0038299560546875loss: 0.000873565673828125 loss: 0.00116729736328125 49%|████▉ | 243/492 [2:12:00<2:12:25, 31.91s/it] {'loss': 0.0093, 'learning_rate': 1e-05, 'epoch': 0.49} 49%|████▉ | 243/492 [2:12:00<2:12:25, 31.91s/it]predicted value: tensor([[0.9727], [0.3711], [0.4316], [0.3828], [0.9609], [0.6328], [0.9531], [0.2520], [0.3848], [0.5625], [0.1982], [0.2969], [0.5859], [0.3340], [0.2256], [0.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.4668], [1.0000], [0.6680], [1.0000], [0.3340], [0.6016], [0.5000], [0.2002], [0.3340], [0.6016], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00106048583984375 loss: 0.0014495849609375 loss: 0.00138092041015625 loss: 0.00555419921875 predicted value: tensor([[0.4648], [0.7031], [0.9844], [0.4766], [0.6914], [0.9297], [0.4297], [0.9570], [0.6523], [0.6602], [0.3809], [0.6875], [0.5078], [0.5859], [0.1621], [0.1787]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.7148], [1.0000], [0.4668], [0.8008], [1.0000], [0.5000], [1.0000], [0.8008], [0.6680], [0.3340], [0.8008], [0.6016], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000904083251953125 loss: 0.001129150390625 loss: 0.00122833251953125 loss: 0.0029754638671875 predicted value: tensor([[0.4395], [0.4824], [0.7266], [0.7227], [0.4805], [0.2598], [0.6211], [0.9531], [0.3027], [0.9531], [0.3047], [0.4551], [0.2949], [0.3320], [0.2051], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8008], [0.8008], [0.5000], 
[0.3340], [0.8555], [1.0000], [0.2500], [1.0000], [0.3340], [0.7500], [0.2500], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011444091796875 loss: 0.0027313232421875 loss: 0.00173187255859375 loss: 0.004302978515625 predicted value: tensor([[0.7422], [0.2051], [0.9648], [0.3672], [0.4004], [0.2773], [0.6562], [0.3887], [0.5469], [0.5586], [0.4121], [0.3496], [0.3594], [0.3027], [0.2041], [0.4531]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.2002], [1.0000], [0.3750], [0.4668], [0.3340], [0.8008], [0.3750], [0.6016], [0.5000], [0.4668], [0.5000], [0.4004], [0.3340], [0.2500], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001129150390625 loss: 0.002777099609375 loss: 0.00213623046875 loss: 0.003692626953125 50%|████▉ | 244/492 [2:12:32<2:11:33, 31.83s/it] {'loss': 0.0088, 'learning_rate': 1e-05, 'epoch': 0.5} 50%|████▉ | 244/492 [2:12:32<2:11:33, 31.83s/it]predicted value: tensor([[0.8438], [0.7734], [0.5664], [0.7461], [0.7461], [1.0234], [0.6797], [0.8477], [0.7266], [1.0469], [0.6836], [0.4863], [0.4160], [0.2773], [0.2656], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.8008], [0.4668], [0.7500], [0.6016], [1.0000], [0.6016], [0.8008], [0.8008], [1.0000], [0.7500], [0.3340], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028228759765625 loss: 0.00160980224609375loss: 0.0019073486328125 loss: 0.001434326171875 predicted value: tensor([[0.4961], [0.9062], [1.0391], [0.5469], [0.5391], [0.4668], [0.4883], [1.0469], [0.7148], [0.7148], [0.6094], [0.5391], [0.4961], [0.4668], [0.2500], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [1.0000], [0.4668], [0.4668], [0.3145], [0.3750], [1.0000], [0.6016], [0.6016], [0.6016], [0.5000], [0.5000], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000804901123046875 loss: 
0.00150299072265625 loss: 0.000759124755859375 loss: 0.001617431640625 predicted value: tensor([[0.3555], [0.7891], [0.4883], [0.6094], [0.8125], [0.8086], [0.5078], [0.6367], [1.0469], [0.6289], [1.0312], [0.5195], [0.3848], [0.2949], [0.2637], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.4668], [0.4668], [0.5547], [0.8008], [0.6680], [0.4668], [0.6016], [1.0000], [0.5000], [1.0000], [0.5000], [0.3340], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0042724609375 loss: 0.001220703125 loss: 0.0026702880859375 loss: 0.00225830078125 predicted value: tensor([[0.8672], [0.3301], [0.6602], [0.2578], [1.0469], [0.6758], [0.3340], [0.7578], [1.0312], [0.5117], [0.5039], [0.5273], [0.5938], [1.0156], [0.2949], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.2500], [0.6016], [0.1670], [1.0000], [0.6016], [0.3340], [0.8008], [1.0000], [0.4668], [0.4668], [0.5000], [0.7500], [1.0000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000904083251953125 loss: 0.0024566650390625 loss: 0.0012054443359375 loss: 0.00099945068359375 50%|████▉ | 245/492 [2:13:03<2:09:53, 31.55s/it] {'loss': 0.0071, 'learning_rate': 1e-05, 'epoch': 0.5} 50%|████▉ | 245/492 [2:13:03<2:09:53, 31.55s/it]predicted value: tensor([[0.6445], [1.0547], [1.0469], [0.4961], [0.7812], [0.7695], [1.0312], [1.0469], [1.0391], [1.0469], [0.7852], [0.4141], [0.4355], [0.2598], [0.2344], [0.2451]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [1.0000], [1.0000], [0.4668], [0.6680], [0.6680], [1.0000], [1.0000], [1.0000], [1.0000], [0.8008], [0.4004], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003265380859375 loss: 0.00084686279296875loss: 0.00189971923828125 loss: 0.00113677978515625 predicted value: tensor([[0.6055], [0.6172], [0.4883], [1.0469], [0.7656], [1.0312], [1.0469], [0.5898], 
[0.8281], [0.8008], [1.0312], [0.3828], [1.0234], [0.4141], [0.2119], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.5547], [0.4668], [1.0000], [0.6680], [1.0000], [1.0000], [0.5000], [0.8008], [0.8008], [1.0000], [0.2500], [1.0000], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00116729736328125 loss: 0.0015106201171875 loss: 0.004119873046875 loss: 0.00098419189453125 predicted value: tensor([[0.5195], [0.5078], [0.6367], [0.5039], [0.4902], [0.5664], [0.5859], [0.2988], [0.4062], [0.3047], [0.4785], [0.5859], [0.4941], [0.3809], [0.1187], [0.2373]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.5547], [0.4668], [0.4668], [0.8008], [0.6016], [0.2500], [0.2500], [0.2500], [0.2002], [0.6016], [0.4004], [0.6016], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115966796875 loss: 0.002227783203125 loss: 0.00592041015625 loss: 0.00164794921875 predicted value: tensor([[0.4785], [0.8711], [1.0312], [0.7852], [1.0312], [0.7852], [0.7930], [1.0469], [0.5273], [0.2852], [1.0469], [0.7656], [0.6953], [0.2158], [0.2373], [0.2393]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [1.0000], [0.8008], [1.0000], [0.6680], [0.8008], [1.0000], [0.3750], [0.2002], [1.0000], [0.7500], [0.5000], [0.1670], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024566650390625 loss: 0.00112152099609375 loss: 0.0020599365234375 loss: 0.00177764892578125 50%|█████ | 246/492 [2:13:35<2:09:47, 31.66s/it] {'loss': 0.0083, 'learning_rate': 1e-05, 'epoch': 0.5} 50%|█████ | 246/492 [2:13:35<2:09:47, 31.66s/it]predicted value: tensor([[0.5430], [0.4023], [0.7617], [0.2344], [0.6875], [0.7383], [0.3633], [0.9297], [0.6445], [0.6914], [0.3105], [0.3906], [0.6055], [0.5234], [0.1680], [0.1426]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.8008], 
[0.2500], [0.3750], [0.8008], [0.3145], [1.0000], [0.7500], [0.8008], [0.3340], [0.5000], [0.7500], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00186920166015625 loss: 0.00098419189453125 loss: 0.00445556640625 loss: 0.0023956298828125 predicted value: tensor([[0.4395], [0.4297], [0.7617], [0.6758], [0.9609], [0.5938], [0.5781], [0.9766], [0.5117], [0.9688], [0.5156], [0.3574], [0.5664], [0.3906], [0.1138], [0.1494]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [0.6680], [1.0000], [0.6016], [0.7500], [1.0000], [0.6016], [1.0000], [0.5000], [0.5000], [0.4668], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000732421875 loss: 0.00140380859375loss: 0.00070953369140625 loss: 0.0013580322265625 predicted value: tensor([[0.2695], [0.8320], [0.7812], [0.5469], [0.7305], [0.6367], [0.6328], [0.2061], [0.6484], [0.6914], [0.5273], [0.6211], [0.3672], [0.9492], [0.3555], [0.1592]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.8555], [0.8320], [0.7500], [0.8008], [0.6680], [0.7500], [0.2500], [0.7500], [0.8008], [0.6016], [0.6016], [0.3340], [1.0000], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00186920166015625 loss: 0.00164794921875loss: 0.001190185546875 loss: 0.00089263916015625 predicted value: tensor([[0.6719], [0.5391], [0.8242], [0.9453], [0.2324], [0.4277], [0.6719], [0.7461], [0.1973], [0.9648], [0.3516], [0.4648], [0.6758], [0.3359], [0.2021], [0.1426]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.5547], [0.8320], [1.0000], [0.3340], [0.4668], [0.8008], [0.6680], [0.2500], [1.0000], [0.2500], [0.5000], [0.4668], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00084686279296875 loss: 0.00167083740234375 loss: 0.001312255859375 loss: 0.00185394287109375 50%|█████ | 247/492 [2:14:06<2:09:24, 31.69s/it] {'loss': 0.0063, 
'learning_rate': 1e-05, 'epoch': 0.5} 50%|█████ | 247/492 [2:14:06<2:09:24, 31.69s/it]predicted value: tensor([[0.5078], [0.8203], [0.4199], [0.9844], [1.0000], [0.4043], [1.0000], [0.5586], [0.6445], [0.9883], [0.3457], [0.3672], [0.3809], [0.1924], [0.1904], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8203], [0.8320], [0.3750], [1.0000], [1.0000], [0.4668], [1.0000], [0.6016], [0.6016], [1.0000], [0.4004], [0.4004], [0.4004], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00225830078125 loss: 0.000911712646484375 loss: 0.0019378662109375 loss: 0.0030517578125 predicted value: tensor([[0.5156], [0.3965], [0.1953], [0.2773], [0.5977], [0.4297], [0.3535], [0.2480], [0.6758], [0.9844], [0.6445], [0.1865], [0.6016], [0.2617], [0.1562], [0.1650]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.2500], [0.3340], [0.5703], [0.4668], [0.3750], [0.2500], [0.8008], [1.0000], [0.7500], [0.5000], [0.7500], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000789642333984375 loss: 0.0025787353515625 loss: 0.000579833984375 loss: 0.0019989013671875 predicted value: tensor([[0.4375], [0.3750], [0.3496], [0.2412], [0.7188], [0.6055], [0.1982], [0.3809], [0.6211], [0.9883], [0.0298], [0.6680], [0.3906], [0.3340], [0.3691], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3145], [0.2500], [0.8008], [0.5547], [0.2002], [0.4668], [0.6016], [1.0000], [0.0278], [0.8008], [0.5000], [0.3340], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0008392333984375 loss: 0.000911712646484375 loss: 0.0005035400390625 loss: 0.0008697509765625 predicted value: tensor([[0.5117], [0.3926], [0.6641], [0.9766], [0.3945], [0.4473], [0.2559], [0.9766], [0.9336], [0.6055], [0.7578], [0.4004], [0.2988], [0.3672], [0.1377], [0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value 
label: tensor([[0.5547], [0.4668], [0.6680], [1.0000], [0.3750], [0.3750], [0.2002], [1.0000], [1.0000], [0.7500], [0.8008], [0.3340], [0.4004], [0.7500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00103759765625 loss: 0.0032958984375 loss: 0.000942230224609375 loss: 0.003814697265625 50%|█████ | 248/492 [2:14:38<2:08:59, 31.72s/it] {'loss': 0.0066, 'learning_rate': 1e-05, 'epoch': 0.5} 50%|█████ | 248/492 [2:14:38<2:08:59, 31.72s/it]predicted value: tensor([[0.5820], [1.0625], [0.5586], [0.6094], [1.0625], [1.0703], [0.3516], [1.0703], [0.5859], [0.3633], [0.6289], [1.0391], [0.4863], [0.2852], [0.2295], [0.2373]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [1.0000], [0.3750], [0.8008], [1.0000], [1.0000], [0.2500], [1.0000], [0.6016], [0.3340], [0.5000], [1.0000], [0.4004], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00213623046875 loss: 0.0022125244140625loss: 0.0016632080078125 loss: 0.003387451171875 predicted value: tensor([[0.6328], [0.4609], [0.5117], [0.7227], [0.8867], [1.0469], [1.0703], [0.3223], [0.7227], [0.8516], [0.4980], [0.4863], [0.5039], [0.4922], [0.2207], [0.2451]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.3750], [0.4668], [0.6016], [0.8320], [1.0000], [1.0000], [0.2500], [0.7500], [0.8008], [0.3340], [0.2500], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029754638671875 loss: 0.0022735595703125 loss: 0.00103759765625 loss: 0.00139617919921875 predicted value: tensor([[0.5117], [1.0391], [0.4980], [0.7773], [0.2734], [1.0781], [0.7031], [0.7812], [0.2930], [0.6445], [0.5039], [0.5039], [0.4902], [0.2773], [0.2285], [0.2295]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [0.6680], [0.2500], [1.0000], [0.7500], [0.8008], [0.2500], [0.6016], [0.4668], [0.5000], [0.4004], [0.1670], [0.1670], [0.1670]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.00201416015625 loss: 0.0011749267578125loss: 0.0027008056640625 loss: 0.0008392333984375 predicted value: tensor([[0.4766], [0.9102], [0.7930], [1.0703], [0.2812], [0.4785], [1.0625], [0.7812], [0.3223], [0.6523], [0.6445], [1.0703], [0.4590], [0.4375], [0.2266], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.8008], [1.0000], [0.2500], [0.5000], [1.0000], [0.7148], [0.2500], [0.6016], [0.6016], [1.0000], [0.4004], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000728607177734375 loss: 0.0010528564453125 loss: 0.00113677978515625 loss: 0.000823974609375 51%|█████ | 249/492 [2:15:10<2:08:09, 31.65s/it] {'loss': 0.0069, 'learning_rate': 1e-05, 'epoch': 0.51} 51%|█████ | 249/492 [2:15:10<2:08:09, 31.65s/it]predicted value: tensor([[0.8516], [0.8984], [1.0781], [0.5039], [0.4023], [0.6250], [0.8086], [0.1318], [0.7422], [0.9219], [0.6484], [0.6367], [0.6602], [0.4336], [0.2695], [0.2480]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.8320], [1.0000], [0.4668], [0.3340], [0.5547], [0.8008], [0.0625], [0.6016], [0.8320], [0.2500], [0.6016], [0.6016], [0.4004], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0014190673828125 loss: 0.0021209716796875 loss: 0.0035247802734375 loss: 0.0016937255859375 predicted value: tensor([[0.5859], [0.8086], [0.8047], [0.5664], [1.0391], [0.3203], [0.5625], [0.2734], [0.3574], [0.8047], [0.8281], [0.7227], [0.6680], [1.0938], [0.2578], [0.2217]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.7148], [0.8320], [0.4668], [1.0000], [0.2002], [0.5547], [0.2500], [0.2500], [0.7500], [0.8008], [0.7500], [0.6016], [1.0000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002593994140625 loss: 0.0010986328125loss: 0.00176239013671875 loss: 0.00151824951171875 predicted value: tensor([[1.0781], 
[0.5117], [1.0625], [0.5039], [0.3555], [0.5195], [0.6914], [0.8125], [0.2637], [0.4570], [0.5977], [0.4902], [0.5586], [0.7695], [0.4609], [0.2324]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [1.0000], [0.3145], [0.3340], [0.4668], [0.6016], [0.8008], [0.2002], [0.3340], [0.4277], [0.7500], [0.4004], [0.7500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00112152099609375 loss: 0.0034637451171875 loss: 0.0018768310546875 loss: 0.0020294189453125 predicted value: tensor([[1.0547], [1.0547], [1.0781], [0.4219], [0.4453], [0.5195], [0.1108], [0.2676], [0.7344], [0.4922], [0.3984], [0.4941], [0.4902], [0.4688], [0.2354], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [1.0000], [0.3750], [0.2500], [0.4668], [0.0625], [0.1670], [0.6680], [0.4004], [0.2500], [0.4004], [0.7500], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00174713134765625 loss: 0.00141143798828125 loss: 0.004058837890625 loss: 0.0030364990234375 51%|█████ | 250/492 [2:15:41<2:07:14, 31.55s/it] {'loss': 0.0086, 'learning_rate': 1e-05, 'epoch': 0.51} 51%|█████ | 250/492 [2:15:41<2:07:14, 31.55s/it]predicted value: tensor([[0.9883], [0.3867], [0.3711], [0.4727], [0.9922], [0.4746], [0.4473], [0.6055], [0.6797], [0.4043], [0.2031], [0.4570], [0.1709], [0.1338], [0.1592], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.3750], [0.6016], [1.0000], [0.8008], [0.4668], [0.7500], [0.7500], [0.5000], [0.2500], [0.5000], [0.2002], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00057220458984375 loss: 0.0026092529296875loss: 0.0023651123046875 loss: 0.001617431640625 predicted value: tensor([[0.4277], [0.9727], [0.2715], [0.9805], [0.9727], [0.7070], [0.6523], [0.4727], [0.9805], [0.0067], [0.6641], [0.4531], [0.4824], [0.1553], [0.1611], [0.2021]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.2002], [1.0000], [1.0000], [0.8008], [0.7500], [0.5547], [1.0000], [0.0278], [0.6016], [0.4004], [0.5000], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016021728515625 loss: 0.000827789306640625loss: 0.00016498565673828125 loss: 0.00119781494140625 predicted value: tensor([[0.5117], [0.4160], [0.6250], [0.7266], [0.3750], [0.4219], [0.9883], [0.2139], [0.5898], [0.2432], [0.7695], [0.4102], [0.2051], [0.1934], [0.1455], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.5703], [0.8008], [0.4668], [0.4668], [1.0000], [0.2500], [0.6016], [0.3340], [0.8008], [0.4004], [0.2002], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001953125 loss: 0.0007171630859375loss: 0.000637054443359375 loss: 0.00089263916015625 predicted value: tensor([[0.9727], [0.3848], [0.3633], [0.3535], [0.2363], [0.7539], [0.2324], [0.1484], [0.9727], [0.6992], [0.2715], [0.3984], [0.3906], [0.6836], [0.1338], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.4668], [0.2500], [0.8320], [0.2500], [0.2002], [1.0000], [0.8008], [0.6016], [0.5000], [0.3340], [0.8008], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000644683837890625 loss: 0.00104522705078125 loss: 0.0016021728515625 loss: 0.0030059814453125 51%|█████ | 251/492 [2:16:12<2:06:06, 31.39s/it] {'loss': 0.0054, 'learning_rate': 1e-05, 'epoch': 0.51} 51%|█████ | 251/492 [2:16:12<2:06:06, 31.39s/it]predicted value: tensor([[0.4414], [0.7305], [0.2061], [0.4141], [0.5781], [0.2246], [0.9688], [0.9766], [0.2373], [0.3965], [0.4629], [0.9766], [0.3750], [0.3555], [0.1807], [0.1465]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.6680], [0.3340], [0.4668], [0.4668], [0.2500], [1.0000], [1.0000], [0.2500], [0.4668], [0.2500], [1.0000], 
[0.3340], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010528564453125 loss: 0.00193023681640625loss: 0.00045013427734375 loss: 0.000698089599609375 predicted value: tensor([[0.2432], [0.3496], [0.9766], [0.9805], [0.1973], [0.4375], [0.2158], [0.7539], [0.2754], [0.2197], [0.1924], [0.4922], [0.5156], [0.1729], [0.1553], [0.1416]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.3750], [1.0000], [1.0000], [0.2500], [0.4668], [0.2500], [0.8008], [0.2500], [0.2500], [0.2002], [0.5000], [0.5000], [0.2500], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00183868408203125 loss: 0.000316619873046875loss: 0.00127410888671875 loss: 0.000431060791015625 predicted value: tensor([[0.4180], [0.9844], [0.3652], [0.7227], [0.2656], [0.4863], [0.5664], [0.5234], [0.9805], [0.2490], [0.5898], [0.9727], [0.5039], [0.4316], [0.1514], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [0.8008], [0.2500], [0.5547], [0.5000], [0.5000], [1.0000], [0.2500], [0.6016], [1.0000], [0.3340], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00193023681640625 loss: 0.00102996826171875loss: 0.000797271728515625 loss: 0.000774383544921875 predicted value: tensor([[0.3750], [0.9727], [0.9961], [0.8047], [0.7656], [0.2559], [0.6484], [0.1377], [0.5234], [1.0000], [0.7109], [0.3496], [0.3887], [0.4082], [0.1436], [0.1631]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [1.0000], [0.8555], [0.8320], [0.2500], [0.7500], [0.2002], [0.5000], [1.0000], [0.7500], [0.3340], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012054443359375 loss: 0.00148773193359375 loss: 0.000553131103515625 loss: 0.004913330078125 51%|█████ | 252/492 [2:16:43<2:05:18, 31.33s/it] {'loss': 0.0052, 'learning_rate': 1e-05, 'epoch': 0.51} 51%|█████ | 252/492 
[2:16:43<2:05:18, 31.33s/it]predicted value: tensor([[0.5156], [0.5156], [0.4707], [0.3848], [0.6055], [1.0547], [1.0312], [0.7930], [1.0547], [0.7188], [0.4004], [0.4648], [0.4980], [0.4629], [0.2578], [0.4180]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.2500], [0.8008], [1.0000], [1.0000], [0.8008], [1.0000], [0.6016], [0.0625], [0.4004], [0.4004], [0.3340], [0.2002], [0.2852]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012664794921875 loss: 0.001373291015625 loss: 0.003997802734375 loss: 0.004730224609375 predicted value: tensor([[0.4648], [0.6055], [0.8711], [0.2656], [0.7656], [0.5469], [1.0469], [0.6445], [0.4004], [1.0625], [0.6914], [0.4902], [0.5508], [0.2520], [0.2334], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8320], [0.2002], [0.6680], [0.4004], [1.0000], [0.5000], [0.3340], [1.0000], [0.7500], [0.4004], [0.5000], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00131988525390625 loss: 0.00145721435546875loss: 0.00138092041015625 loss: 0.00098419189453125 predicted value: tensor([[0.4941], [0.7383], [0.4551], [0.6250], [0.3145], [0.4863], [0.7500], [0.7383], [0.5273], [0.7656], [0.3457], [0.6992], [0.4980], [0.2422], [0.2891], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.7148], [0.3750], [0.5547], [0.2002], [0.4668], [0.6680], [0.6016], [0.4668], [0.6680], [0.3340], [0.6016], [0.5000], [0.0400], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00121307373046875 loss: 0.00213623046875loss: 0.001556396484375 loss: 0.004241943359375 predicted value: tensor([[0.4961], [0.6836], [0.3809], [0.8555], [0.4805], [1.0391], [1.0547], [0.8984], [0.4473], [1.0547], [0.5508], [1.0547], [0.5039], [0.2715], [0.2637], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5547], [0.3340], [0.8008], [0.4668], 
[1.0000], [1.0000], [0.8320], [0.4668], [1.0000], [0.4004], [1.0000], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00128173828125 loss: 0.0014495849609375 loss: 0.00112152099609375 loss: 0.005706787109375 51%|█████▏ | 253/492 [2:17:15<2:05:02, 31.39s/it] {'loss': 0.0088, 'learning_rate': 1e-05, 'epoch': 0.51} 51%|█████▏ | 253/492 [2:17:15<2:05:02, 31.39s/it]predicted value: tensor([[0.5391], [0.4570], [1.0547], [0.5391], [0.7656], [0.3848], [0.5312], [0.4961], [0.8047], [0.3828], [1.0391], [1.0234], [0.3398], [0.2373], [0.2266], [0.2891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.4668], [0.8008], [0.2500], [0.4668], [0.4668], [0.7500], [0.2500], [1.0000], [1.0000], [0.2002], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00099945068359375 loss: 0.00130462646484375loss: 0.002349853515625 loss: 0.00390625 predicted value: tensor([[0.7148], [0.9336], [0.4980], [0.7969], [0.4805], [0.3340], [0.6992], [0.6758], [0.3066], [0.6797], [0.5547], [0.6562], [0.6680], [0.3887], [0.3086], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.4668], [0.8008], [0.4668], [0.2500], [0.7500], [0.8008], [0.2500], [0.5000], [0.5000], [0.5000], [0.6016], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00168609619140625 loss: 0.0028076171875 loss: 0.0026397705078125 loss: 0.00115203857421875 predicted value: tensor([[0.5195], [0.5547], [1.0391], [1.0234], [0.5117], [0.4844], [1.0312], [0.8164], [0.3574], [0.5781], [0.3340], [0.5234], [0.4922], [0.5000], [0.4961], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [1.0000], [0.4668], [0.4668], [1.0000], [0.8008], [0.3340], [0.5000], [0.3340], [0.5000], [0.4004], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00159454345703125 loss: 
0.0024871826171875 loss: 0.000835418701171875 loss: 0.00174713134765625 predicted value: tensor([[0.5742], [1.0469], [0.4844], [0.5312], [0.3594], [0.6836], [0.3008], [0.8047], [0.5586], [0.5352], [0.7305], [0.5039], [0.7188], [0.4453], [0.5234], [0.3262]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [1.0000], [0.4668], [0.4668], [0.3340], [0.3340], [0.2002], [0.8008], [0.3750], [0.4668], [0.7500], [0.3340], [0.7500], [0.4004], [0.5000], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002777099609375 loss: 0.002197265625 loss: 0.00128936767578125 loss: 0.00396728515625 52%|█████▏ | 254/492 [2:17:46<2:04:42, 31.44s/it] {'loss': 0.0084, 'learning_rate': 1e-05, 'epoch': 0.52} 52%|█████▏ | 254/492 [2:17:46<2:04:42, 31.44s/it]predicted value: tensor([[1.0078], [0.7305], [0.7617], [0.7461], [0.3047], [0.3945], [0.9531], [0.7344], [0.7500], [0.3789], [0.4766], [0.4082], [0.3887], [0.2930], [0.3828], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.8320], [0.8008], [0.3340], [0.4668], [1.0000], [0.8008], [0.8008], [0.4004], [0.6016], [0.2852], [0.4004], [0.0400], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000782012939453125 loss: 0.0020751953125loss: 0.00070953369140625 loss: 0.0028533935546875 predicted value: tensor([[0.4531], [0.4727], [0.6719], [0.9453], [0.5117], [0.7070], [0.9648], [0.4648], [0.5391], [0.2275], [0.2402], [0.9570], [0.4316], [0.2002], [0.0459], [0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [1.0000], [0.5547], [0.6680], [1.0000], [0.7500], [0.6016], [0.2002], [0.2500], [1.0000], [0.3340], [0.3340], [0.0625], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001556396484375 loss: 0.00457763671875 loss: 0.0023956298828125 loss: 0.001129150390625 predicted value: tensor([[0.3691], [0.5391], [0.4609], [0.5234], [0.4258], [0.7070], [0.9766], [0.5195], [0.2656], 
[0.7305], [0.3848], [0.6797], [0.4199], [0.4023], [0.1709], [0.2178]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.4668], [0.5547], [0.3750], [0.8008], [1.0000], [0.5000], [0.3340], [0.7500], [0.5000], [0.8008], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013427734375 loss: 0.0023956298828125 loss: 0.00286865234375 loss: 0.002166748046875 predicted value: tensor([[0.4219], [0.4434], [0.9414], [0.1826], [0.4160], [0.4434], [0.3926], [0.5625], [0.5664], [0.3848], [0.6133], [0.3398], [0.3281], [0.3750], [0.1953], [0.1982]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.2500], [0.4668], [0.6016], [0.4668], [0.6016], [0.6016], [0.6016], [0.6016], [0.3340], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0035400390625 loss: 0.000396728515625 loss: 0.003936767578125 loss: 0.0019378662109375 52%|█████▏ | 255/492 [2:18:18<2:04:11, 31.44s/it] {'loss': 0.0087, 'learning_rate': 1e-05, 'epoch': 0.52} 52%|█████▏ | 255/492 [2:18:18<2:04:11, 31.44s/it]predicted value: tensor([[0.3672], [0.3281], [0.7344], [0.4219], [0.4219], [0.9688], [0.2715], [0.2988], [0.3027], [0.5781], [0.1729], [0.3906], [0.2949], [0.6562], [0.4609], [0.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.2500], [0.8008], [0.4668], [0.4668], [1.0000], [0.2500], [0.2500], [0.3340], [0.6680], [0.0625], [0.5000], [0.4004], [0.7500], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004150390625 loss: 0.00115966796875loss: 0.00124359130859375 loss: 0.003265380859375 predicted value: tensor([[0.4219], [0.9531], [0.2676], [0.7148], [0.2334], [0.2617], [0.9570], [0.2471], [0.5156], [0.7070], [0.5234], [0.3984], [0.3105], [0.2314], [0.2070], [0.2012]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.2002], [0.8008], [0.2500], 
[0.2500], [1.0000], [0.2002], [0.8008], [0.7500], [0.5000], [0.4004], [0.4004], [0.0400], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010833740234375 loss: 0.0003204345703125 loss: 0.002593994140625 loss: 0.000751495361328125 predicted value: tensor([[0.9688], [0.5156], [0.5742], [0.4121], [0.2598], [0.8320], [0.9570], [0.4355], [0.5508], [0.9648], [0.3457], [0.2344], [0.3926], [0.2148], [0.2148], [0.2324]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.5547], [0.4668], [0.2500], [0.8320], [1.0000], [0.4668], [0.4668], [1.0000], [0.3340], [0.2500], [0.4004], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00113677978515625 loss: 0.000301361083984375 loss: 0.00179290771484375 loss: 0.0009765625 predicted value: tensor([[0.5625], [0.4355], [0.7461], [0.4004], [0.7578], [0.6602], [0.9688], [0.2773], [0.7344], [0.6719], [0.2354], [0.7422], [0.4023], [0.1226], [0.1914], [0.1807]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8008], [0.3145], [0.8008], [0.7500], [1.0000], [0.2500], [0.8320], [0.8008], [0.2500], [0.8008], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00148773193359375 loss: 0.0008087158203125 loss: 0.00390625 loss: 0.0031280517578125 52%|█████▏ | 256/492 [2:18:50<2:04:08, 31.56s/it] {'loss': 0.007, 'learning_rate': 1e-05, 'epoch': 0.52} 52%|█████▏ | 256/492 [2:18:50<2:04:08, 31.56s/it]predicted value: tensor([[1.0859], [0.7773], [0.3457], [0.1396], [0.5430], [0.5547], [0.7969], [0.6953], [0.4922], [0.6953], [0.7969], [0.4082], [0.4980], [0.4316], [0.2715], [0.2930]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.3340], [0.0278], [0.3750], [0.4668], [0.8008], [0.7500], [0.3750], [0.6016], [0.7500], [0.4004], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001739501953125 loss: 
0.00121307373046875 loss: 0.00180816650390625 loss: 0.0028533935546875 predicted value: tensor([[0.6016], [0.4902], [0.3203], [0.5234], [0.3867], [1.0625], [1.0469], [0.7227], [0.6133], [1.0469], [0.7266], [0.4180], [0.4199], [0.2871], [0.1895], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.2500], [0.4668], [0.5000], [1.0000], [1.0000], [0.6016], [0.6016], [1.0000], [0.6016], [0.4004], [0.3340], [0.1670], [0.0625], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016326904296875 loss: 0.0016632080078125 loss: 0.0010528564453125 loss: 0.001922607421875 predicted value: tensor([[0.5273], [1.0547], [1.0547], [1.0391], [0.3262], [0.6484], [0.5273], [0.5312], [1.0625], [0.7812], [0.5117], [1.0781], [1.0469], [0.2520], [0.4141], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [1.0000], [0.2500], [0.3750], [0.3750], [0.4668], [1.0000], [0.7500], [0.5000], [1.0000], [1.0000], [0.1670], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000888824462890625 loss: 0.005218505859375 loss: 0.002410888671875 loss: 0.0015106201171875 predicted value: tensor([[0.5938], [1.0312], [0.5039], [1.0781], [1.0547], [0.3379], [0.7812], [0.8242], [0.5664], [0.4844], [0.6211], [0.7617], [0.4648], [0.4551], [0.2715], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [1.0000], [1.0000], [0.2500], [0.7500], [0.8008], [0.5000], [0.4004], [0.5000], [0.7500], [0.4004], [0.5000], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002838134765625 loss: 0.0015869140625 loss: 0.00078582763671875 loss: 0.00091552734375 52%|█████▏ | 257/492 [2:19:21<2:03:36, 31.56s/it] {'loss': 0.0075, 'learning_rate': 1e-05, 'epoch': 0.52} 52%|█████▏ | 257/492 [2:19:21<2:03:36, 31.56s/it]predicted value: tensor([[0.7773], [0.8516], [0.6172], [0.5000], [0.4961], [0.6836], [0.6680], [0.4980], 
[0.8086], [0.5977], [0.3945], [0.5234], [0.2197], [0.2715], [0.2617], [0.2275]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.9336], [0.8008], [0.5547], [0.4668], [0.5000], [0.7500], [0.6016], [0.4668], [0.8008], [0.5000], [0.2500], [0.5000], [0.0278], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00146484375 loss: 0.001953125loss: 0.0023345947265625 loss: 0.0014190673828125 predicted value: tensor([[0.4199], [0.5469], [1.0547], [0.2637], [0.8477], [0.6992], [0.8711], [0.3281], [0.3262], [0.5859], [0.5703], [0.4375], [0.4434], [0.4551], [0.2295], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [1.0000], [0.2002], [0.8320], [0.6680], [0.8008], [0.3340], [0.2500], [0.6016], [0.6016], [0.2500], [0.3340], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0022125244140625 loss: 0.0013275146484375 loss: 0.00185394287109375 loss: 0.0027923583984375 predicted value: tensor([[0.8750], [0.3203], [1.0781], [0.2910], [0.7852], [0.7461], [1.0781], [0.7578], [0.3008], [0.8008], [0.4258], [0.4668], [0.6875], [0.2891], [0.2559], [0.2490]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.2500], [1.0000], [0.2002], [0.6680], [0.7500], [1.0000], [0.6680], [0.2002], [0.8008], [0.2002], [0.4004], [0.6016], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00201416015625 loss: 0.002227783203125 loss: 0.0020294189453125 loss: 0.002777099609375 predicted value: tensor([[0.5469], [1.0625], [0.8047], [0.7852], [0.3145], [1.0469], [0.6328], [1.0703], [0.3613], [0.6484], [0.4863], [0.4180], [0.4336], [0.4805], [0.2871], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8008], [0.6680], [0.2500], [1.0000], [0.6016], [1.0000], [0.3340], [0.6016], [0.5000], [0.5000], [0.4004], [0.2852], [0.2002], [0.2500]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.00151824951171875 loss: 0.00145721435546875loss: 0.0019989013671875 loss: 0.00115203857421875 52%|█████▏ | 258/492 [2:19:53<2:03:22, 31.64s/it] {'loss': 0.0076, 'learning_rate': 1e-05, 'epoch': 0.52} 52%|█████▏ | 258/492 [2:19:53<2:03:22, 31.64s/it]predicted value: tensor([[0.3965], [0.7852], [0.4199], [0.4375], [1.0000], [0.7188], [0.4121], [0.7656], [0.3535], [0.3223], [0.4668], [0.3359], [0.6406], [0.1895], [0.3613], [0.1807]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.4668], [0.4668], [1.0000], [0.6680], [0.4668], [0.8008], [0.3750], [0.6016], [0.3750], [0.5000], [0.6016], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001129150390625 loss: 0.0020751953125loss: 0.001251220703125 loss: 0.00141143798828125 predicted value: tensor([[0.9883], [0.4062], [1.0078], [0.8008], [0.5820], [0.6367], [0.4258], [0.7148], [0.6484], [0.9844], [0.3125], [0.4805], [0.4453], [0.1797], [0.2969], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.8320], [0.7500], [0.7500], [0.3750], [0.8008], [0.6016], [1.0000], [0.3340], [0.5000], [0.5000], [0.2002], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00531005859375 loss: 0.00127410888671875loss: 0.0015869140625 loss: 0.000820159912109375 predicted value: tensor([[0.8359], [0.4023], [0.4473], [0.2354], [0.4277], [0.4629], [0.5156], [0.2598], [0.2197], [0.3711], [0.5195], [0.4180], [0.3809], [0.3633], [0.3496], [0.1807]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.4668], [0.2500], [0.4668], [0.3145], [0.6016], [0.2500], [0.2500], [0.5000], [0.5000], [0.3340], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010528564453125 loss: 0.000766754150390625 loss: 0.0013427734375 loss: 0.00075531005859375 predicted value: tensor([[0.8359], [0.8125], [0.4082], 
[0.4121], [0.5117], [0.2041], [0.9766], [0.1885], [0.2070], [0.6836], [0.5039], [0.3320], [0.1650], [0.3203], [0.1846], [0.1807]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.8320], [0.3750], [0.4668], [0.5547], [0.2002], [1.0000], [0.2002], [0.2500], [0.7500], [0.3340], [0.3340], [0.1426], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010223388671875 loss: 0.001434326171875 loss: 0.00189971923828125 loss: 0.00118255615234375 53%|█████▎ | 259/492 [2:20:24<2:02:24, 31.52s/it] {'loss': 0.0061, 'learning_rate': 1e-05, 'epoch': 0.53} 53%|█████▎ | 259/492 [2:20:24<2:02:24, 31.52s/it]predicted value: tensor([[0.4141], [0.7930], [0.9961], [0.4863], [0.6094], [0.4629], [0.5039], [0.9883], [0.9922], [0.1582], [0.5234], [0.3496], [0.2949], [0.3887], [0.1816], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [1.0000], [0.3750], [0.6016], [0.7500], [0.4668], [1.0000], [1.0000], [0.2500], [0.6016], [0.5000], [0.3340], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00079345703125 loss: 0.0022125244140625loss: 0.00139617919921875 loss: 0.000705718994140625 predicted value: tensor([[0.5625], [0.4355], [0.4336], [0.2236], [0.2451], [0.9922], [0.9648], [0.9609], [0.9805], [0.9766], [0.9883], [0.5938], [0.1572], [0.3164], [0.1699], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], [0.2500], [0.2500], [1.0000], [1.0000], [1.0000], [1.0000], [1.0000], [1.0000], [0.6016], [0.0400], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000644683837890625 loss: 0.0003814697265625loss: 0.000537872314453125 loss: 0.0008087158203125 predicted value: tensor([[0.4160], [0.9922], [0.2617], [0.5703], [0.9961], [0.4199], [0.9805], [0.9844], [0.6133], [0.2129], [0.6523], [0.2988], [0.6328], [0.4043], [0.1738], [0.1768]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) value label: tensor([[0.4668], [1.0000], [0.2500], [0.6016], [1.0000], [0.4668], [1.0000], [1.0000], [0.6016], [0.2002], [0.7500], [0.3340], [0.4668], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009765625 loss: 0.000858306884765625 loss: 0.00189208984375 loss: 0.00140380859375 predicted value: tensor([[0.5977], [0.5469], [0.8438], [0.9961], [1.0000], [1.0000], [0.9805], [0.7148], [0.4512], [0.4707], [1.0156], [0.3477], [0.3770], [0.1865], [0.0452], [0.2080]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.8320], [1.0000], [1.0000], [1.0000], [1.0000], [0.7500], [0.4668], [0.7500], [1.0000], [0.3340], [0.5000], [0.2002], [0.0625], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115966796875 loss: 0.00153350830078125loss: 0.001922607421875 loss: 0.001251220703125 53%|█████▎ | 260/492 [2:20:56<2:02:08, 31.59s/it] {'loss': 0.0046, 'learning_rate': 1e-05, 'epoch': 0.53} 53%|█████▎ | 260/492 [2:20:56<2:02:08, 31.59s/it]predicted value: tensor([[0.6602], [1.0859], [0.8984], [0.2852], [0.2852], [0.4805], [0.9062], [0.6758], [0.6328], [0.5625], [0.6289], [0.5078], [0.2207], [0.3301], [0.2715], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8203], [1.0000], [0.8320], [0.2500], [0.2002], [0.3750], [0.8320], [0.5000], [0.8008], [0.5000], [0.5000], [0.5000], [0.1670], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023956298828125 loss: 0.002288818359375 loss: 0.00244140625 loss: 0.0017547607421875 predicted value: tensor([[1.0625], [0.4180], [0.3730], [0.5117], [0.4941], [0.2695], [0.6680], [0.8164], [0.5312], [0.4102], [0.7617], [0.7461], [0.5312], [0.2715], [0.3047], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2715], [0.3340], [0.4668], [0.4668], [0.2002], [0.6016], [0.7500], [0.4668], [0.3340], [0.6016], [0.7500], [0.4004], [0.2002], [0.2500], [0.2500]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.0037841796875 loss: 0.0017242431640625loss: 0.00164794921875 loss: 0.0015411376953125 predicted value: tensor([[0.6289], [0.3398], [0.5508], [0.4980], [0.5078], [0.3418], [0.4863], [1.0391], [0.7109], [0.7812], [0.7344], [0.3398], [0.5508], [0.4785], [0.2676], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.4668], [0.4668], [0.4668], [0.2500], [0.4668], [1.0000], [0.6680], [0.7500], [0.6016], [0.2002], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003387451171875 loss: 0.000675201416015625 loss: 0.0026702880859375 loss: 0.001983642578125 predicted value: tensor([[0.5117], [0.9102], [0.4980], [0.9180], [0.6367], [0.7227], [0.7656], [0.8359], [0.6680], [1.0391], [0.7734], [0.3750], [0.4863], [0.2656], [0.2832], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.4668], [0.8320], [0.4668], [0.6016], [0.7500], [0.6680], [0.5000], [1.0000], [0.5000], [0.2500], [0.5000], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033416748046875 loss: 0.0014190673828125 loss: 0.0037841796875 loss: 0.0029144287109375 53%|█████▎ | 261/492 [2:21:28<2:01:44, 31.62s/it] {'loss': 0.0094, 'learning_rate': 1e-05, 'epoch': 0.53} 53%|█████▎ | 261/492 [2:21:28<2:01:44, 31.62s/it]predicted value: tensor([[0.6758], [0.4902], [0.7734], [0.6094], [0.5430], [0.8594], [0.4629], [0.7422], [0.5078], [0.4121], [0.3848], [0.5938], [0.5078], [0.5039], [0.2402], [0.2441]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.6680], [0.5547], [0.4668], [0.8008], [0.7500], [0.6016], [0.5000], [0.6016], [0.5000], [0.6016], [0.5000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030975341796875 loss: 0.00360107421875loss: 0.002044677734375 loss: 0.0017547607421875 predicted value: tensor([[0.4766], [0.8203], 
[0.4961], [0.5156], [0.8203], [0.9023], [1.0625], [0.5469], [0.5742], [0.7852], [0.4609], [0.6484], [0.3730], [0.2715], [0.2676], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.3750], [0.4668], [0.6680], [0.8320], [1.0000], [0.6016], [0.4668], [0.7500], [0.2500], [0.6016], [0.0400], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027618408203125 loss: 0.0037994384765625 loss: 0.00177001953125 loss: 0.0032196044921875 predicted value: tensor([[1.0469], [1.0391], [1.0625], [0.4746], [0.6172], [0.3164], [0.4785], [1.0469], [0.3770], [1.0547], [0.5742], [0.4043], [0.3457], [0.4570], [0.4629], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [1.0000], [0.4668], [0.4668], [0.3340], [0.4668], [1.0000], [0.3340], [1.0000], [0.5000], [0.3340], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021514892578125 loss: 0.000934600830078125 loss: 0.001220703125 loss: 0.00122833251953125 predicted value: tensor([[0.5938], [1.0625], [1.0469], [0.7031], [0.3203], [0.5273], [1.0547], [1.0781], [0.6914], [0.5312], [0.8672], [0.7344], [0.4980], [0.2598], [0.3125], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.7500], [0.3340], [0.2500], [1.0000], [1.0000], [0.6016], [0.4004], [0.8008], [0.6016], [0.4004], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00213623046875 loss: 0.0010986328125 loss: 0.00180816650390625 loss: 0.0027008056640625 53%|█████▎ | 262/492 [2:21:59<2:01:19, 31.65s/it] {'loss': 0.0088, 'learning_rate': 1e-05, 'epoch': 0.53} 53%|█████▎ | 262/492 [2:21:59<2:01:19, 31.65s/it]predicted value: tensor([[0.6484], [0.7617], [0.4102], [0.9570], [0.6250], [0.3711], [0.2295], [0.5117], [0.7109], [0.4863], [0.5625], [0.4570], [0.9609], [0.3672], [0.2324], [0.2061]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) value label: tensor([[0.5547], [0.8008], [0.4668], [1.0000], [0.6016], [0.3340], [0.2002], [0.5547], [0.6680], [0.5000], [0.6016], [0.4004], [1.0000], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0004138946533203125 loss: 0.000698089599609375loss: 0.000640869140625 loss: 0.00121307373046875 predicted value: tensor([[0.3887], [0.2539], [0.4180], [0.3633], [0.4219], [0.6406], [0.6016], [0.2793], [0.2119], [0.4980], [0.5469], [0.3789], [0.5039], [0.6172], [0.1836], [0.2061]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.3750], [0.4668], [0.7500], [0.7500], [0.6680], [0.3340], [0.3340], [0.5000], [0.5000], [0.4668], [0.4004], [0.6016], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00055694580078125 loss: 0.0028533935546875 loss: 0.000423431396484375 loss: 0.00157928466796875 predicted value: tensor([[0.3945], [0.5117], [0.6914], [0.7773], [0.2412], [0.1963], [0.6953], [0.9844], [0.9609], [0.6719], [0.5664], [0.3926], [0.1924], [0.1641], [0.2080], [0.2432]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.6680], [0.8320], [0.2500], [0.2002], [0.4668], [1.0000], [1.0000], [0.7500], [0.5000], [0.4004], [0.1670], [0.1670], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00054168701171875 loss: 0.00015735626220703125 loss: 0.00121307373046875 loss: 0.00115966796875 predicted value: tensor([[0.5156], [0.9648], [0.5039], [0.9609], [0.4121], [0.8164], [0.9766], [0.2754], [0.2578], [0.9375], [0.6602], [0.3789], [0.3535], [0.3691], [0.2207], [0.3398]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.5547], [1.0000], [0.4668], [0.8008], [1.0000], [0.2500], [0.2500], [1.0000], [0.7500], [0.4004], [0.4004], [0.3340], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016326904296875 loss: 0.00048828125loss: 0.0021514892578125 loss: 
0.0011138916015625 53%|█████▎ | 263/492 [2:22:31<2:01:19, 31.79s/it] {'loss': 0.0042, 'learning_rate': 1e-05, 'epoch': 0.53} 53%|█████▎ | 263/492 [2:22:31<2:01:19, 31.79s/it]predicted value: tensor([[0.5469], [0.0884], [0.7422], [0.7188], [0.9805], [0.2539], [0.2793], [0.6797], [0.6172], [0.5859], [0.2578], [0.9883], [0.4199], [0.2041], [0.1963], [0.2041]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5312], [0.0400], [0.5547], [0.8008], [1.0000], [0.2500], [0.3340], [0.6016], [0.6016], [0.5000], [0.3340], [1.0000], [0.4004], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010833740234375 loss: 0.00109100341796875 loss: 0.000972747802734375 loss: 0.00089263916015625 predicted value: tensor([[0.5508], [0.5391], [0.4023], [0.9648], [0.9727], [0.4238], [0.2363], [0.4023], [0.6445], [0.5859], [0.4062], [0.3809], [0.3457], [0.3340], [0.1953], [0.1992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [1.0000], [1.0000], [0.3750], [0.2002], [0.4668], [0.7500], [0.6016], [0.4668], [0.4004], [0.2852], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003265380859375 loss: 0.0005340576171875 loss: 0.000659942626953125 loss: 0.000949859619140625 predicted value: tensor([[0.4297], [0.3691], [0.7930], [0.2217], [0.3320], [0.2754], [0.4434], [0.3906], [0.9805], [0.4062], [0.5898], [0.5703], [0.4785], [0.4023], [0.4023], [0.2090]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8320], [0.2002], [0.3750], [0.2500], [0.4668], [0.4668], [1.0000], [0.3750], [0.6016], [0.3340], [0.4004], [0.5000], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000545501708984375 loss: 0.0014801025390625loss: 0.000934600830078125 loss: 0.000545501708984375 predicted value: tensor([[0.4453], [0.5664], [0.9922], [0.2695], [0.6211], [0.5508], [1.0156], [0.5391], [1.0078], [0.5391], [0.3984], [1.0312], 
[0.3984], [0.1875], [0.4160], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [1.0000], [0.2500], [0.5000], [0.5547], [1.0000], [0.5000], [1.0000], [0.5000], [0.4668], [1.0000], [0.4004], [0.1670], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00038909912109375 loss: 0.00201416015625 loss: 0.001129150390625 loss: 0.0015106201171875 54%|█████▎ | 264/492 [2:23:03<2:00:48, 31.79s/it] {'loss': 0.0045, 'learning_rate': 1e-05, 'epoch': 0.54} 54%|█████▎ | 264/492 [2:23:03<2:00:48, 31.79s/it]predicted value: tensor([[0.8750], [0.5039], [0.4629], [0.8594], [0.6016], [0.2910], [0.3359], [0.3379], [1.0938], [0.1240], [0.6641], [0.4883], [0.3926], [0.4707], [0.2598], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.3750], [0.2500], [0.8320], [0.5547], [0.2002], [0.2500], [0.2500], [1.0000], [0.0625], [0.6016], [0.4004], [0.3340], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001434326171875 loss: 0.001953125loss: 0.00286865234375 loss: 0.0037689208984375 predicted value: tensor([[0.9141], [0.5469], [0.5156], [0.8516], [0.3594], [0.7031], [0.7227], [0.8477], [1.0859], [0.4688], [0.4941], [0.3457], [0.4922], [0.2832], [0.2539], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.4668], [0.8008], [0.2500], [0.6016], [0.6680], [0.8008], [1.0000], [0.5000], [0.4004], [0.3340], [0.5000], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004058837890625 loss: 0.0010986328125loss: 0.002471923828125 loss: 0.00213623046875 predicted value: tensor([[0.8594], [0.8281], [0.4844], [0.5977], [1.0859], [0.4844], [0.4551], [0.7148], [0.7148], [0.5195], [0.4531], [0.5000], [0.5000], [0.4336], [0.2754], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8320], [0.3750], [0.5547], [1.0000], [0.4668], [0.3750], [0.7500], 
[0.6016], [0.4668], [0.3340], [0.4004], [0.4004], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00335693359375 loss: 0.00145721435546875 loss: 0.00244140625 loss: 0.001800537109375 predicted value: tensor([[0.6719], [0.5039], [0.3730], [0.4355], [0.5000], [0.8672], [0.6484], [0.6289], [0.5391], [0.4023], [0.7461], [0.6602], [0.4355], [0.5859], [0.2891], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3340], [0.3340], [0.4668], [0.8320], [0.6016], [0.7500], [0.4668], [0.3340], [0.6016], [0.6016], [0.3340], [0.6016], [0.2500], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021820068359375 loss: 0.0015869140625 loss: 0.001953125 loss: 0.001800537109375 54%|█████▍ | 265/492 [2:23:34<1:59:36, 31.62s/it] {'loss': 0.0091, 'learning_rate': 1e-05, 'epoch': 0.54} 54%|█████▍ | 265/492 [2:23:34<1:59:36, 31.62s/it]predicted value: tensor([[0.5234], [1.0938], [0.8008], [1.1328], [1.1016], [0.7695], [0.8516], [0.5234], [0.5859], [0.4297], [0.6094], [0.5195], [0.4297], [0.4199], [0.5156], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8320], [1.0000], [1.0000], [0.5000], [0.8008], [0.3750], [0.8008], [0.2500], [0.6016], [0.5000], [0.3340], [0.4004], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001922607421875 loss: 0.00157928466796875 loss: 0.00101470947265625 loss: 0.0035552978515625 predicted value: tensor([[0.3750], [0.8516], [0.1602], [1.0859], [0.4980], [0.7656], [0.4375], [0.3945], [0.3320], [0.5820], [0.6914], [1.1094], [0.4609], [0.5273], [0.2734], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.8320], [0.0400], [1.0000], [0.3750], [0.8008], [0.2500], [0.2500], [0.2500], [0.5000], [0.6016], [1.0000], [0.4004], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00390625 loss: 0.002349853515625loss: 0.0016021728515625 loss: 
0.0022430419921875 predicted value: tensor([[0.8711], [0.5117], [1.0938], [0.5664], [0.7422], [1.0859], [1.0938], [0.8242], [0.8125], [0.3887], [0.6367], [0.4258], [0.0933], [0.2539], [0.2871], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.4668], [1.0000], [0.4668], [0.8008], [1.0000], [1.0000], [0.8008], [0.8008], [0.2500], [0.6016], [0.3340], [0.0400], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001708984375 loss: 0.00141143798828125 loss: 0.0023345947265625 loss: 0.001800537109375 predicted value: tensor([[0.8672], [0.8945], [0.8164], [0.8438], [0.5820], [0.6406], [0.2715], [0.3477], [0.3496], [1.0938], [0.5312], [0.5977], [0.4609], [0.4141], [0.4766], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8555], [0.7148], [0.8320], [0.3340], [0.6016], [0.2500], [0.2500], [0.3340], [1.0000], [0.4668], [0.6016], [0.4004], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00177764892578125 loss: 0.002227783203125 loss: 0.0019073486328125 loss: 0.0023040771484375 54%|█████▍ | 266/492 [2:24:06<1:58:38, 31.50s/it] {'loss': 0.0084, 'learning_rate': 1e-05, 'epoch': 0.54} 54%|█████▍ | 266/492 [2:24:06<1:58:38, 31.50s/it]predicted value: tensor([[1.0469], [0.4688], [0.6484], [0.5469], [0.2812], [0.7539], [0.6875], [1.0078], [0.2832], [0.5820], [0.5703], [0.3477], [0.3477], [0.3770], [0.1777], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.4668], [0.3340], [0.8008], [0.8008], [1.0000], [0.2002], [0.6016], [0.6016], [0.4004], [0.4004], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0004024505615234375 loss: 0.000457763671875 loss: 0.0012359619140625 loss: 0.0023956298828125 predicted value: tensor([[1.0000], [0.6914], [0.3906], [1.0078], [0.4023], [0.7773], [0.6250], [0.4473], [0.2539], [1.0078], [0.7227], [0.4590], [0.1758], 
[0.1660], [0.3613], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.3750], [1.0000], [0.4668], [0.8008], [0.6016], [0.4004], [0.2002], [1.0000], [0.5547], [0.3750], [0.0400], [0.1670], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00130462646484375 loss: 0.0025177001953125 loss: 0.00127410888671875 loss: 0.000850677490234375 predicted value: tensor([[0.8164], [0.2129], [0.6797], [0.7383], [0.4492], [0.7266], [0.3438], [0.6992], [0.6328], [1.0078], [0.4512], [0.4980], [0.2578], [0.3125], [0.1621], [0.1729]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.2500], [0.6680], [0.8008], [0.3750], [0.6680], [0.2002], [0.7500], [0.7500], [1.0000], [0.5000], [0.6016], [0.2500], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005584716796875 loss: 0.00115203857421875loss: 0.00113677978515625 loss: 0.00193023681640625 predicted value: tensor([[0.5352], [0.5820], [0.6211], [0.4375], [0.3203], [0.4785], [1.0234], [1.0156], [1.0391], [0.6367], [0.2480], [0.6211], [0.4258], [0.4102], [0.1797], [0.1621]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.6680], [0.3750], [0.2500], [0.6016], [1.0000], [1.0000], [1.0000], [0.4668], [0.2500], [0.6680], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00091552734375 loss: 0.00130462646484375 loss: 0.002655029296875 loss: 0.001190185546875 54%|█████▍ | 267/492 [2:24:38<1:58:35, 31.62s/it] {'loss': 0.0066, 'learning_rate': 1e-05, 'epoch': 0.54} 54%|█████▍ | 267/492 [2:24:38<1:58:35, 31.62s/it]predicted value: tensor([[0.4434], [0.7344], [0.4355], [0.7930], [0.6328], [0.4531], [0.9883], [0.6680], [1.0000], [0.3398], [0.6211], [0.3477], [0.3203], [0.2832], [0.2158], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8555], [0.4668], [0.8320], [0.7500], [0.3750], [1.0000], 
[0.6680], [1.0000], [0.3340], [0.6016], [0.4004], [0.4004], [0.0625], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00079345703125 loss: 0.00150299072265625 loss: 0.00042724609375 loss: 0.000812530517578125 predicted value: tensor([[0.5234], [0.7891], [0.9961], [0.2617], [0.4785], [0.4824], [0.4141], [0.6445], [0.9922], [0.7031], [0.3965], [0.3906], [0.3867], [0.4023], [0.3652], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [1.0000], [0.2500], [0.3750], [0.6016], [0.3750], [0.6680], [1.0000], [0.7500], [0.5000], [0.4004], [0.3340], [0.5000], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0014190673828125 loss: 0.0008544921875 loss: 0.00087738037109375 loss: 0.00179290771484375 predicted value: tensor([[0.4629], [0.7617], [0.4180], [0.5938], [0.6680], [1.0000], [0.4414], [0.5039], [0.9805], [0.4355], [0.4316], [0.5977], [0.4473], [0.3418], [0.1953], [0.2002]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.3750], [0.4668], [0.6680], [1.0000], [0.4668], [0.6016], [1.0000], [0.4668], [0.4668], [0.6016], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030670166015625 loss: 0.0006561279296875loss: 0.002166748046875 loss: 0.0015106201171875 predicted value: tensor([[0.7734], [0.5547], [0.4160], [0.5469], [0.2217], [0.9805], [0.6875], [0.5664], [0.9766], [0.9922], [0.3320], [0.4199], [0.3438], [0.4082], [0.1602], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.3750], [0.5547], [0.2002], [1.0000], [0.7500], [0.6016], [1.0000], [1.0000], [0.3340], [0.4004], [0.2500], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021514892578125 loss: 0.001556396484375 loss: 0.000335693359375 loss: 0.000614166259765625 54%|█████▍ | 268/492 [2:25:09<1:57:20, 31.43s/it] {'loss': 0.0051, 'learning_rate': 1e-05, 'epoch': 0.54} 
54%|█████▍ | 268/492 [2:25:09<1:57:20, 31.43s/it]predicted value: tensor([[0.4883], [0.4707], [0.6406], [0.3398], [0.5312], [0.3691], [0.4414], [0.4805], [0.3926], [0.5938], [0.8438], [0.4434], [0.4590], [0.4805], [0.4355], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.5547], [0.3340], [0.3750], [0.2500], [0.7500], [0.4668], [0.3340], [0.6016], [0.8008], [0.4004], [0.2852], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00164794921875 loss: 0.00136566162109375 loss: 0.0029449462890625 loss: 0.00118255615234375 predicted value: tensor([[0.6406], [0.7188], [1.0547], [0.2793], [1.0703], [1.0859], [0.6250], [0.5586], [0.3008], [1.0859], [0.5391], [0.4648], [0.4258], [0.4844], [0.2520], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [1.0000], [0.1670], [1.0000], [1.0000], [0.6016], [0.7500], [0.2500], [1.0000], [0.4668], [0.4004], [0.4004], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004852294921875 loss: 0.00176239013671875 loss: 0.0029449462890625 loss: 0.00262451171875 predicted value: tensor([[1.0625], [1.0703], [0.5664], [0.7188], [0.7734], [0.4961], [0.8008], [0.3438], [0.7070], [0.5234], [0.4492], [0.6602], [0.4707], [0.2832], [0.2480], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.8008], [0.6680], [0.4668], [0.8008], [0.2500], [0.6016], [0.4668], [0.4004], [0.7500], [0.4004], [0.2500], [0.1426], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001678466796875 loss: 0.0014190673828125 loss: 0.0030670166015625 loss: 0.0012359619140625 predicted value: tensor([[0.5312], [0.6133], [0.3047], [0.3184], [0.3633], [0.7969], [1.0859], [0.4980], [0.2373], [1.1016], [0.4492], [0.3223], [0.5977], [0.5078], [0.2656], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.2500], 
[0.2500], [0.3340], [0.6680], [1.0000], [0.4668], [0.2500], [1.0000], [0.5000], [0.4004], [0.6016], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0022735595703125 loss: 0.0019683837890625 loss: 0.00115966796875 loss: 0.0012969970703125 55%|█████▍ | 269/492 [2:25:41<1:57:43, 31.68s/it] {'loss': 0.0084, 'learning_rate': 1e-05, 'epoch': 0.55} 55%|█████▍ | 269/492 [2:25:41<1:57:43, 31.68s/it]predicted value: tensor([[0.6133], [0.5820], [0.2871], [0.2441], [0.3027], [0.2891], [0.3340], [0.5352], [0.2773], [0.7188], [0.2500], [0.4395], [0.5117], [0.2812], [0.2891], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4004], [0.2500], [0.2500], [0.2500], [0.2500], [0.3340], [0.5000], [0.3340], [0.6016], [0.2002], [0.3340], [0.3340], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00201416015625 loss: 0.0035552978515625 loss: 0.0027923583984375 loss: 0.00396728515625 predicted value: tensor([[1.0391], [0.2891], [0.8672], [1.0547], [1.0391], [0.4531], [1.0312], [1.0469], [0.8047], [1.0625], [0.4551], [0.7227], [1.0000], [0.4043], [0.2754], [0.2598]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [0.8008], [1.0000], [1.0000], [0.4668], [1.0000], [1.0000], [0.7500], [1.0000], [0.4004], [0.7500], [1.0000], [0.3340], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004608154296875 loss: 0.0005950927734375loss: 0.0027008056640625 loss: 0.00439453125 predicted value: tensor([[0.4688], [0.4531], [0.4199], [0.9258], [0.3105], [0.5117], [0.4707], [0.3086], [0.4473], [0.3691], [0.4609], [0.3301], [0.2754], [0.4727], [0.2832], [0.2412]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.8320], [0.2500], [0.7500], [0.4668], [0.7500], [0.4004], [0.6016], [0.4668], [0.2002], [0.4004], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.0035552978515625 loss: 0.0057373046875loss: 0.00128173828125 loss: 0.00185394287109375 predicted value: tensor([[0.4922], [0.4961], [0.4980], [0.4766], [0.4844], [1.0625], [0.3066], [0.8242], [0.3008], [0.2695], [0.4863], [0.4844], [0.3281], [0.2852], [0.3047], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.8008], [0.5547], [1.0000], [0.2500], [0.7500], [0.2500], [0.2500], [0.3340], [0.5000], [0.5000], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023193359375 loss: 0.003021240234375 loss: 0.002655029296875 loss: 0.00191497802734375 55%|█████▍ | 270/492 [2:26:12<1:56:40, 31.53s/it] {'loss': 0.0117, 'learning_rate': 1e-05, 'epoch': 0.55} 55%|█████▍ | 270/492 [2:26:12<1:56:40, 31.53s/it]predicted value: tensor([[0.6094], [0.4062], [0.9219], [0.3672], [0.5391], [0.9297], [0.7383], [0.5312], [0.2100], [0.2559], [0.1924], [0.3555], [0.4199], [0.1641], [0.2383], [0.1592]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [1.0000], [0.4668], [0.8008], [1.0000], [0.8320], [0.5547], [0.6016], [0.3340], [0.2500], [0.4004], [0.6016], [0.1426], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003875732421875 loss: 0.00567626953125 loss: 0.0036773681640625 loss: 0.00136566162109375 predicted value: tensor([[0.2041], [0.3945], [0.9141], [0.5586], [0.4629], [0.1611], [0.9102], [0.4902], [0.4570], [0.2930], [0.2969], [0.2412], [0.3164], [0.1855], [0.0126], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.4668], [1.0000], [0.8320], [0.6680], [0.4004], [1.0000], [0.4277], [0.7500], [0.6016], [0.4004], [0.2500], [0.4004], [0.2002], [0.0278], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00341796875 loss: 0.00653076171875loss: 0.001861572265625 loss: 0.00146484375 predicted value: tensor([[0.8984], [0.6328], [0.9141], [0.9492], [0.9336], [0.3926], [0.2012], [0.5703], 
[0.1680], [0.5117], [0.5156], [0.3730], [0.4141], [0.4023], [0.2158], [0.2002]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5703], [1.0000], [1.0000], [1.0000], [0.3750], [0.2500], [0.5000], [0.1670], [0.8008], [0.5000], [0.4004], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00543212890625 loss: 0.0020751953125loss: 0.0017547607421875 loss: 0.0084228515625 predicted value: tensor([[0.3750], [0.4121], [0.3828], [0.3613], [0.2695], [0.6250], [0.3301], [0.8984], [0.6094], [0.2324], [0.3320], [0.6133], [0.4805], [0.4004], [0.2061], [0.1855]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.4668], [0.7500], [0.6680], [0.6016], [1.0000], [0.5000], [0.3340], [0.3145], [0.7500], [0.4004], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00616455078125 loss: 0.004791259765625 loss: 0.0048828125 loss: 0.006439208984375 55%|█████▌ | 271/492 [2:26:43<1:55:52, 31.46s/it] {'loss': 0.017, 'learning_rate': 1e-05, 'epoch': 0.55} 55%|█████▌ | 271/492 [2:26:43<1:55:52, 31.46s/it]predicted value: tensor([[0.5039], [0.2910], [0.4180], [0.4863], [0.6562], [0.8711], [0.5430], [0.5859], [0.7422], [0.2891], [0.2451], [0.3574], [0.4023], [0.1855], [0.3789], [0.2168]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.4668], [0.4668], [0.7500], [1.0000], [0.7500], [0.5000], [0.7500], [0.2500], [0.2500], [0.3340], [0.4004], [0.1670], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004486083984375 loss: 0.00185394287109375 loss: 0.0013580322265625 loss: 0.0023345947265625 predicted value: tensor([[0.8633], [0.5117], [0.4004], [0.4688], [0.8828], [0.2559], [0.7539], [0.6328], [0.8750], [0.3105], [0.1484], [0.3652], [0.1738], [0.1914], [0.1836], [0.2061]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.4668], [0.4668], 
[1.0000], [0.3340], [0.8008], [0.6016], [1.0000], [0.7500], [0.6680], [0.5000], [0.1250], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00396728515625 loss: 0.004119873046875 loss: 0.00860595703125 loss: 0.00128936767578125 predicted value: tensor([[0.8281], [0.4453], [0.8828], [0.5898], [0.8711], [0.4336], [0.8984], [0.2148], [0.3984], [0.4375], [0.2656], [0.4531], [0.5898], [0.6758], [0.1943], [0.1982]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [1.0000], [0.6016], [1.0000], [0.4668], [1.0000], [0.2500], [0.3750], [0.4668], [0.2500], [0.5000], [0.6016], [0.7500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00128173828125 loss: 0.00083160400390625loss: 0.00145721435546875 loss: 0.001373291015625 predicted value: tensor([[0.4004], [0.4395], [0.9102], [0.4277], [0.8867], [0.7500], [0.9023], [0.4004], [0.8125], [0.7070], [0.2637], [0.3711], [0.2344], [0.1514], [0.2090], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.4668], [1.0000], [0.8008], [1.0000], [0.4668], [0.8008], [0.7500], [0.2002], [0.3340], [0.4004], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00099945068359375 loss: 0.00146484375 loss: 0.0012664794921875 loss: 0.0037689208984375 55%|█████▌ | 272/492 [2:27:14<1:54:36, 31.26s/it] {'loss': 0.0101, 'learning_rate': 1e-05, 'epoch': 0.55} 55%|█████▌ | 272/492 [2:27:14<1:54:36, 31.26s/it]predicted value: tensor([[0.6289], [0.6133], [0.5586], [0.5469], [0.9219], [0.8203], [0.5430], [0.7344], [0.7930], [0.4336], [0.7266], [0.4258], [0.5469], [0.4824], [0.2539], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [0.4668], [0.8008], [0.8008], [0.4668], [0.8320], [0.7500], [0.3340], [0.8008], [0.3340], [0.6016], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001373291015625 
loss: 0.00135040283203125loss: 0.0017852783203125 loss: 0.00168609619140625 predicted value: tensor([[0.8906], [0.5117], [0.4492], [0.8320], [0.8242], [0.5391], [0.4668], [0.2832], [0.6953], [0.5039], [0.4355], [0.7383], [0.5547], [0.2695], [0.2910], [0.2373]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.3145], [0.8008], [0.6680], [0.4668], [0.3340], [0.1670], [0.6016], [0.3750], [0.3340], [0.5000], [0.6016], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00238037109375 loss: 0.0023193359375 loss: 0.0029754638671875 loss: 0.0019073486328125 predicted value: tensor([[0.9844], [0.9414], [0.5273], [0.6367], [0.9570], [0.5703], [0.8594], [0.9414], [0.6562], [0.9297], [0.6133], [0.4941], [0.5703], [0.4473], [0.2891], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.3750], [0.8320], [1.0000], [0.4668], [0.8008], [1.0000], [0.6016], [1.0000], [0.6016], [0.4004], [0.7500], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0022735595703125 loss: 0.0021514892578125loss: 0.003173828125 loss: 0.00408935546875 predicted value: tensor([[0.8320], [0.9609], [0.5312], [0.5195], [0.5430], [0.9180], [0.3848], [0.7539], [0.3086], [0.6172], [0.2197], [0.5078], [0.4531], [0.3984], [0.2773], [0.3027]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [0.4668], [0.4668], [0.4668], [1.0000], [0.2500], [0.6680], [0.2500], [0.8008], [0.0400], [0.4668], [0.5000], [0.3340], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00201416015625loss: 0.0023345947265625 loss: 0.0014495849609375 loss: 0.00274658203125 55%|█████▌ | 273/492 [2:27:45<1:53:53, 31.20s/it] {'loss': 0.009, 'learning_rate': 1e-05, 'epoch': 0.55} 55%|█████▌ | 273/492 [2:27:45<1:53:53, 31.20s/it]predicted value: tensor([[0.9844], [0.8906], [0.9570], [0.5273], [0.9531], [0.2969], [0.6289], [0.3516], [0.9336], 
[0.7227], [0.5117], [0.4199], [0.4590], [0.5117], [0.2910], [0.2354]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [1.0000], [0.4668], [1.0000], [0.2002], [0.5000], [0.2002], [1.0000], [0.7500], [0.4004], [0.5000], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00250244140625 loss: 0.00151824951171875loss: 0.002471923828125 loss: 0.00225830078125 predicted value: tensor([[0.9648], [0.5352], [0.5938], [0.9219], [0.9102], [0.7188], [0.5547], [0.6836], [0.9453], [0.5352], [0.4746], [0.4941], [0.6328], [0.3125], [0.2754], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.4668], [1.0000], [1.0000], [0.6016], [0.4668], [0.5000], [1.0000], [0.5000], [0.6016], [0.3340], [0.6016], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032806396484375 loss: 0.002197265625 loss: 0.002655029296875 loss: 0.0030059814453125 predicted value: tensor([[0.5039], [0.6094], [0.4746], [0.5938], [0.7734], [0.5586], [0.9180], [0.7617], [0.4473], [0.9023], [0.6523], [0.4453], [0.4551], [0.3555], [0.2354], [0.2119]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5547], [0.3750], [0.6680], [0.6680], [0.4668], [1.0000], [0.8008], [0.2002], [1.0000], [0.6016], [0.3340], [0.3340], [0.3340], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021209716796875 loss: 0.002655029296875 loss: 0.0012664794921875 loss: 0.0027313232421875 predicted value: tensor([[0.5586], [0.6680], [0.4062], [0.5898], [0.6562], [0.7617], [0.6641], [0.9141], [0.4316], [0.7070], [0.5117], [0.4668], [0.7031], [0.5469], [0.2520], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6172], [0.2500], [0.4668], [0.5547], [0.6016], [0.5547], [1.0000], [0.2500], [0.6016], [0.7500], [0.5000], [0.7500], [0.6680], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) 
loss: 0.0015869140625 loss: 0.001495361328125 loss: 0.003631591796875loss: 0.00128936767578125 56%|█████▌ | 274/492 [2:28:16<1:53:16, 31.18s/it] {'loss': 0.0092, 'learning_rate': 1e-05, 'epoch': 0.56} 56%|█████▌ | 274/492 [2:28:16<1:53:16, 31.18s/it]predicted value: tensor([[0.5625], [0.4590], [0.2617], [0.2617], [0.5391], [0.2617], [0.5977], [0.7227], [0.5000], [0.5898], [0.5312], [0.4219], [0.3164], [0.3730], [0.1602], [0.1582]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.2500], [0.2500], [0.5547], [0.2500], [0.6016], [0.8008], [0.6016], [0.7500], [0.5000], [0.4004], [0.5000], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028533935546875 loss: 0.00125885009765625loss: 0.0011749267578125 loss: 0.000942230224609375 predicted value: tensor([[0.5547], [0.6836], [0.5156], [0.9141], [0.5703], [0.2500], [0.7422], [0.5977], [0.4629], [0.9180], [0.4277], [0.5078], [0.3281], [0.0771], [0.3477], [0.1270]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6680], [0.3750], [1.0000], [0.4668], [0.2002], [0.8008], [0.6016], [0.4668], [1.0000], [0.4004], [0.6016], [0.4004], [0.0625], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00286865234375 loss: 0.00115203857421875loss: 0.0021209716796875 loss: 0.0030670166015625 predicted value: tensor([[0.3848], [0.2295], [0.3770], [0.3594], [0.4141], [0.4336], [0.8711], [0.8984], [0.4258], [0.4531], [0.3164], [0.6953], [0.4004], [0.1621], [0.1602], [0.1572]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.4668], [0.4668], [0.3750], [0.4668], [1.0000], [1.0000], [0.4004], [0.4668], [0.4004], [0.6016], [0.5000], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00125885009765625 loss: 0.00147247314453125loss: 0.0020751953125 loss: 0.00162506103515625 predicted value: tensor([[0.8906], [0.7969], [0.4492], [0.2812], [0.4609], 
[0.2432], [0.2637], [0.5625], [0.7148], [0.3965], [0.6406], [0.8555], [0.1543], [0.2051], [0.1768], [0.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.3750], [0.2500], [0.4668], [0.2500], [0.2500], [0.6016], [0.8008], [0.2500], [0.6680], [1.0000], [0.2002], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032958984375 loss: 0.00116729736328125loss: 0.0020751953125 loss: 0.0018768310546875 56%|█████▌ | 275/492 [2:28:47<1:52:39, 31.15s/it] {'loss': 0.0076, 'learning_rate': 1e-05, 'epoch': 0.56} 56%|█████▌ | 275/492 [2:28:47<1:52:39, 31.15s/it]predicted value: tensor([[0.3945], [0.6133], [0.3105], [0.4219], [0.4355], [0.9414], [0.6328], [0.9453], [0.7227], [0.4766], [0.4512], [0.3945], [0.6562], [0.1562], [0.3730], [0.3496]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.2500], [0.4668], [0.4668], [1.0000], [0.4668], [1.0000], [0.6680], [0.3340], [0.3750], [0.3340], [0.7500], [0.2002], [0.4004], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00225830078125 loss: 0.00146484375 loss: 0.00299072265625 loss: 0.0011138916015625 predicted value: tensor([[0.9688], [0.3984], [0.2734], [0.9531], [0.7852], [0.7266], [0.2520], [0.5469], [0.5430], [0.6953], [0.3984], [0.4023], [0.4023], [0.2266], [0.1504], [0.1729]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.2500], [1.0000], [0.8320], [0.8008], [0.2500], [0.3750], [0.6016], [0.8008], [0.4004], [0.7500], [0.4004], [0.0400], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024871826171875 loss: 0.00347900390625loss: 0.00177764892578125 loss: 0.001129150390625 predicted value: tensor([[0.5859], [0.5664], [0.9727], [0.5312], [0.9414], [0.3223], [0.5977], [0.2539], [0.9297], [0.5078], [0.4824], [0.9297], [0.3477], [0.3398], [0.1328], [0.2354]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], 
[0.5547], [1.0000], [0.5547], [1.0000], [0.2500], [0.7500], [0.2002], [1.0000], [0.5000], [0.5000], [1.0000], [0.4004], [0.3340], [0.2500], [0.0625]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0036468505859375 loss: 0.00154876708984375 loss: 0.001678466796875 loss: 0.00136566162109375 predicted value: tensor([[0.4629], [0.6953], [0.5312], [0.9805], [0.8242], [0.9141], [0.5977], [0.5508], [0.2480], [0.3516], [0.6406], [0.2891], [0.3145], [0.3613], [0.1533], [0.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.5547], [1.0000], [0.8320], [1.0000], [0.5000], [0.6016], [0.2500], [0.3340], [0.7500], [0.3340], [0.2852], [0.5000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00101470947265625 loss: 0.00148773193359375 loss: 0.0008697509765625 loss: 0.0012969970703125 56%|█████▌ | 276/492 [2:29:19<1:52:31, 31.26s/it] {'loss': 0.0074, 'learning_rate': 1e-05, 'epoch': 0.56} 56%|█████▌ | 276/492 [2:29:19<1:52:31, 31.26s/it]predicted value: tensor([[0.4004], [1.1172], [0.5547], [1.1172], [0.8789], [0.8242], [0.7656], [0.6211], [0.6367], [0.3398], [0.4160], [1.0078], [1.0234], [0.4336], [0.1992], [0.2275]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [1.0000], [0.8320], [0.4668], [0.4668], [0.6680], [0.7500], [0.2500], [0.2500], [1.0000], [1.0000], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029296875 loss: 0.0050048828125 loss: 0.0022125244140625 loss: 0.00469970703125 predicted value: tensor([[0.9336], [1.0625], [0.7773], [1.1016], [1.0703], [0.4746], [0.8750], [0.3926], [0.5039], [0.6328], [0.4473], [0.4785], [0.4141], [0.4883], [0.3965], [0.2490]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.8008], [1.0000], [1.0000], [0.2500], [0.8320], [0.2500], [0.6016], [0.5000], [0.3340], [0.5000], [0.3340], [0.2500], [0.3340], [0.2002]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.0048828125 loss: 0.003326416015625loss: 0.00179290771484375 loss: 0.0023345947265625 predicted value: tensor([[0.5352], [1.1328], [1.1094], [1.1094], [0.3125], [0.5234], [1.1016], [0.5273], [0.5664], [0.6836], [0.6680], [0.4961], [0.3594], [0.4180], [0.4316], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [1.0000], [0.3340], [0.4668], [1.0000], [0.8008], [0.4668], [0.4668], [0.6016], [0.3340], [0.5000], [0.5000], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00830078125 loss: 0.00408935546875loss: 0.0032806396484375 loss: 0.0028839111328125 predicted value: tensor([[0.4434], [0.4629], [0.5469], [0.5547], [0.2910], [0.6719], [1.0859], [0.7930], [0.6328], [0.4375], [0.5859], [0.4727], [0.6719], [0.3711], [0.2695], [0.2246]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.4668], [0.4668], [0.3340], [0.6016], [1.0000], [0.8008], [0.6016], [0.2500], [0.5000], [0.4004], [0.6016], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029296875 loss: 0.00144195556640625 loss: 0.00141143798828125 loss: 0.0028533935546875 56%|█████▋ | 277/492 [2:29:50<1:52:10, 31.30s/it] {'loss': 0.0136, 'learning_rate': 1e-05, 'epoch': 0.56} 56%|█████▋ | 277/492 [2:29:50<1:52:10, 31.30s/it]predicted value: tensor([[0.4512], [0.6953], [0.3594], [0.9062], [0.2891], [0.2559], [0.5508], [0.5352], [0.4551], [0.6836], [0.4570], [0.4746], [0.3379], [0.4863], [0.2373], [0.2383]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.8320], [0.3340], [0.2500], [0.4668], [0.3750], [0.4004], [0.6016], [0.4004], [0.4004], [0.4004], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003387451171875 loss: 0.00153350830078125 loss: 0.001983642578125 loss: 0.0037994384765625 predicted value: tensor([[0.8008], [0.8125], [0.2559], [1.0078], [1.1172], 
[1.1250], [0.2793], [0.3438], [0.4980], [1.1719], [0.5898], [0.0972], [0.4531], [0.1875], [0.2100], [0.2402]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8320], [0.2500], [0.8320], [1.0000], [1.0000], [0.2500], [0.2002], [0.5000], [1.0000], [0.6016], [0.0278], [0.4004], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016632080078125 loss: 0.00213623046875 loss: 0.00189208984375 loss: 0.004241943359375 predicted value: tensor([[0.5586], [0.5195], [0.5859], [0.2383], [0.8555], [0.8281], [0.5977], [0.4902], [0.4238], [0.5273], [0.6680], [0.4277], [0.5117], [0.4492], [0.2178], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.5547], [0.2500], [0.8008], [0.8008], [0.7500], [0.4668], [0.2500], [0.3750], [0.6016], [0.4004], [0.3340], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001861572265625 loss: 0.00213623046875loss: 0.004180908203125 loss: 0.002471923828125 predicted value: tensor([[1.0781], [0.6992], [0.4824], [0.7383], [1.1562], [0.4434], [0.6094], [0.2793], [0.4922], [0.4453], [0.7969], [0.4453], [0.4531], [0.5117], [0.2334], [0.2246]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.4668], [0.5547], [1.0000], [0.4668], [0.3750], [0.2002], [0.5000], [0.4004], [0.8008], [0.4004], [0.4004], [0.7500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003265380859375 loss: 0.002288818359375 loss: 0.004638671875loss: 0.003265380859375 57%|█████▋ | 278/492 [2:30:22<1:52:31, 31.55s/it] {'loss': 0.0112, 'learning_rate': 1e-05, 'epoch': 0.57} 57%|█████▋ | 278/492 [2:30:22<1:52:31, 31.55s/it]predicted value: tensor([[1.0781], [0.4277], [0.3301], [0.7656], [0.5977], [0.5234], [0.6445], [0.0703], [0.6445], [0.3184], [0.2852], [0.3828], [0.1484], [0.3105], [0.1064], [0.1504]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], 
[0.5000], [0.4668], [0.8008], [0.6016], [0.5547], [0.7500], [0.0278], [0.7500], [0.4004], [0.2002], [0.5000], [0.2002], [0.3340], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003082275390625 loss: 0.00164794921875 loss: 0.0015716552734375 loss: 0.0032806396484375 predicted value: tensor([[0.2754], [0.3789], [0.4434], [0.3945], [0.4883], [0.5234], [0.3105], [0.6328], [0.4961], [1.0312], [0.5469], [0.5938], [0.2871], [0.1328], [0.1260], [0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.6016], [0.4648], [0.7500], [0.6016], [0.3750], [0.8008], [0.6016], [1.0000], [0.7500], [0.7500], [0.3340], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003082275390625 loss: 0.0031280517578125 loss: 0.004364013671875 loss: 0.0037994384765625 predicted value: tensor([[0.4629], [0.3438], [0.4375], [1.0391], [0.6094], [0.4648], [0.2383], [0.6172], [0.2246], [1.0547], [0.3477], [0.3789], [0.4121], [0.1396], [0.1387], [0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.5547], [1.0000], [0.7500], [0.3750], [0.3340], [0.5547], [0.2500], [1.0000], [0.3340], [0.4004], [0.4004], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002777099609375 loss: 0.0015411376953125 loss: 0.002716064453125 loss: 0.0028228759765625 predicted value: tensor([[0.4180], [0.3105], [0.3223], [0.6875], [0.4043], [1.0312], [0.4316], [0.6719], [0.7305], [0.6562], [0.3613], [0.4512], [0.6211], [0.1846], [0.1572], [0.1768]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], [0.8008], [0.4668], [1.0000], [0.3340], [0.8008], [0.7148], [0.8008], [0.4004], [0.5000], [0.7500], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001434326171875 loss: 0.0021209716796875 loss: 0.002716064453125 loss: 0.00144195556640625 57%|█████▋ | 279/492 [2:30:53<1:51:22, 31.37s/it] 
{'loss': 0.0104, 'learning_rate': 1e-05, 'epoch': 0.57} 57%|█████▋ | 279/492 [2:30:53<1:51:22, 31.37s/it]predicted value: tensor([[0.4531], [1.0547], [0.7266], [0.3496], [0.3457], [0.5938], [0.2754], [1.0469], [0.5312], [0.3633], [0.2207], [0.1865], [0.2100], [0.2100], [0.1826], [0.1689]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [1.0000], [0.8008], [0.6680], [0.2002], [0.7500], [0.2500], [1.0000], [0.4277], [0.4668], [0.2002], [0.0400], [0.2500], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00152587890625 loss: 0.002532958984375 loss: 0.00372314453125 loss: 0.00122833251953125 predicted value: tensor([[0.2969], [1.0312], [1.0703], [0.2773], [1.0312], [0.4297], [0.7031], [1.0469], [0.2695], [0.7539], [0.4004], [0.5195], [0.3691], [0.1543], [0.1602], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [0.4668], [1.0000], [0.4668], [0.8008], [1.0000], [0.2002], [0.8320], [0.5000], [0.6016], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019073486328125 loss: 0.0018463134765625loss: 0.0032501220703125 loss: 0.00457763671875 predicted value: tensor([[1.0391], [0.5234], [0.7617], [0.4453], [0.9492], [0.5391], [0.3145], [0.4668], [0.4668], [1.0078], [0.1641], [0.1992], [0.5273], [0.3242], [0.1523], [0.9648]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.8320], [0.4668], [1.0000], [0.6016], [0.4668], [0.5000], [0.5000], [1.0000], [0.0278], [0.2002], [0.7500], [0.3340], [0.2002], [1.0000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00138092041015625 loss: 0.0029296875loss: 0.0035400390625 loss: 0.002197265625 predicted value: tensor([[1.0078], [0.1416], [1.0078], [0.9961], [1.0234], [0.5039], [0.6172], [0.7227], [0.2930], [0.5273], [0.3711], [0.4082], [0.3477], [0.3965], [0.1289], [0.2041]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value 
label: tensor([[1.0000], [0.2002], [1.0000], [1.0000], [1.0000], [0.5000], [0.8008], [0.8008], [0.2500], [0.6016], [0.2500], [0.6016], [0.5000], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000957489013671875 loss: 0.002227783203125 loss: 0.002105712890625 loss: 0.0022430419921875 57%|█████▋ | 280/492 [2:31:25<1:51:02, 31.43s/it] {'loss': 0.0095, 'learning_rate': 1e-05, 'epoch': 0.57} 57%|█████▋ | 280/492 [2:31:25<1:51:02, 31.43s/it]predicted value: tensor([[0.3164], [0.4062], [0.2520], [0.2119], [0.4492], [0.4434], [0.3359], [0.4082], [0.3887], [0.4062], [0.2314], [0.4238], [0.0417], [0.2773], [0.2832], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.3340], [0.2002], [0.4004], [0.4668], [0.3340], [0.2500], [0.2500], [0.2500], [0.0625], [0.4004], [0.0400], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001739501953125 loss: 0.00193023681640625 loss: 0.00170135498046875 loss: 0.002197265625 predicted value: tensor([[0.5273], [0.3867], [0.5664], [0.5273], [0.4434], [0.8438], [1.0703], [0.2910], [0.4492], [0.5117], [0.4004], [0.5039], [0.4043], [0.4258], [0.2832], [0.2373]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.5547], [0.4668], [0.4668], [0.8320], [1.0000], [0.2500], [0.4668], [0.3340], [0.4004], [0.5000], [0.4004], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00159454345703125 loss: 0.000888824462890625 loss: 0.000888824462890625 loss: 0.00225830078125 predicted value: tensor([[0.2285], [0.2988], [0.8320], [0.4316], [0.9023], [0.3281], [1.0703], [0.2559], [0.7148], [0.2471], [0.6016], [0.3418], [0.3906], [0.2715], [0.3223], [0.2314]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.1670], [0.2500], [0.8320], [0.4668], [0.8320], [0.3340], [1.0000], [0.3340], [0.7500], [0.2002], [0.6016], [0.2500], [0.3340], [0.2500], [0.2002], [0.2002]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.000850677490234375 loss: 0.000858306884765625loss: 0.002685546875 loss: 0.00157928466796875 predicted value: tensor([[0.3242], [0.8711], [0.4863], [0.1650], [0.8516], [1.1172], [1.0781], [0.3008], [0.3242], [0.5664], [0.5703], [1.0312], [0.4668], [0.0540], [0.2598], [0.2451]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.7500], [0.0156], [0.7148], [1.0000], [1.0000], [0.2500], [0.2500], [0.6016], [0.6016], [1.0000], [0.4004], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00182342529296875 loss: 0.002716064453125loss: 0.000904083251953125 loss: 0.001983642578125 57%|█████▋ | 281/492 [2:31:57<1:50:48, 31.51s/it] {'loss': 0.0066, 'learning_rate': 1e-05, 'epoch': 0.57} 57%|█████▋ | 281/492 [2:31:57<1:50:48, 31.51s/it]predicted value: tensor([[0.5820], [0.5664], [0.8164], [0.4551], [0.8125], [1.0391], [0.3398], [0.7461], [0.2734], [0.4062], [0.3516], [0.3691], [0.4785], [0.4609], [0.2432], [0.2207]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.5547], [0.8008], [0.4668], [0.8008], [1.0000], [0.2002], [0.7500], [0.2500], [0.4004], [0.0278], [0.3340], [0.3340], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002838134765625 loss: 0.0024871826171875 loss: 0.000759124755859375 loss: 0.0020751953125 predicted value: tensor([[0.2432], [0.1943], [1.0156], [1.0312], [0.6680], [1.0469], [0.5000], [0.6680], [0.5781], [1.0469], [0.4863], [0.6172], [0.4199], [0.2793], [0.2715], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.2500], [1.0000], [1.0000], [0.7500], [1.0000], [0.4668], [0.6680], [0.6016], [1.0000], [0.3750], [0.6016], [0.3340], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115203857421875 loss: 0.0008087158203125 loss: 0.00142669677734375 loss: 0.0020294189453125 predicted value: tensor([[0.5195], 
[0.3926], [1.0391], [0.5977], [0.8125], [0.4902], [0.5820], [0.4316], [0.4238], [0.4082], [0.6328], [0.5547], [0.7383], [0.2578], [0.2305], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.4668], [1.0000], [0.4668], [0.8008], [0.6016], [0.2676], [0.4668], [0.4668], [0.2500], [0.7500], [0.7500], [0.7500], [0.1670], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00110626220703125 loss: 0.003631591796875loss: 0.00164794921875 loss: 0.0034942626953125 predicted value: tensor([[0.7891], [0.4121], [0.6953], [0.7148], [0.5430], [0.3965], [0.8086], [0.2344], [0.4688], [0.9922], [0.5039], [0.4922], [0.2559], [0.3652], [0.2500], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.6680], [0.4668], [0.3750], [0.8008], [0.2002], [0.5000], [1.0000], [0.5000], [0.4004], [0.2002], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0038299560546875 loss: 0.00168609619140625 loss: 0.00133514404296875 loss: 0.001434326171875 57%|█████▋ | 282/492 [2:32:28<1:50:13, 31.49s/it] {'loss': 0.0079, 'learning_rate': 1e-05, 'epoch': 0.57} 57%|█████▋ | 282/492 [2:32:28<1:50:13, 31.49s/it]predicted value: tensor([[0.2852], [0.7070], [0.3711], [0.9258], [0.6484], [0.9414], [0.9648], [0.4102], [0.2598], [0.9375], [0.3672], [0.5469], [0.3633], [0.0459], [0.2090], [0.2051]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.3750], [1.0000], [0.6680], [1.0000], [1.0000], [0.4668], [0.3340], [1.0000], [0.4004], [0.6016], [0.4004], [0.0278], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002777099609375 loss: 0.000743865966796875loss: 0.00173187255859375 loss: 0.004791259765625 predicted value: tensor([[0.3242], [0.9102], [0.4102], [0.3066], [0.2100], [0.3965], [0.8906], [0.1973], [0.1855], [0.7383], [0.2002], [0.5664], [0.3574], [0.3867], [0.1465], [0.1846]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.3750], [0.2500], [0.3750], [1.0000], [0.2500], [0.2002], [0.8008], [0.3340], [0.7500], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00225830078125 loss: 0.002044677734375loss: 0.00142669677734375 loss: 0.001251220703125 predicted value: tensor([[0.9375], [0.8984], [0.9219], [0.7188], [0.3477], [0.9180], [0.5977], [0.6094], [0.1973], [0.4844], [0.5898], [0.5117], [0.3535], [0.2031], [0.1973], [0.1992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [1.0000], [0.8320], [0.4668], [1.0000], [0.5000], [0.6016], [0.3340], [0.6016], [0.7500], [0.6016], [0.5000], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00162506103515625 loss: 0.00396728515625 loss: 0.002410888671875 loss: 0.0024871826171875 predicted value: tensor([[0.5273], [0.1572], [0.4434], [0.4609], [0.6445], [0.9297], [0.9180], [0.3223], [0.8945], [0.7500], [0.3379], [0.2930], [0.5586], [0.4316], [0.1611], [0.1992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2002], [0.4668], [0.4668], [0.7500], [1.0000], [1.0000], [0.3340], [1.0000], [0.8008], [0.3340], [0.2500], [0.7500], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027923583984375 loss: 0.0025482177734375 loss: 0.00144195556640625 loss: 0.004302978515625 58%|█████▊ | 283/492 [2:33:00<1:49:44, 31.50s/it] {'loss': 0.0097, 'learning_rate': 1e-05, 'epoch': 0.58} 58%|█████▊ | 283/492 [2:33:00<1:49:44, 31.50s/it]predicted value: tensor([[0.4922], [0.5820], [0.6797], [0.7031], [0.8906], [0.4707], [0.6641], [0.5703], [0.5859], [0.6250], [0.4023], [0.3145], [0.2148], [0.1338], [0.3945], [0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8008], [0.8008], [1.0000], [0.4668], [0.6680], [0.7500], [0.6016], [0.6016], [0.4004], [0.4004], [0.2002], 
[0.2500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018157958984375 loss: 0.001495361328125 loss: 0.0026702880859375 loss: 0.0022125244140625 predicted value: tensor([[0.5000], [0.9141], [0.5781], [0.4277], [0.7148], [0.9180], [0.8672], [0.5586], [0.4375], [0.4062], [0.2256], [0.3809], [0.3164], [0.2090], [0.1689], [0.1973]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.4668], [0.8008], [1.0000], [1.0000], [0.8008], [0.4668], [0.4004], [0.2500], [0.4004], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018157958984375 loss: 0.0023956298828125 loss: 0.00125885009765625 loss: 0.00110626220703125 predicted value: tensor([[0.8203], [0.7969], [0.8125], [0.7812], [0.3848], [0.2637], [0.4746], [0.4062], [0.5078], [0.6172], [0.2295], [0.8984], [0.6172], [0.3809], [0.2041], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8320], [0.8008], [0.8320], [0.4668], [0.2002], [0.6016], [0.4668], [0.6016], [0.6016], [0.2002], [1.0000], [0.7500], [0.2852], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021209716796875 loss: 0.00128936767578125 loss: 0.001312255859375 loss: 0.0025634765625 predicted value: tensor([[0.7344], [0.6172], [0.5781], [0.3691], [0.5469], [0.9414], [0.4004], [0.9023], [0.1387], [0.3730], [0.5391], [0.5039], [0.2969], [0.2988], [0.1904], [0.1582]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.5547], [0.4668], [0.3750], [0.6016], [1.0000], [0.3145], [1.0000], [0.2500], [0.4004], [0.5000], [0.6016], [0.2500], [0.0625], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001983642578125 loss: 0.001556396484375 loss: 0.002044677734375 loss: 0.001129150390625 58%|█████▊ | 284/492 [2:33:31<1:48:51, 31.40s/it] {'loss': 0.0072, 'learning_rate': 1e-05, 'epoch': 0.58} 58%|█████▊ | 284/492 [2:33:31<1:48:51, 31.40s/it]predicted 
value: tensor([[0.7930], [0.4961], [0.7031], [0.8789], [0.7930], [0.9766], [0.9648], [0.7852], [0.6445], [0.6719], [0.6758], [0.5703], [0.5625], [0.4355], [0.2773], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3750], [0.4668], [0.8008], [0.8008], [1.0000], [1.0000], [0.8008], [0.6016], [0.6016], [0.5000], [0.5000], [0.5000], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000896453857421875 loss: 0.002288818359375 loss: 0.003814697265625 loss: 0.0017547607421875 predicted value: tensor([[1.0312], [0.5977], [0.8555], [0.8633], [0.9844], [0.8281], [0.5977], [0.3223], [0.6992], [0.3145], [0.8320], [0.7266], [0.4297], [0.2812], [0.2715], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.8008], [0.8320], [1.0000], [0.8008], [0.7500], [0.3340], [0.6016], [0.3340], [0.8008], [0.5703], [0.5000], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00180816650390625 loss: 0.0012969970703125 loss: 0.00153350830078125 loss: 0.000946044921875 predicted value: tensor([[0.6016], [0.4902], [0.9023], [0.3145], [0.3652], [0.7617], [0.1924], [0.5625], [0.3535], [0.8672], [0.9766], [0.6914], [0.2393], [0.4141], [0.1768], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8008], [0.2500], [0.2002], [0.6680], [0.2002], [0.6680], [0.2500], [0.8008], [1.0000], [0.7500], [0.2500], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021514892578125 loss: 0.0014495849609375 loss: 0.00133514404296875 loss: 0.002838134765625 predicted value: tensor([[0.9688], [0.5977], [1.0000], [0.5039], [0.6445], [0.4941], [0.3906], [0.3848], [0.5898], [0.5742], [0.8438], [0.4219], [0.4062], [0.3984], [0.2471], [0.2305]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [1.0000], [0.4668], [0.5547], [0.4668], [0.3340], [0.3340], 
[0.6016], [0.6016], [0.8008], [0.2500], [0.3340], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001983642578125 loss: 0.00115203857421875 loss: 0.00104522705078125 loss: 0.0012664794921875 58%|█████▊ | 285/492 [2:34:02<1:48:21, 31.41s/it] {'loss': 0.0069, 'learning_rate': 1e-05, 'epoch': 0.58} 58%|█████▊ | 285/492 [2:34:02<1:48:21, 31.41s/it]predicted value: tensor([[0.5898], [0.5195], [0.5742], [1.0156], [0.4590], [0.4609], [0.9844], [0.5977], [0.5078], [0.4590], [0.1924], [0.5586], [0.5469], [0.2637], [0.2148], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.4668], [0.3750], [1.0000], [0.2500], [0.3145], [1.0000], [0.5000], [0.3340], [0.4004], [0.0278], [0.4668], [0.5000], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00122833251953125 loss: 0.00311279296875 loss: 0.0035858154296875 loss: 0.0032958984375 predicted value: tensor([[0.7617], [0.2656], [0.5156], [0.4570], [1.0312], [0.8438], [0.7578], [0.6445], [0.7031], [0.8086], [0.6211], [0.4082], [0.2637], [0.4629], [0.2324], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [0.3750], [0.4668], [1.0000], [0.8008], [0.8008], [0.5000], [0.6016], [0.7148], [0.3750], [0.4004], [0.2002], [0.4277], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001129150390625 loss: 0.0021820068359375 loss: 0.0012969970703125 loss: 0.00171661376953125 predicted value: tensor([[0.8594], [0.5547], [0.5859], [0.8555], [1.0234], [0.2852], [0.5664], [0.3555], [0.6797], [0.6602], [0.5508], [0.4492], [0.4199], [0.4863], [0.2393], [0.2432]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.5547], [0.8008], [1.0000], [0.3340], [0.3750], [0.2500], [0.4668], [0.7500], [0.3750], [0.3340], [0.3340], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00191497802734375 loss: 0.0026397705078125 loss: 
0.000705718994140625 loss: 0.00128173828125 predicted value: tensor([[0.5859], [0.5352], [1.0312], [1.0156], [0.4258], [0.4863], [0.2676], [0.8672], [0.6797], [0.3359], [0.6016], [0.4414], [0.4512], [0.3750], [0.2373], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [1.0000], [1.0000], [0.2500], [0.4668], [0.2500], [0.8008], [0.6016], [0.2500], [0.6016], [0.4004], [0.4004], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018157958984375 loss: 0.00174713134765625 loss: 0.0028228759765625 loss: 0.0034637451171875 58%|█████▊ | 286/492 [2:34:34<1:48:09, 31.50s/it] {'loss': 0.0085, 'learning_rate': 1e-05, 'epoch': 0.58} 58%|█████▊ | 286/492 [2:34:34<1:48:09, 31.50s/it]predicted value: tensor([[0.9297], [0.2734], [0.9531], [0.3477], [0.2197], [0.4023], [0.3086], [0.4316], [0.5820], [0.2578], [0.2520], [0.8867], [0.3906], [0.2695], [0.1338], [0.3711]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [1.0000], [0.4668], [0.2500], [0.3750], [0.3340], [0.4668], [0.6016], [0.3340], [0.2002], [1.0000], [0.5000], [0.3340], [0.2500], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001434326171875 loss: 0.00121307373046875 loss: 0.00213623046875 loss: 0.0014801025390625 predicted value: tensor([[0.4590], [0.6836], [0.9531], [0.4355], [0.8984], [0.4531], [0.1885], [0.4004], [0.4414], [0.1523], [0.5625], [0.3320], [0.5039], [0.2969], [0.1069], [0.0947]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6680], [1.0000], [0.4668], [1.0000], [0.4668], [0.2002], [0.3750], [0.6016], [0.0400], [0.6016], [0.3340], [0.6016], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013427734375 loss: 0.0020904541015625loss: 0.0022735595703125 loss: 0.00118255615234375 predicted value: tensor([[0.4219], [0.7695], [0.9258], [0.9766], [0.4023], [0.2109], [0.4258], [0.6484], [0.4434], [0.4062], [0.4629], 
[0.3359], [0.3105], [0.2236], [0.2578], [0.1338]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [1.0000], [1.0000], [0.4668], [0.2002], [0.3750], [0.8008], [0.3340], [0.5000], [0.3340], [0.5000], [0.3340], [0.2002], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003173828125 loss: 0.002166748046875 loss: 0.00154876708984375 loss: 0.001495361328125 predicted value: tensor([[0.9531], [0.6953], [0.4180], [0.9688], [0.5078], [0.5625], [0.4863], [0.1406], [0.5312], [0.5859], [0.5586], [0.5273], [0.3906], [0.5273], [0.2422], [0.1050]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [0.4668], [1.0000], [0.5547], [0.6016], [0.6016], [0.2500], [0.6016], [0.7500], [0.6016], [0.5000], [0.4004], [0.7500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009765625 loss: 0.0023956298828125 loss: 0.00151824951171875 loss: 0.0032196044921875 58%|█████▊ | 287/492 [2:35:05<1:47:26, 31.45s/it] {'loss': 0.0074, 'learning_rate': 1e-05, 'epoch': 0.58} 58%|█████▊ | 287/492 [2:35:05<1:47:26, 31.45s/it]predicted value: tensor([[0.9648], [0.6250], [0.6953], [0.4922], [0.1592], [0.3066], [0.2158], [0.9805], [0.6250], [0.2129], [0.6211], [0.2715], [0.2773], [0.1660], [0.3613], [0.1050]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.8008], [0.5547], [0.2500], [0.2500], [0.2002], [1.0000], [0.8008], [0.2500], [0.7500], [0.5000], [0.2500], [0.1670], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00182342529296875 loss: 0.002685546875loss: 0.00093841552734375 loss: 0.00156402587890625 predicted value: tensor([[0.7578], [0.3750], [0.4355], [0.8047], [0.5781], [0.9375], [0.9180], [0.1826], [0.3418], [0.3965], [0.3672], [0.6289], [0.2578], [0.3633], [0.3711], [0.1377]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.5547], [0.8320], [0.7500], [1.0000], 
[1.0000], [0.2002], [0.3340], [0.4668], [0.2500], [0.6680], [0.2500], [0.3340], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001312255859375 loss: 0.0035400390625 loss: 0.00186920166015625 loss: 0.000644683837890625 predicted value: tensor([[0.4785], [0.3965], [0.7461], [0.6914], [0.7227], [0.1660], [0.7188], [0.9258], [0.6406], [0.4590], [0.9609], [0.9375], [0.3262], [0.6523], [0.1621], [0.1250]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.4668], [0.8008], [0.8320], [0.8008], [0.3340], [0.8008], [1.0000], [0.6680], [0.6016], [1.0000], [1.0000], [0.4004], [0.7500], [0.0625], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004302978515625 loss: 0.00238037109375 loss: 0.0020599365234375 loss: 0.0028076171875 predicted value: tensor([[0.4824], [1.0000], [0.4512], [0.4824], [0.7227], [0.6367], [0.6719], [0.9727], [0.5000], [0.5820], [0.9570], [0.7422], [0.8672], [0.3789], [0.1709], [0.1064]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.4648], [0.8008], [0.4668], [0.8008], [1.0000], [0.6016], [0.5000], [1.0000], [0.8320], [1.0000], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023193359375 loss: 0.00225830078125loss: 0.0032501220703125 loss: 0.000675201416015625 59%|█████▊ | 288/492 [2:35:36<1:46:38, 31.37s/it] {'loss': 0.0086, 'learning_rate': 1e-05, 'epoch': 0.59} 59%|█████▊ | 288/492 [2:35:36<1:46:38, 31.37s/it]predicted value: tensor([[0.4922], [0.5117], [0.4590], [0.2695], [0.8516], [0.5742], [1.0469], [0.6406], [1.0469], [0.3281], [0.5117], [0.4492], [0.4316], [0.2002], [0.2559], [0.2275]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.2500], [0.8008], [0.4668], [1.0000], [0.6016], [1.0000], [0.2500], [0.6016], [0.3340], [0.3340], [0.1670], [0.2500], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002197265625 loss: 0.00103759765625 
loss: 0.00153350830078125 loss: 0.0017242431640625 predicted value: tensor([[0.2754], [0.5234], [1.0703], [0.7109], [0.7930], [0.7734], [0.7266], [0.5430], [0.5430], [0.5586], [0.5781], [0.3906], [1.0234], [0.4316], [0.2031], [0.2305]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.4668], [1.0000], [0.8008], [0.6680], [0.6680], [0.6016], [0.5547], [0.6016], [0.4004], [0.3340], [0.4004], [1.0000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009765625 loss: 0.0023651123046875loss: 0.002471923828125 loss: 0.00323486328125 predicted value: tensor([[0.8477], [0.8125], [0.2256], [0.2354], [0.8164], [0.7617], [0.7695], [0.4668], [0.7617], [0.4512], [0.4004], [0.3359], [0.3867], [0.2148], [0.2227], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.8008], [0.2500], [0.2002], [0.6680], [0.8008], [0.6680], [0.3750], [0.7500], [0.4668], [0.2002], [0.2500], [0.2500], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015869140625 loss: 0.00174713134765625loss: 0.001861572265625 loss: 0.001129150390625 predicted value: tensor([[0.4629], [0.6172], [0.3066], [0.4727], [0.5117], [1.0703], [1.0859], [0.6641], [0.4824], [0.6445], [0.5117], [0.4844], [0.6133], [0.4297], [0.2188], [0.1650]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6172], [0.2500], [0.3750], [0.4668], [1.0000], [1.0000], [0.6016], [0.3145], [0.7500], [0.5000], [0.5000], [0.6016], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00238037109375 loss: 0.00173187255859375 loss: 0.00115203857421875 loss: 0.00182342529296875 59%|█████▊ | 289/492 [2:36:08<1:46:22, 31.44s/it] {'loss': 0.0072, 'learning_rate': 1e-05, 'epoch': 0.59} 59%|█████▊ | 289/492 [2:36:08<1:46:22, 31.44s/it]predicted value: tensor([[0.4648], [0.4902], [0.8203], [1.0781], [0.5117], [0.4004], [0.3066], [0.4512], [1.0625], [0.4551], [0.7812], 
[0.6328], [0.3828], [0.3281], [0.1875], [0.2139]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8008], [1.0000], [0.4668], [0.3145], [0.2500], [0.3340], [1.0000], [0.4668], [0.6680], [0.5000], [0.4004], [0.2852], [0.0278], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000885009765625 loss: 0.00061798095703125 loss: 0.001495361328125 loss: 0.00311279296875 predicted value: tensor([[0.7969], [0.8633], [0.7969], [0.7070], [0.4648], [0.3438], [0.2832], [0.3887], [0.3340], [0.4863], [0.8086], [0.5820], [0.5195], [0.4121], [0.2021], [0.2402]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8320], [0.8320], [0.6016], [0.4668], [0.2002], [0.2500], [0.4668], [0.2500], [0.6016], [0.8320], [0.6016], [0.4668], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00095367431640625 loss: 0.00115966796875loss: 0.0015411376953125 loss: 0.00084686279296875 predicted value: tensor([[0.5742], [1.0391], [1.0859], [0.4785], [0.6250], [0.8516], [0.6641], [1.0703], [0.6758], [0.5430], [0.5586], [0.6367], [0.4395], [0.4082], [0.4141], [0.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.4668], [0.8008], [0.6680], [0.7500], [1.0000], [0.5547], [0.6016], [0.6016], [0.6016], [0.5000], [0.3340], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00109100341796875 loss: 0.00193023681640625loss: 0.0010223388671875 loss: 0.00177001953125 predicted value: tensor([[0.8477], [0.4883], [0.7969], [0.7070], [0.5312], [0.8789], [0.2832], [0.5547], [0.6914], [0.2490], [0.4766], [0.3594], [0.3906], [0.4473], [0.2148], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.8320], [0.4668], [0.5547], [0.8320], [0.2500], [0.6016], [0.6016], [0.2002], [0.4004], [0.2500], [0.3340], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.00165557861328125 loss: 0.001617431640625loss: 0.00154876708984375 loss: 0.00194549560546875 59%|█████▉ | 290/492 [2:36:39<1:45:51, 31.44s/it] {'loss': 0.0058, 'learning_rate': 1e-05, 'epoch': 0.59} 59%|█████▉ | 290/492 [2:36:39<1:45:51, 31.44s/it]predicted value: tensor([[0.4316], [0.4258], [0.3887], [0.9648], [0.3613], [0.3281], [0.9141], [0.9570], [0.5586], [0.4219], [0.5938], [0.1758], [0.3457], [0.4902], [0.1562], [0.1299]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [1.0000], [0.4668], [0.4004], [1.0000], [1.0000], [0.5547], [0.6016], [0.6016], [0.0625], [0.4004], [0.8008], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023193359375 loss: 0.0029449462890625 loss: 0.0018310546875 loss: 0.00119781494140625 predicted value: tensor([[0.3438], [0.9727], [0.9648], [0.6641], [0.5898], [0.5859], [0.6406], [0.4414], [0.3672], [0.2598], [0.3535], [0.4980], [0.2812], [0.3574], [0.1035], [0.1465]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [1.0000], [0.8008], [0.6016], [0.6680], [0.8008], [0.4668], [0.4668], [0.2500], [0.5000], [0.5000], [0.4004], [0.5000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018768310546875 loss: 0.002227783203125loss: 0.00182342529296875 loss: 0.00116729736328125 predicted value: tensor([[1.0000], [0.7148], [0.8008], [1.0000], [0.2598], [0.9727], [0.5430], [0.9141], [0.2256], [0.1758], [0.3457], [0.2441], [0.5859], [0.3711], [0.1826], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.8320], [1.0000], [0.2500], [1.0000], [0.6680], [1.0000], [0.3340], [0.2500], [0.4004], [0.2500], [0.6016], [0.5000], [0.2500], [0.2852]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003387451171875 loss: 0.001251220703125 loss: 0.0023193359375 loss: 0.002960205078125 predicted value: tensor([[0.7656], [0.3984], [0.3926], [0.1582], [0.4766], [0.9531], 
[0.7031], [0.4785], [0.4551], [0.3164], [0.2305], [0.6172], [0.2227], [0.2363], [0.1357], [0.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.6016], [0.1670], [0.5547], [1.0000], [0.8008], [0.7500], [0.4668], [0.2500], [0.2500], [0.6016], [0.5000], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00121307373046875 loss: 0.0015411376953125 loss: 0.0025787353515625 loss: 0.003631591796875 59%|█████▉ | 291/492 [2:37:11<1:45:12, 31.40s/it] {'loss': 0.0086, 'learning_rate': 1e-05, 'epoch': 0.59} 59%|█████▉ | 291/492 [2:37:11<1:45:12, 31.40s/it]predicted value: tensor([[0.3809], [0.7578], [0.5625], [0.5547], [0.9688], [0.2148], [0.9258], [0.5547], [0.6523], [0.4512], [0.9453], [0.3887], [0.4863], [0.1387], [0.1201], [0.1079]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.4668], [0.8320], [1.0000], [0.3340], [1.0000], [0.5547], [0.7500], [0.7500], [1.0000], [0.4004], [0.6016], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001983642578125 loss: 0.00154876708984375 loss: 0.00396728515625 loss: 0.002777099609375 predicted value: tensor([[0.4883], [0.9414], [0.3359], [0.6094], [0.6445], [0.7422], [0.4023], [0.3672], [0.6562], [0.6406], [0.4238], [0.9180], [0.4141], [0.3770], [0.1338], [0.1455]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3145], [0.6680], [0.6680], [0.8008], [0.4668], [0.5000], [0.8320], [0.7500], [0.5000], [1.0000], [0.5000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0017852783203125 loss: 0.00164031982421875 loss: 0.0013885498046875 loss: 0.0031280517578125 predicted value: tensor([[0.5234], [0.4082], [0.9648], [0.2041], [0.2266], [0.9492], [0.4141], [0.5547], [0.9531], [0.6289], [0.3281], [0.4727], [0.3184], [0.1924], [0.1777], [0.1768]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.6172], [0.3750], [1.0000], [0.2500], [0.2500], [1.0000], [0.4668], [0.6680], [1.0000], [0.6016], [0.5000], [0.4277], [0.2500], [0.2500], [0.1426], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015869140625 loss: 0.001190185546875 loss: 0.00128173828125 loss: 0.003143310546875 predicted value: tensor([[0.3770], [0.4336], [0.6094], [0.5859], [0.4043], [0.5625], [0.4980], [0.9688], [0.6172], [0.4961], [0.5391], [0.1592], [0.3613], [0.3262], [0.1455], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.4668], [0.4668], [0.4668], [0.7500], [0.8008], [1.0000], [0.6680], [0.6016], [0.5000], [0.0625], [0.4004], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001739501953125 loss: 0.0019683837890625 loss: 0.001068115234375 loss: 0.003692626953125 59%|█████▉ | 292/492 [2:37:42<1:44:52, 31.46s/it] {'loss': 0.0085, 'learning_rate': 1e-05, 'epoch': 0.59} 59%|█████▉ | 292/492 [2:37:42<1:44:52, 31.46s/it]predicted value: tensor([[1.0156], [1.0000], [0.4766], [0.7031], [0.9922], [0.7461], [0.6562], [0.3555], [0.4082], [0.7617], [0.4258], [0.2773], [0.4492], [0.2422], [0.3789], [0.2314]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.6680], [1.0000], [0.6680], [0.6016], [0.3340], [0.3340], [0.8008], [0.5000], [0.2002], [0.4004], [0.3340], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001312255859375 loss: 0.00087738037109375loss: 0.0029449462890625 loss: 0.0009613037109375 predicted value: tensor([[0.8281], [0.7695], [0.4707], [0.2734], [0.5156], [0.7852], [0.3574], [0.5039], [0.7109], [0.5898], [0.7383], [0.6055], [0.3848], [0.4004], [0.2285], [0.2070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.7148], [0.4668], [0.3340], [0.5000], [0.8008], [0.3340], [0.5547], [0.7500], [0.6016], [0.6680], [0.6016], [0.4004], [0.4004], [0.1670], [0.2002]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.000705718994140625 loss: 0.000339508056640625 loss: 0.00148773193359375 loss: 0.002655029296875 predicted value: tensor([[0.4707], [0.4844], [0.7891], [0.5664], [0.7422], [0.5820], [0.4570], [0.5664], [0.7188], [0.4648], [0.5000], [0.5586], [0.4355], [0.4375], [0.2490], [0.2324]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8555], [0.8008], [0.8008], [0.4668], [0.3750], [0.6016], [0.7500], [0.3750], [0.4004], [0.5000], [0.5000], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00140380859375 loss: 0.00180816650390625loss: 0.000782012939453125 loss: 0.001800537109375 predicted value: tensor([[0.7812], [0.5352], [1.0234], [0.8203], [0.5859], [0.6836], [1.0078], [0.6836], [0.4160], [0.5742], [0.5273], [0.2676], [0.5391], [0.2637], [0.2363], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.5547], [1.0000], [0.8320], [0.5547], [0.8008], [1.0000], [0.8320], [0.4668], [0.5000], [0.6016], [0.2002], [0.6016], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0007171630859375 loss: 0.0010223388671875 loss: 0.00127410888671875 loss: 0.00107574462890625 60%|█████▉ | 293/492 [2:38:14<1:44:29, 31.50s/it] {'loss': 0.0053, 'learning_rate': 1e-05, 'epoch': 0.6} 60%|█████▉ | 293/492 [2:38:14<1:44:29, 31.50s/it]predicted value: tensor([[0.4961], [0.4746], [0.9766], [0.2734], [1.0234], [0.7070], [0.3340], [0.3926], [0.2969], [0.7578], [0.4512], [0.9844], [0.3262], [0.3066], [0.2871], [0.2139]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.2500], [1.0000], [0.7500], [0.2500], [0.6016], [0.3340], [0.6680], [0.2500], [1.0000], [0.2002], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00102996826171875 loss: 0.0020599365234375 loss: 0.000797271728515625 loss: 0.0010986328125 predicted value: tensor([[0.9922], [0.4785], 
[0.4961], [0.4727], [0.4902], [0.7422], [0.3281], [0.9961], [0.9648], [0.9805], [0.6758], [1.0156], [0.6523], [0.2754], [0.1689], [0.2432]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.4668], [0.4668], [0.6680], [0.2500], [1.0000], [1.0000], [1.0000], [0.6680], [1.0000], [0.7500], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00107574462890625 loss: 0.0005035400390625 loss: 0.00124359130859375 loss: 0.000720977783203125 predicted value: tensor([[0.4805], [0.7969], [0.4473], [0.2754], [0.5508], [0.4883], [0.5117], [0.6172], [0.2471], [0.7812], [0.5859], [0.9844], [0.3652], [0.4062], [0.2256], [0.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.4668], [0.3340], [0.7500], [0.4668], [0.4668], [0.4668], [0.2500], [0.8008], [0.6016], [1.0000], [0.2500], [0.3340], [0.1670], [0.0400]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000316619873046875 loss: 0.00186920166015625loss: 0.000576019287109375 loss: 0.001678466796875 predicted value: tensor([[0.2871], [0.4492], [0.8320], [0.4648], [0.5117], [0.3281], [0.7305], [0.7930], [0.7227], [0.5430], [0.3652], [0.4512], [0.3906], [0.3848], [0.2305], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.3750], [0.8320], [0.4668], [0.4668], [0.3340], [0.8320], [0.8008], [0.5703], [0.4668], [0.3340], [0.4004], [0.4004], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00167083740234375 loss: 0.0009765625 loss: 0.00110626220703125 loss: 0.000644683837890625 60%|█████▉ | 294/492 [2:38:46<1:44:19, 31.61s/it] {'loss': 0.0043, 'learning_rate': 1e-05, 'epoch': 0.6} 60%|█████▉ | 294/492 [2:38:46<1:44:19, 31.61s/it]predicted value: tensor([[0.4727], [0.9609], [0.3613], [0.3867], [0.9258], [0.1895], [0.4941], [0.9688], [0.4902], [0.5938], [0.5391], [0.2041], [0.1943], [0.1660], [0.3789], [0.1260]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.3750], [0.4668], [1.0000], [0.2500], [0.8008], [1.0000], [0.7500], [0.6016], [0.6016], [0.0625], [0.2002], [0.2002], [0.5000], [0.1250]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00130462646484375 loss: 0.0034332275390625 loss: 0.002227783203125 loss: 0.0021209716796875 predicted value: tensor([[0.7305], [0.9688], [0.3555], [0.6289], [0.4102], [0.3535], [0.6016], [0.3027], [0.6797], [0.3418], [0.3770], [0.3359], [0.2383], [0.3281], [0.1758], [0.1660]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [1.0000], [0.4668], [0.6016], [0.3750], [0.3750], [0.6016], [0.3340], [0.6680], [0.4004], [0.5000], [0.3340], [0.0400], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00156402587890625 loss: 0.00152587890625loss: 0.00177764892578125 loss: 0.0021514892578125 predicted value: tensor([[0.4238], [0.9297], [0.9258], [0.7148], [0.3867], [0.3770], [0.3438], [0.6484], [0.2383], [0.5703], [0.5625], [0.3965], [0.3633], [0.2109], [0.1426], [0.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.8008], [0.4668], [0.4668], [0.4668], [0.7500], [0.2500], [0.8008], [0.6016], [0.4004], [0.4004], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002044677734375 loss: 0.0020751953125 loss: 0.00176239013671875 loss: 0.00141143798828125 predicted value: tensor([[ 0.3711], [ 0.9453], [ 0.4082], [ 0.0938], [ 0.3379], [ 0.7227], [ 0.1885], [ 0.3867], [ 0.6328], [ 0.7070], [ 0.1719], [-0.0237], [ 0.3633], [ 0.1113], [ 0.3340], [ 0.1172]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [0.2500], [0.3340], [0.8008], [0.2500], [0.5547], [0.7500], [0.8008], [0.2002], [0.0625], [0.4004], [0.1426], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00165557861328125 loss: 0.00177001953125 loss: 
0.0027313232421875 loss: 0.0033111572265625 60%|█████▉ | 295/492 [2:39:17<1:43:38, 31.56s/it] {'loss': 0.0082, 'learning_rate': 1e-05, 'epoch': 0.6} 60%|█████▉ | 295/492 [2:39:17<1:43:38, 31.56s/it]predicted value: tensor([[0.9570], [0.5508], [0.9727], [0.9531], [0.1904], [0.7148], [0.9297], [0.5938], [0.5898], [0.3223], [0.4023], [0.5898], [0.2949], [0.1826], [0.1621], [0.1533]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8555], [1.0000], [1.0000], [0.3340], [0.8008], [1.0000], [0.6016], [0.6016], [0.4668], [0.4004], [0.6016], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00286865234375 loss: 0.0021209716796875 loss: 0.0026702880859375 loss: 0.002685546875 predicted value: tensor([[0.4355], [0.2168], [0.3691], [0.2324], [0.9570], [0.9453], [0.6680], [0.5586], [0.5859], [0.3691], [0.2334], [0.4121], [0.4766], [0.4512], [0.1426], [0.2080]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [0.4668], [0.3340], [1.0000], [1.0000], [0.6016], [0.6016], [0.6016], [0.3340], [0.2500], [0.4004], [0.5000], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00148773193359375 loss: 0.00107574462890625loss: 0.001953125 loss: 0.003936767578125 predicted value: tensor([[0.4648], [0.9883], [0.7773], [0.7695], [0.1855], [0.2441], [0.5273], [0.4414], [0.4336], [0.3359], [0.1641], [0.9375], [0.3359], [0.3965], [0.1484], [0.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8320], [0.8320], [0.2002], [0.3340], [0.6016], [0.4004], [0.5000], [0.4668], [0.2500], [1.0000], [0.4004], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025787353515625 loss: 0.0012664794921875 loss: 0.0029144287109375 loss: 0.0015106201171875 predicted value: tensor([[0.9531], [0.9648], [0.6836], [0.7656], [0.9531], [0.3789], [0.9805], [0.7305], [0.5859], [0.1030], [0.2334], [0.3145], 
[0.3379], [0.1709], [0.3184], [0.1465]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.6680], [0.8008], [1.0000], [0.4668], [1.0000], [0.8320], [0.7500], [0.0400], [0.2002], [0.4004], [0.2852], [0.2002], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013427734375 loss: 0.00102996826171875 loss: 0.00145721435546875 loss: 0.002197265625 60%|██████ | 296/492 [2:39:50<1:43:49, 31.79s/it] {'loss': 0.0083, 'learning_rate': 1e-05, 'epoch': 0.6} 60%|██████ | 296/492 [2:39:50<1:43:49, 31.79s/it]predicted value: tensor([[0.5273], [0.7656], [0.4375], [0.5195], [0.2021], [1.0469], [0.6758], [0.6641], [0.3574], [0.4258], [0.6914], [0.5195], [0.3594], [0.2432], [0.2197], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [0.3750], [0.4668], [0.2002], [1.0000], [0.6016], [0.5000], [0.2500], [0.4277], [0.7500], [0.4668], [0.4004], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011444091796875 loss: 0.001220703125 loss: 0.001251220703125 loss: 0.00244140625 predicted value: tensor([[0.2715], [0.2871], [0.5273], [0.6914], [0.6758], [0.6406], [1.0547], [0.5391], [0.6680], [0.6680], [0.3633], [0.6367], [0.2100], [0.2188], [0.2100], [0.1787]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.2500], [0.4668], [0.5703], [0.5000], [0.7500], [1.0000], [0.6016], [0.3750], [0.6016], [0.2002], [0.5000], [0.0400], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0008087158203125 loss: 0.003692626953125 loss: 0.0028228759765625 loss: 0.0020294189453125 predicted value: tensor([[0.3086], [0.8125], [0.8711], [1.0547], [0.8242], [0.4844], [0.2676], [0.4277], [0.4863], [0.4551], [0.4004], [0.2305], [0.5273], [0.5234], [0.4414], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.6680], [0.8320], [1.0000], [0.8008], [0.4668], [0.2500], [0.4668], 
[0.4668], [0.4004], [0.2500], [0.2002], [0.5000], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001617431640625 loss: 0.0011444091796875 loss: 0.000934600830078125 loss: 0.001220703125 predicted value: tensor([[0.8359], [0.4863], [0.6016], [0.7812], [0.4395], [0.5391], [1.0312], [0.2734], [0.6133], [0.7734], [0.4961], [0.6836], [0.4023], [0.2139], [0.2119], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.5312], [0.8008], [0.4668], [0.5547], [1.0000], [0.2500], [0.5000], [0.7500], [0.4004], [0.6016], [0.4004], [0.1670], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023651123046875 loss: 0.000919342041015625 loss: 0.000644683837890625 loss: 0.000690460205078125 60%|██████ | 297/492 [2:40:22<1:43:57, 31.99s/it] {'loss': 0.0062, 'learning_rate': 1e-05, 'epoch': 0.6} 60%|██████ | 297/492 [2:40:22<1:43:57, 31.99s/it]predicted value: tensor([[0.5352], [0.3359], [0.7969], [0.5117], [0.7461], [0.4648], [0.4434], [0.4453], [0.2402], [0.6797], [0.4141], [1.0312], [0.5820], [0.3418], [0.2617], [0.2090]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [0.8320], [0.4648], [0.6680], [0.3750], [0.3750], [0.3340], [0.3340], [0.6016], [0.4004], [1.0000], [0.5000], [0.2500], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000518798828125 loss: 0.0011138916015625 loss: 0.00183868408203125 loss: 0.00067901611328125 predicted value: tensor([[0.4102], [0.2656], [0.4551], [0.8086], [0.6797], [1.0625], [0.2236], [0.3984], [0.3398], [0.5586], [0.6406], [0.4316], [0.3867], [0.3594], [0.1895], [0.4961]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.2500], [0.4668], [0.8008], [0.5547], [1.0000], [0.2500], [0.2715], [0.2500], [0.5000], [0.5000], [0.4004], [0.4004], [0.3340], [0.1670], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0005340576171875 loss: 
0.00112152099609375loss: 0.00180816650390625 loss: 0.0010986328125 predicted value: tensor([[0.4961], [0.5156], [0.4512], [1.0781], [0.4297], [0.7344], [0.6250], [0.7891], [0.7969], [0.2637], [0.5078], [0.4043], [0.4199], [0.4844], [0.2012], [0.2412]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [1.0000], [0.4668], [0.8008], [0.6016], [0.8008], [0.8320], [0.2002], [0.4004], [0.4004], [0.5000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010986328125 loss: 0.000766754150390625 loss: 0.0016021728515625 loss: 0.000675201416015625 predicted value: tensor([[1.0625], [1.0391], [0.4082], [0.6016], [0.5938], [1.0234], [0.2471], [0.3066], [0.2656], [1.0469], [0.6836], [0.3164], [1.0234], [0.4023], [0.4258], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.5547], [0.6016], [1.0000], [0.2500], [0.3340], [0.2002], [1.0000], [0.2500], [0.2002], [1.0000], [0.5000], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00101470947265625 loss: 0.0036163330078125 loss: 0.0016632080078125 loss: 0.00112152099609375 61%|██████ | 298/492 [2:40:55<1:44:29, 32.32s/it] {'loss': 0.0051, 'learning_rate': 1e-05, 'epoch': 0.61} 61%|██████ | 298/492 [2:40:55<1:44:29, 32.32s/it]predicted value: tensor([[0.7695], [1.0312], [0.5938], [0.6680], [0.5820], [0.9805], [0.9609], [0.1279], [0.4844], [0.3242], [0.2002], [0.4180], [0.3574], [0.1680], [0.1729], [0.1074]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [1.0000], [0.7500], [0.8008], [0.6016], [1.0000], [1.0000], [0.2500], [0.6016], [0.4004], [0.5000], [0.4277], [0.4004], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0022430419921875 loss: 0.002105712890625 loss: 0.00323486328125 loss: 0.00193023681640625 predicted value: tensor([[0.7891], [0.7188], [0.9648], [0.6641], [0.5039], [0.4727], [0.6016], [0.9766], 
[0.6719], [0.4023], [0.2080], [0.5664], [0.1865], [0.3945], [0.1309], [0.1162]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [1.0000], [0.7148], [0.5000], [0.5547], [0.6016], [1.0000], [0.7500], [0.4668], [0.2500], [0.7500], [0.2002], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018310546875 loss: 0.00154876708984375 loss: 0.00170135498046875 loss: 0.0014801025390625 predicted value: tensor([[0.6875], [0.3535], [0.8008], [0.9688], [0.7539], [0.4941], [0.9180], [0.1885], [0.5547], [0.5742], [0.1670], [0.3066], [0.4180], [0.3359], [0.1338], [0.1426]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.8320], [1.0000], [0.8008], [0.5547], [1.0000], [0.3340], [0.5000], [0.5000], [0.2500], [0.5000], [0.5000], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00299072265625 loss: 0.0025482177734375loss: 0.0019989013671875 loss: 0.00311279296875 predicted value: tensor([[ 0.3594], [ 0.3848], [ 0.7656], [ 0.6055], [ 0.5859], [ 0.9844], [ 0.4121], [ 0.5820], [ 0.6875], [ 0.7031], [ 0.2412], [ 0.6914], [ 0.3379], [ 0.2637], [-0.0094], [ 0.1680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [0.6680], [0.5547], [1.0000], [0.3750], [0.5000], [0.8008], [0.8008], [0.3340], [0.7500], [0.3340], [0.2500], [0.0400], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00099945068359375 loss: 0.0026092529296875 loss: 0.0025177001953125 loss: 0.003448486328125 61%|██████ | 299/492 [2:41:28<1:44:32, 32.50s/it] {'loss': 0.0091, 'learning_rate': 1e-05, 'epoch': 0.61} 61%|██████ | 299/492 [2:41:28<1:44:32, 32.50s/it]predicted value: tensor([[1.0156], [0.2051], [0.2061], [0.3750], [0.5664], [0.4609], [0.6758], [0.9766], [0.3438], [0.3438], [0.2012], [0.5547], [0.9883], [0.1719], [0.1064], [0.1089]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], 
[0.3340], [0.2500], [0.4668], [0.6016], [0.4648], [0.8008], [1.0000], [0.4668], [0.2500], [0.2500], [0.6016], [1.0000], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024566650390625 loss: 0.0017852783203125 loss: 0.00173187255859375 loss: 0.002044677734375 predicted value: tensor([[0.4180], [0.6523], [0.4668], [0.7461], [0.3691], [0.1846], [0.7461], [0.3320], [0.6133], [0.4766], [0.5820], [0.5000], [0.3848], [0.1885], [0.1328], [0.1318]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.6016], [0.8008], [0.4668], [0.2002], [0.8008], [0.3750], [0.6680], [0.6016], [0.7500], [0.6016], [0.5000], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026397705078125 loss: 0.0030670166015625 loss: 0.00101470947265625 loss: 0.0018463134765625 predicted value: tensor([[0.4023], [0.7344], [0.7383], [0.3340], [0.9805], [0.4395], [0.6445], [0.6211], [0.6250], [0.1436], [0.3535], [0.7031], [0.3379], [0.3672], [0.3418], [0.1108]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.8320], [0.3750], [1.0000], [0.4668], [0.6016], [0.6680], [0.8008], [0.2002], [0.3340], [0.7500], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00104522705078125 loss: 0.00162506103515625 loss: 0.001983642578125 loss: 0.00075531005859375 predicted value: tensor([[0.3730], [0.9531], [0.6875], [0.3750], [0.9727], [0.3809], [0.2002], [0.9648], [0.2021], [0.2295], [0.2324], [0.5547], [0.2598], [0.3516], [0.1553], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.6680], [0.4668], [1.0000], [0.4668], [0.2500], [1.0000], [0.3340], [0.2500], [0.3340], [0.5000], [0.2500], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011138916015625 loss: 0.001373291015625 loss: 0.0032806396484375 loss: 0.001220703125 61%|██████ | 300/492 [2:42:01<1:44:09, 
32.55s/it] {'loss': 0.0072, 'learning_rate': 1e-05, 'epoch': 0.61}
 61%|██████ | 300/492 [2:42:01<1:44:09, 32.55s/it]
Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41. Non-default generation parameters: {'max_length': 4096}
/vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. 
Gradients will be None warnings.warn( predicted value: tensor([[0.5312], [0.5117], [0.8477], [0.8672], [1.0156], [0.7500], [0.4609], [0.4688], [0.4629], [0.4473], [0.4902], [0.4219], [0.6797], [0.2314], [0.2324], [0.2119]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4004], [0.8320], [0.8008], [1.0000], [0.7500], [0.4668], [0.4668], [0.5703], [0.4668], [0.3340], [0.4004], [0.6016], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000782012939453125 loss: 0.000701904296875 loss: 0.0009918212890625 loss: 0.00140380859375 predicted value: tensor([[0.4746], [0.5156], [0.7930], [0.8008], [0.7930], [0.7891], [1.0156], [0.2812], [0.4727], [0.4785], [0.7344], [0.2500], [1.0469], [0.2402], [0.1826], [0.2334]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.8008], [0.8008], [0.8008], [0.6680], [1.0000], [0.2002], [0.4668], [0.4668], [0.7500], [0.4004], [1.0000], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00103759765625loss: 0.00225830078125 loss: 0.00250244140625 loss: 0.000919342041015625 predicted value: tensor([[0.4785], [0.6367], [1.0781], [0.2080], [0.8438], [0.7148], [0.2490], [0.5703], [0.7734], [1.0469], [0.6914], [0.5117], [0.3320], [0.4434], [0.2559], [0.2441]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5547], [1.0000], [0.2002], [0.8320], [0.6016], [0.2500], [0.5000], [0.6680], [1.0000], [0.6016], [0.5000], [0.0400], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000492095947265625 loss: 0.0024566650390625 loss: 0.0015716552734375 loss: 0.00164031982421875 predicted value: tensor([[0.4570], [0.6016], [1.0469], [0.2891], [0.2041], [0.4609], [1.0469], [1.0625], [1.0469], [0.2695], [0.3477], [0.5742], [1.0391], [0.2676], [0.2617], [0.2383]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [1.0000], [0.3340], 
[0.3340], [0.3750], [1.0000], [1.0000], [1.0000], [0.2500], [0.2500], [0.6016], [1.0000], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0036468505859375 loss: 0.0029754638671875 loss: 0.001190185546875 61%|██████ | 301/492 [2:44:02<3:07:53, 59.02s/it]loss: 0.0023651123046875 {'loss': 0.0067, 'learning_rate': 1e-05, 'epoch': 0.61} 61%|██████ | 301/492 [2:44:02<3:07:53, 59.02s/it]predicted value: tensor([[0.5781], [0.4590], [0.7969], [0.4609], [0.3086], [0.7227], [0.6562], [0.6797], [0.3027], [0.6797], [0.7461], [0.4258], [0.3867], [0.4043], [0.2734], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5312], [0.3750], [0.8320], [0.4668], [0.3340], [0.6680], [0.7500], [0.6016], [0.0400], [0.7500], [0.8008], [0.4004], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00072479248046875 loss: 0.0019989013671875 loss: 0.0017547607421875 loss: 0.0008087158203125 predicted value: tensor([[0.4648], [0.6680], [0.6719], [0.5078], [0.7773], [0.6758], [0.7227], [0.5703], [0.7773], [1.0156], [0.7031], [0.2930], [0.3848], [0.2559], [0.2031], [0.2178]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.7500], [0.4668], [0.8320], [0.6680], [0.6680], [0.6016], [0.8008], [1.0000], [0.7500], [0.3340], [0.3340], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000782012939453125 loss: 0.00116729736328125 loss: 0.001983642578125 loss: 0.000762939453125 predicted value: tensor([[1.0469], [0.6133], [0.6211], [0.2637], [1.0312], [1.0234], [0.4902], [0.5273], [0.6719], [0.6836], [0.5352], [0.5117], [0.5508], [0.2031], [0.2412], [0.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.6016], [0.2500], [1.0000], [1.0000], [0.4668], [0.4668], [0.6016], [0.7500], [0.5000], [0.5000], [0.4277], [0.0400], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.00151824951171875 loss: 0.00101470947265625loss: 0.001495361328125 loss: 0.00067138671875 predicted value: tensor([[0.4863], [0.5820], [0.7617], [0.2393], [0.5469], [0.4590], [0.7500], [0.4355], [0.4727], [0.3984], [0.4512], [0.4375], [0.4160], [0.4219], [0.2578], [0.3848]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.6172], [0.8320], [0.2500], [0.5547], [0.4668], [0.6680], [0.3750], [0.5000], [0.4004], [0.4004], [0.4004], [0.5000], [0.2500], [0.1426], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000579833984375 loss: 0.00225830078125 loss: 0.0016326904296875 loss: 0.00135040283203125 61%|██████▏ | 302/492 [2:44:34<2:41:57, 51.14s/it] {'loss': 0.0051, 'learning_rate': 1e-05, 'epoch': 0.61} 61%|██████▏ | 302/492 [2:44:34<2:41:57, 51.14s/it]predicted value: tensor([[0.4590], [0.1846], [0.1602], [0.7578], [0.9180], [0.6406], [0.5469], [0.2559], [0.2598], [0.8984], [0.3789], [0.2490], [0.5391], [0.3535], [0.2041], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3340], [0.2002], [0.8320], [1.0000], [0.8008], [0.4668], [0.2500], [0.2002], [1.0000], [0.4004], [0.2500], [0.5000], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032806396484375 loss: 0.00146484375 loss: 0.00131988525390625 loss: 0.002685546875 predicted value: tensor([[0.3789], [0.9297], [0.3770], [0.9453], [0.4023], [0.2305], [0.3848], [0.1650], [0.3105], [0.6406], [0.3535], [0.5859], [0.4160], [0.3184], [0.1631], [0.1221]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [1.0000], [0.3750], [0.2500], [0.4668], [0.1670], [0.1670], [0.6016], [0.4004], [0.6016], [0.6016], [0.3340], [0.1670], [0.0625]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002716064453125 loss: 0.0013580322265625 loss: 0.001190185546875 loss: 0.0019989013671875 predicted value: tensor([[0.3926], [0.3691], [0.6680], [0.7656], [0.6797], [0.3535], [0.2734], 
[0.1533], [0.2500], [0.7656], [0.8750], [0.2188], [0.3262], [0.3105], [0.1387], [0.1494]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.6680], [0.8320], [0.6680], [0.4668], [0.6016], [0.2500], [0.2002], [0.8320], [1.0000], [0.5000], [0.3340], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004547119140625 loss: 0.00139617919921875 loss: 0.004119873046875 loss: 0.0030059814453125 predicted value: tensor([[0.9648], [0.3906], [0.6641], [0.6953], [0.2363], [0.4922], [0.3594], [0.4141], [0.1465], [0.4531], [0.4727], [0.6602], [0.3867], [0.4238], [0.1963], [0.1826]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [0.6680], [0.2500], [0.5547], [0.4668], [0.4668], [0.0625], [0.4004], [0.4668], [0.6680], [0.5000], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0022125244140625 loss: 0.0009918212890625 loss: 0.001220703125 loss: 0.00238037109375 62%|██████▏ | 303/492 [2:45:07<2:23:44, 45.63s/it] {'loss': 0.009, 'learning_rate': 1e-05, 'epoch': 0.62} 62%|██████▏ | 303/492 [2:45:07<2:23:44, 45.63s/it]predicted value: tensor([[0.7578], [0.4219], [0.3691], [0.1768], [0.5938], [0.9219], [0.5430], [0.2041], [0.2871], [0.0972], [0.9062], [0.3438], [0.3848], [0.1670], [0.1914], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [0.4668], [0.4668], [0.2002], [0.6016], [1.0000], [0.6016], [0.3340], [0.2500], [0.0625], [1.0000], [0.5000], [0.4004], [0.1670], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00109100341796875 loss: 0.00116729736328125loss: 0.004791259765625 loss: 0.0011749267578125 predicted value: tensor([[0.4219], [0.6250], [0.9414], [0.4180], [0.7656], [0.9258], [0.2715], [0.5352], [0.4199], [0.3398], [0.3867], [0.3457], [0.2871], [0.3535], [0.1963], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], 
[1.0000], [0.3750], [0.8320], [1.0000], [0.2500], [0.6016], [0.2715], [0.4004], [0.5000], [0.5000], [0.4004], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000823974609375 loss: 0.0020751953125loss: 0.002197265625 loss: 0.00099945068359375 predicted value: tensor([[0.4160], [0.4922], [0.1904], [0.7461], [0.7969], [0.4102], [0.5820], [0.5312], [0.6367], [0.4902], [0.9258], [0.3770], [0.2578], [0.2207], [0.1748], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4648], [0.2500], [0.8320], [0.8320], [0.4668], [0.6016], [0.8008], [0.6680], [0.6016], [1.0000], [0.3340], [0.5000], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002716064453125loss: 0.001708984375 loss: 0.005706787109375 loss: 0.00173187255859375 predicted value: tensor([[0.7344], [0.3945], [0.4004], [0.6523], [0.6992], [0.4062], [0.2910], [0.3711], [0.4082], [0.3711], [0.3926], [0.5742], [0.3359], [0.5938], [0.1924], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.3750], [0.8008], [0.8008], [0.4668], [0.2002], [0.2715], [0.5000], [0.5000], [0.4004], [0.7500], [0.4004], [0.7500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00122833251953125 loss: 0.0029449462890625 loss: 0.0025482177734375 loss: 0.000782012939453125 62%|██████▏ | 304/492 [2:45:40<2:10:51, 41.76s/it] {'loss': 0.0084, 'learning_rate': 1e-05, 'epoch': 0.62} 62%|██████▏ | 304/492 [2:45:40<2:10:51, 41.76s/it]predicted value: tensor([[0.5664], [1.0234], [0.9844], [0.5430], [0.2910], [0.9844], [0.6875], [0.6406], [0.3340], [0.9883], [0.4824], [0.3867], [0.3945], [0.2793], [0.2275], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.4668], [0.3340], [1.0000], [0.6016], [0.7500], [0.2500], [1.0000], [0.6016], [0.4004], [0.4004], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.00099945068359375 loss: 0.00093841552734375loss: 0.000530242919921875 loss: 0.00122833251953125 predicted value: tensor([[0.4805], [0.4922], [0.7930], [1.0000], [0.2393], [0.6797], [1.0391], [0.6562], [0.4121], [0.7188], [0.5547], [0.4844], [0.3984], [0.2354], [0.2910], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.6680], [1.0000], [0.2002], [0.7500], [1.0000], [0.5000], [0.3340], [0.7500], [0.6016], [0.5000], [0.4004], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000560760498046875 loss: 0.0012054443359375 loss: 0.00147247314453125 loss: 0.00067901611328125 predicted value: tensor([[0.2754], [0.9688], [0.4238], [0.2871], [0.5195], [0.6133], [0.5391], [0.2793], [0.3047], [0.6367], [0.5352], [0.4824], [0.2656], [0.4395], [0.2520], [0.2412]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [1.0000], [0.3750], [0.2002], [0.4668], [0.6016], [0.5000], [0.2500], [0.2500], [0.6016], [0.5000], [0.4004], [0.2002], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018310546875 loss: 0.000736236572265625loss: 0.0016326904296875 loss: 0.0011749267578125 predicted value: tensor([[1.0312], [0.4316], [0.4922], [1.0078], [0.3652], [0.4785], [0.5938], [0.4902], [0.5781], [0.4004], [0.4277], [0.4043], [0.4180], [0.2559], [0.2598], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [1.0000], [0.3340], [0.4668], [0.5000], [0.4668], [0.8008], [0.5000], [0.4004], [0.2500], [0.4004], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009613037109375 loss: 0.00093841552734375 loss: 0.0015869140625 loss: 0.0029144287109375 62%|██████▏ | 305/492 [2:46:12<2:01:31, 38.99s/it] {'loss': 0.0048, 'learning_rate': 1e-05, 'epoch': 0.62} 62%|██████▏ | 305/492 [2:46:12<2:01:31, 38.99s/it]predicted value: tensor([[0.5078], [0.9844], [0.7656], [0.6328], [0.4121], 
[0.4941], [0.9766], [0.5352], [0.6719], [0.9766], [0.7539], [0.4355], [0.5000], [0.4922], [0.2539], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8320], [0.6016], [0.4668], [0.4668], [1.0000], [0.4668], [0.7500], [1.0000], [0.7500], [0.3340], [0.5000], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006103515625 loss: 0.000701904296875 loss: 0.000507354736328125 loss: 0.00165557861328125 predicted value: tensor([[1.0312], [0.4785], [1.0078], [1.0000], [0.5195], [0.7461], [0.6875], [0.5430], [0.6172], [0.6172], [0.5586], [0.4961], [0.4453], [0.4668], [0.3730], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [1.0000], [1.0000], [0.4668], [0.6680], [0.7500], [0.5000], [0.7500], [0.6016], [0.5000], [0.4668], [0.3340], [0.4004], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00250244140625 loss: 0.0024261474609375 loss: 0.001434326171875 loss: 0.0013885498046875 predicted value: tensor([[0.4688], [0.4629], [0.7656], [1.0234], [0.4785], [0.5000], [0.3320], [0.4434], [0.4922], [0.7383], [0.9883], [0.6797], [0.9766], [0.3770], [0.4180], [0.2480]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3750], [0.8008], [1.0000], [0.4668], [0.4668], [0.3340], [0.3145], [0.4668], [0.7500], [1.0000], [0.6016], [1.0000], [0.5000], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00096893310546875 loss: 0.0009002685546875 loss: 0.00092315673828125 loss: 0.00115203857421875 predicted value: tensor([[0.4766], [0.5742], [0.7930], [0.4961], [0.9922], [0.3301], [0.5625], [0.9688], [0.2734], [0.5078], [0.2061], [0.5547], [0.6016], [0.5859], [0.2637], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8320], [0.4668], [1.0000], [0.2002], [0.5547], [1.0000], [0.3340], [0.4668], [0.0625], [0.5000], [0.7500], [0.6016], [0.2002], 
[0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032196044921875 loss: 0.0012969970703125 loss: 0.00130462646484375 loss: 0.000682830810546875 62%|██████▏ | 306/492 [2:46:45<1:54:47, 37.03s/it] {'loss': 0.0068, 'learning_rate': 1e-05, 'epoch': 0.62} 62%|██████▏ | 306/492 [2:46:45<1:54:47, 37.03s/it]predicted value: tensor([[0.4043], [0.7461], [0.7344], [0.3945], [0.2090], [0.1963], [0.5039], [0.2217], [0.3867], [0.5508], [0.4453], [0.1001], [0.3184], [0.2754], [0.3535], [0.1221]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.8008], [0.3750], [0.3340], [0.2500], [0.4668], [0.3340], [0.6016], [0.6016], [0.4668], [0.0278], [0.4004], [0.2002], [0.5000], [0.0669]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0048828125 loss: 0.00213623046875loss: 0.0012359619140625 loss: 0.0015411376953125 predicted value: tensor([[0.4941], [0.4102], [0.4902], [0.4180], [0.3945], [0.4609], [0.1572], [0.5039], [0.6055], [0.9258], [0.4570], [0.4023], [0.4375], [0.2051], [0.1494], [0.1689]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4648], [0.4668], [0.3750], [0.5000], [0.2500], [0.5000], [0.7500], [1.0000], [0.4668], [0.4004], [0.4668], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00118255615234375 loss: 0.0023651123046875 loss: 0.000888824462890625 loss: 0.0009765625 predicted value: tensor([[0.4922], [0.9297], [0.4141], [0.2383], [0.3984], [0.4375], [0.2041], [0.3984], [0.3027], [0.3672], [0.5547], [0.3457], [0.6367], [0.1602], [0.1367], [0.1445]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.3340], [0.4668], [0.4668], [0.2500], [0.4668], [0.2500], [0.4668], [0.7500], [0.5000], [0.7500], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00119781494140625 loss: 0.0020751953125 loss: 0.0029144287109375 loss: 0.0012359619140625 predicted value: tensor([[0.3652], 
[0.9648], [0.7031], [0.5117], [0.3848], [0.9688], [0.1787], [0.6562], [0.2129], [0.9375], [0.4980], [0.3066], [0.3730], [0.1475], [0.4844], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.8008], [0.5547], [0.3750], [1.0000], [0.2500], [0.8008], [0.2500], [1.0000], [0.7500], [0.4004], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00156402587890625 loss: 0.006072998046875 loss: 0.0027618408203125 loss: 0.0011444091796875 62%|██████▏ | 307/492 [2:47:17<1:49:34, 35.54s/it] {'loss': 0.0085, 'learning_rate': 1e-05, 'epoch': 0.62} 62%|██████▏ | 307/492 [2:47:17<1:49:34, 35.54s/it]predicted value: tensor([[0.4395], [0.7031], [0.3926], [0.9453], [0.4160], [0.2207], [0.6758], [0.6641], [0.6133], [0.4121], [0.3887], [0.2949], [0.3379], [0.0845], [0.1484], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.4668], [1.0000], [0.4668], [0.2500], [0.6680], [0.6680], [0.5000], [0.4004], [0.4004], [0.3340], [0.5000], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000858306884765625 loss: 0.001068115234375 loss: 0.000946044921875 loss: 0.000789642333984375 predicted value: tensor([[0.8438], [0.2178], [0.7930], [0.4023], [0.7617], [0.4824], [0.9492], [0.6836], [0.7031], [0.2119], [0.5625], [0.1055], [0.3301], [0.1689], [0.1660], [0.1377]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [0.8320], [0.4668], [0.8008], [0.8008], [1.0000], [0.7500], [0.6680], [0.2002], [0.6016], [0.2002], [0.4004], [0.1670], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002105712890625loss: 0.00128173828125 loss: 0.000732421875 loss: 0.0010528564453125 predicted value: tensor([[0.4004], [0.4297], [0.2637], [0.3945], [0.9805], [0.4941], [0.9570], [0.6094], [0.4785], [0.5156], [0.5117], [0.5781], [0.4531], [0.3926], [0.1387], [0.1533]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.2500], [0.4668], [1.0000], [0.6016], [1.0000], [0.6016], [0.6680], [0.5000], [0.6016], [0.6016], [0.5000], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023345947265625 loss: 0.00146484375 loss: 0.0006256103515625 loss: 0.00360107421875 predicted value: tensor([[0.4941], [0.9883], [0.4180], [0.9844], [0.5312], [0.9766], [0.5469], [0.2461], [0.4277], [0.6953], [0.2285], [0.5859], [0.5625], [0.4238], [0.1914], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [1.0000], [0.4668], [1.0000], [0.8008], [1.0000], [0.5547], [0.2500], [0.6016], [0.4668], [0.2500], [0.6016], [0.7500], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0031280517578125 loss: 0.00168609619140625 loss: 0.000690460205078125 loss: 0.0030975341796875 63%|██████▎ | 308/492 [2:47:49<1:45:59, 34.56s/it] {'loss': 0.0064, 'learning_rate': 1e-05, 'epoch': 0.63} 63%|██████▎ | 308/492 [2:47:49<1:45:59, 34.56s/it]predicted value: tensor([[0.6367], [0.4531], [0.5820], [0.5430], [1.0391], [1.0391], [0.5820], [0.8555], [0.3086], [0.3086], [0.5391], [1.0391], [0.6484], [0.4141], [0.2480], [0.2441]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.5547], [0.8008], [1.0000], [1.0000], [0.5000], [0.8008], [0.2500], [0.3340], [0.8008], [1.0000], [0.7500], [0.3340], [0.2500], [0.1250]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001922607421875 loss: 0.00299072265625loss: 0.001922607421875 loss: 0.00131988525390625 predicted value: tensor([[0.8555], [0.2793], [0.5000], [1.0312], [1.0703], [0.8359], [0.4277], [0.7422], [0.5781], [0.6250], [0.6758], [0.6562], [0.5078], [0.3555], [0.2578], [0.2305]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.3340], [0.4668], [1.0000], [1.0000], [0.8008], [0.2002], [0.7500], [0.6016], [0.6016], [0.5000], [0.6016], [0.4004], 
[0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00096893310546875 loss: 0.002227783203125 loss: 0.00079345703125 loss: 0.00115966796875 predicted value: tensor([[0.5000], [0.6250], [0.4746], [0.7031], [0.7461], [1.0703], [0.7891], [0.7266], [0.8438], [0.5352], [0.6562], [0.7070], [0.4395], [0.1973], [0.2197], [0.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.4668], [0.8008], [0.7500], [1.0000], [0.8008], [0.7500], [0.8008], [0.5000], [0.6016], [0.6016], [0.4004], [0.2500], [0.2002], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010223388671875 loss: 0.00098419189453125loss: 0.0020751953125 loss: 0.0010528564453125 predicted value: tensor([[1.0625], [0.8281], [1.0547], [0.2930], [0.5039], [0.7305], [1.0547], [0.7070], [0.4961], [0.4902], [0.2432], [0.3359], [1.0391], [0.5703], [0.2070], [0.2295]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [1.0000], [0.2500], [0.4668], [0.6680], [1.0000], [0.7500], [0.4668], [0.4668], [0.2500], [0.2500], [1.0000], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0014495849609375 loss: 0.000560760498046875loss: 0.0016021728515625 loss: 0.00075531005859375 63%|██████▎ | 309/492 [2:48:22<1:43:27, 33.92s/it] {'loss': 0.0057, 'learning_rate': 1e-05, 'epoch': 0.63} 63%|██████▎ | 309/492 [2:48:22<1:43:27, 33.92s/it]predicted value: tensor([[0.7344], [0.2656], [0.7031], [0.4746], [0.4746], [1.0391], [0.4805], [0.5391], [1.0625], [0.4863], [1.0156], [0.4629], [0.4707], [0.2363], [0.1992], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.2002], [0.8008], [0.3750], [0.4668], [1.0000], [0.3750], [0.4668], [1.0000], [0.5000], [1.0000], [0.4004], [0.3340], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0014495849609375 loss: 0.0010528564453125 loss: 0.0013427734375 loss: 0.00096893310546875 predicted 
value: tensor([[0.7969], [0.4980], [0.3281], [0.3066], [0.5352], [1.0781], [0.7930], [0.8008], [0.2471], [0.8047], [1.0312], [0.5625], [0.4805], [0.2461], [0.2334], [0.2334]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.3750], [0.3340], [0.2002], [0.4668], [1.0000], [0.8008], [0.8008], [0.2500], [0.8008], [1.0000], [0.4004], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00136566162109375 loss: 0.00138092041015625 loss: 0.001251220703125 loss: 0.00165557861328125 predicted value: tensor([[0.5000], [0.2617], [1.0859], [0.5352], [1.0547], [0.5625], [0.3223], [0.3203], [0.3613], [0.6797], [0.6680], [0.4688], [0.4863], [0.2578], [0.2734], [0.2012]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [1.0000], [0.4668], [1.0000], [0.4668], [0.2500], [0.3340], [0.2500], [0.6016], [0.6016], [0.5000], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00189971923828125 loss: 0.00110626220703125 loss: 0.00439453125 loss: 0.0023651123046875 predicted value: tensor([[0.5234], [0.8945], [0.4883], [0.4512], [1.0312], [0.3359], [0.8594], [0.4785], [0.7617], [0.8242], [0.5977], [0.4688], [0.7148], [0.4316], [0.2080], [0.2246]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.3750], [0.4668], [1.0000], [0.3340], [0.8008], [0.4668], [0.6680], [0.8008], [0.5000], [0.4004], [0.6016], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00113677978515625 loss: 0.00191497802734375 loss: 0.000576019287109375 loss: 0.00106048583984375 63%|██████▎ | 310/492 [2:48:54<1:41:21, 33.41s/it] {'loss': 0.0062, 'learning_rate': 1e-05, 'epoch': 0.63} 63%|██████▎ | 310/492 [2:48:54<1:41:21, 33.41s/it]predicted value: tensor([[0.9414], [0.7188], [0.3652], [0.6406], [0.6094], [0.1582], [0.3613], [0.2080], [0.9727], [0.7656], [0.9727], [0.3672], [0.3809], [0.3613], [0.1680], 
[0.1455]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.3750], [0.6016], [0.8008], [0.2500], [0.4668], [0.2500], [1.0000], [0.8320], [1.0000], [0.4004], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003204345703125 loss: 0.0015716552734375loss: 0.002288818359375 loss: 0.002288818359375 predicted value: tensor([[0.9883], [0.3906], [0.3711], [0.1758], [0.4590], [0.7070], [0.6055], [0.3887], [0.3984], [0.5234], [0.3613], [0.5703], [0.9492], [0.4258], [0.1465], [0.1357]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3750], [0.2500], [0.4668], [0.8320], [0.5000], [0.4668], [0.4668], [0.5000], [0.4004], [0.6016], [1.0000], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00176239013671875 loss: 0.0012359619140625 loss: 0.000888824462890625 loss: 0.0014495849609375 predicted value: tensor([[0.5000], [0.3926], [0.5273], [0.5273], [0.1963], [0.3730], [0.1680], [0.1348], [0.6875], [0.9492], [0.6367], [0.3906], [0.2617], [0.3789], [0.1348], [0.4238]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.5547], [0.5547], [0.2500], [0.4668], [0.2002], [0.2500], [0.6680], [1.0000], [0.7500], [0.5000], [0.3340], [0.3340], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037384033203125 loss: 0.00115203857421875loss: 0.001190185546875 loss: 0.0029754638671875 predicted value: tensor([[0.3574], [0.9688], [0.9570], [0.6797], [0.3848], [0.2188], [0.9844], [0.1572], [0.1777], [0.5859], [0.7070], [0.4629], [0.3477], [0.3984], [0.3730], [0.1504]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [1.0000], [0.8008], [0.4668], [0.2002], [1.0000], [0.3340], [0.2500], [0.5703], [0.8008], [0.5000], [0.4004], [0.4004], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00142669677734375 loss: 0.0036163330078125 
loss: 0.00135040283203125 loss: 0.00067138671875 63%|██████▎ | 311/492 [2:49:25<1:39:13, 32.89s/it] {'loss': 0.0077, 'learning_rate': 1e-05, 'epoch': 0.63} 63%|██████▎ | 311/492 [2:49:25<1:39:13, 32.89s/it]predicted value: tensor([[0.2021], [0.4023], [0.7148], [0.5078], [0.7422], [0.7617], [0.2246], [0.9609], [0.5859], [0.5664], [0.4199], [0.2695], [0.7031], [0.4707], [0.1816], [0.1377]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.4668], [0.8008], [0.5547], [0.8008], [0.8008], [0.3340], [1.0000], [0.6016], [0.6016], [0.4004], [0.7500], [0.7500], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00150299072265625 loss: 0.0045166015625 loss: 0.004425048828125 loss: 0.0007781982421875 predicted value: tensor([[0.7852], [0.5547], [0.4980], [0.3848], [0.3574], [0.3770], [0.1875], [0.1963], [0.6289], [0.7695], [0.6953], [0.3965], [0.3633], [0.4004], [0.1338], [0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.8008], [0.3750], [0.3750], [0.4668], [0.2500], [0.2500], [0.6016], [0.8320], [0.7500], [0.3340], [0.4004], [0.2852], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00104522705078125 loss: 0.0023345947265625loss: 0.00070953369140625 loss: 0.000728607177734375 predicted value: tensor([[0.8203], [0.5117], [0.4219], [0.9766], [0.6680], [0.9336], [0.5469], [0.6211], [0.2158], [0.3711], [0.3867], [0.4375], [0.4609], [0.1826], [0.2041], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.4668], [1.0000], [0.6680], [1.0000], [0.6016], [0.7500], [0.3340], [0.4004], [0.4004], [0.5000], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000946044921875 loss: 0.00079345703125loss: 0.0036163330078125 loss: 0.000701904296875 predicted value: tensor([[0.5273], [0.9688], [0.4297], [0.7539], [0.6953], [0.5391], [0.9531], [0.4336], [0.3906], [0.6250], 
[0.4375], [0.4375], [0.3770], [0.1494], [0.4531], [0.1611]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3750], [0.8008], [0.7500], [0.6016], [1.0000], [0.4668], [0.4277], [0.6016], [0.5000], [0.5000], [0.4004], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00052642822265625 loss: 0.0017242431640625 loss: 0.00177001953125 loss: 0.00074005126953125 63%|██████▎ | 312/492 [2:49:56<1:36:51, 32.29s/it] {'loss': 0.0067, 'learning_rate': 1e-05, 'epoch': 0.63} 63%|██████▎ | 312/492 [2:49:56<1:36:51, 32.29s/it]predicted value: tensor([[0.5938], [1.0547], [0.4629], [0.4258], [0.3750], [1.0156], [0.8203], [0.5352], [1.0469], [0.4492], [0.4688], [0.2559], [0.5234], [0.4219], [0.2266], [0.2246]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5312], [1.0000], [0.3750], [0.3750], [0.2500], [1.0000], [0.8008], [0.5000], [1.0000], [0.4004], [0.3340], [0.2500], [0.5000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00128936767578125 loss: 0.0021820068359375 loss: 0.0009765625 loss: 0.000659942626953125 predicted value: tensor([[0.6055], [0.5039], [0.4883], [0.4199], [0.5859], [0.4863], [0.6758], [0.3574], [0.8242], [0.5664], [0.4395], [0.7266], [0.5156], [0.2480], [0.2305], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.3750], [0.5547], [0.4668], [0.5000], [0.3340], [0.8008], [0.4277], [0.2500], [0.7500], [0.5000], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00341796875 loss: 0.0015106201171875 loss: 0.0013427734375 loss: 0.0013275146484375 predicted value: tensor([[0.4941], [0.4590], [1.0391], [0.8047], [1.0312], [0.5586], [1.0469], [0.6641], [0.3086], [0.5273], [0.3184], [0.3008], [0.5820], [0.0864], [0.4863], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.8320], [1.0000], 
[0.3750], [1.0000], [0.6016], [0.3340], [0.4277], [0.2500], [0.2002], [0.5000], [0.0400], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030517578125 loss: 0.00138092041015625 loss: 0.00274658203125 loss: 0.00048828125 predicted value: tensor([[0.5586], [1.0156], [1.0625], [0.4727], [0.7930], [0.6406], [0.7812], [0.3223], [1.0312], [1.0312], [0.6680], [0.3359], [0.3770], [0.4414], [0.2754], [0.2158]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [1.0000], [1.0000], [0.4668], [0.8008], [0.8008], [0.7500], [0.3340], [1.0000], [1.0000], [0.3340], [0.3340], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020904541015625 loss: 0.00115203857421875 loss: 0.0026702880859375 loss: 0.0038909912109375 64%|██████▎ | 313/492 [2:50:27<1:35:17, 31.94s/it] {'loss': 0.0075, 'learning_rate': 1e-05, 'epoch': 0.64} 64%|██████▎ | 313/492 [2:50:27<1:35:17, 31.94s/it]predicted value: tensor([[0.6680], [0.7188], [0.4766], [1.0234], [1.0469], [0.7539], [0.7305], [0.6680], [0.6406], [1.0234], [0.7109], [1.0234], [0.3945], [0.4160], [0.2100], [0.2490]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.6680], [0.4668], [1.0000], [1.0000], [0.8008], [0.7500], [0.6016], [0.6016], [1.0000], [0.6016], [1.0000], [0.4004], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00107574462890625 loss: 0.000492095947265625loss: 0.0009765625 loss: 0.0010528564453125 predicted value: tensor([[0.3105], [0.3613], [1.0312], [0.4590], [0.2617], [0.7539], [0.3691], [0.4414], [1.0156], [1.0156], [0.4824], [0.4941], [0.4512], [0.2559], [0.2832], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.2500], [1.0000], [0.3750], [0.2500], [0.6680], [0.2500], [0.3340], [1.0000], [1.0000], [0.4004], [0.4004], [0.5000], [0.2500], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013275146484375 loss: 
0.00147247314453125 loss: 0.003143310546875 loss: 0.000698089599609375 predicted value: tensor([[0.4648], [1.0234], [0.5000], [0.2988], [0.4570], [0.2637], [0.5664], [0.5898], [0.2988], [0.4570], [0.7227], [0.0908], [0.4648], [0.4258], [0.2158], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.2500], [0.4668], [0.2500], [0.5547], [0.5000], [0.3340], [0.3750], [0.8008], [0.0625], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016021728515625 loss: 0.000637054443359375 loss: 0.00119781494140625 loss: 0.000797271728515625 predicted value: tensor([[1.0703], [1.0391], [0.7695], [1.0312], [0.7109], [0.6445], [0.7617], [0.7656], [0.6406], [0.7539], [0.4629], [0.4316], [0.5391], [0.4629], [0.2617], [0.2236]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.8008], [1.0000], [0.6680], [0.6016], [0.6680], [0.8008], [0.5000], [0.7500], [0.4004], [0.4004], [0.4004], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00110626220703125 loss: 0.000705718994140625 loss: 0.00128936767578125 loss: 0.0031585693359375 64%|██████▍ | 314/492 [2:50:59<1:34:14, 31.77s/it] {'loss': 0.0052, 'learning_rate': 1e-05, 'epoch': 0.64} 64%|██████▍ | 314/492 [2:50:59<1:34:14, 31.77s/it]predicted value: tensor([[0.4980], [0.4082], [0.6914], [0.9570], [0.5391], [0.3691], [0.6562], [0.9648], [0.5898], [0.1855], [0.5312], [0.3770], [0.9492], [0.1758], [0.4082], [0.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8008], [1.0000], [0.3340], [0.3750], [0.8008], [1.0000], [0.6016], [0.2500], [0.7500], [0.4004], [1.0000], [0.2002], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00106048583984375 loss: 0.000560760498046875 loss: 0.00238037109375 loss: 0.00145721435546875 predicted value: tensor([[0.9688], [0.7461], [0.3535], [0.7695], [0.4766], [0.6016], 
[0.7734], [0.9648], [0.9453], [0.7266], [0.4004], [0.5977], [0.5078], [0.4844], [0.1523], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.7148], [0.4668], [0.8008], [0.5000], [0.5547], [0.8320], [1.0000], [1.0000], [0.3750], [0.4004], [0.6016], [0.5000], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115966796875 loss: 0.002593994140625loss: 0.00112152099609375 loss: 0.00081634521484375 predicted value: tensor([[0.3887], [0.3652], [0.4473], [0.7031], [0.9609], [0.4023], [0.6289], [0.3789], [0.4141], [0.5469], [0.4297], [0.5391], [0.5820], [0.4688], [0.1650], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.4668], [0.8008], [1.0000], [0.3750], [0.6016], [0.3750], [0.3750], [0.6016], [0.3750], [0.6016], [0.5000], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00099945068359375 loss: 0.0008087158203125 loss: 0.000644683837890625 loss: 0.00112152099609375 predicted value: tensor([[0.4004], [0.5000], [0.6836], [0.9727], [0.9570], [0.9570], [0.4395], [0.2441], [0.4590], [0.2178], [0.4492], [0.6523], [0.2100], [0.3906], [0.1738], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8008], [1.0000], [1.0000], [1.0000], [0.3750], [0.2500], [0.4277], [0.2500], [0.6680], [0.7500], [0.2500], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00130462646484375 loss: 0.000713348388671875 loss: 0.00171661376953125 loss: 0.000926971435546875 64%|██████▍ | 315/492 [2:51:30<1:33:37, 31.74s/it] {'loss': 0.0048, 'learning_rate': 1e-05, 'epoch': 0.64} 64%|██████▍ | 315/492 [2:51:30<1:33:37, 31.74s/it]predicted value: tensor([[0.3965], [0.8008], [0.7461], [0.9648], [0.2432], [0.6758], [0.4355], [0.2715], [0.2578], [0.2471], [0.7227], [0.4219], [0.3184], [0.1904], [0.3848], [0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.4668], [0.8320], [0.8320], [1.0000], [0.2002], [0.6680], [0.3750], [0.3340], [0.3340], [0.2500], [0.8008], [0.4004], [0.4004], [0.2500], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012664794921875 loss: 0.000789642333984375 loss: 0.000850677490234375 loss: 0.0013275146484375 predicted value: tensor([[0.4297], [0.9648], [0.6875], [0.2617], [0.6523], [0.3750], [0.4238], [0.6367], [0.2051], [0.1406], [0.3379], [0.3828], [0.5430], [0.1982], [0.1846], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.6680], [0.2500], [0.7500], [0.2500], [0.4668], [0.6016], [0.2002], [0.0625], [0.4004], [0.4004], [0.7500], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000873565673828125 loss: 0.00136566162109375 loss: 0.0010528564453125 loss: 0.00142669677734375 predicted value: tensor([[0.7188], [0.9883], [0.9883], [0.4023], [0.9453], [0.3789], [0.7188], [0.9570], [0.3672], [0.5664], [0.1934], [0.3535], [0.3906], [0.1836], [0.4023], [0.1465]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [1.0000], [0.4668], [1.0000], [0.6016], [0.8008], [1.0000], [0.5000], [0.6016], [0.0400], [0.3340], [0.4004], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00102996826171875 loss: 0.0019989013671875loss: 0.0009307861328125 loss: 0.000797271728515625 predicted value: tensor([[0.6406], [0.2344], [0.5508], [0.5195], [0.2207], [0.6758], [0.2832], [0.4902], [0.9648], [0.3633], [0.9414], [0.3848], [0.2949], [0.3574], [0.1523], [0.1377]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.2002], [0.6172], [0.5547], [0.3340], [0.8008], [0.2500], [0.6016], [1.0000], [0.4004], [1.0000], [0.4004], [0.3340], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000942230224609375 loss: 0.00152587890625 loss: 0.0011749267578125 loss: 0.000644683837890625 64%|██████▍ 
| 316/492 [2:52:02<1:32:41, 31.60s/it] {'loss': 0.0045, 'learning_rate': 1e-05, 'epoch': 0.64} 64%|██████▍ | 316/492 [2:52:02<1:32:41, 31.60s/it]predicted value: tensor([[0.5078], [0.3125], [1.0469], [0.3086], [1.0703], [0.2832], [0.7109], [0.4902], [0.6172], [1.0547], [1.0234], [0.4414], [0.7070], [0.4336], [0.2334], [0.2168]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [1.0000], [0.2500], [1.0000], [0.2500], [0.3750], [0.3750], [0.2500], [1.0000], [1.0000], [0.4004], [0.7500], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003753662109375 loss: 0.00469970703125 loss: 0.001800537109375 loss: 0.00162506103515625 predicted value: tensor([[1.0469], [0.2695], [0.4609], [0.4336], [0.4707], [0.6758], [0.2852], [0.3281], [0.3828], [1.0625], [0.6602], [0.6445], [0.4160], [0.4863], [0.2539], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [0.3750], [0.3750], [0.3750], [0.6016], [0.2500], [0.2002], [0.1670], [1.0000], [0.7500], [0.6016], [0.2500], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115966796875 loss: 0.00109100341796875 loss: 0.0025787353515625 loss: 0.000896453857421875 predicted value: tensor([[0.5000], [0.8672], [0.7500], [0.6562], [0.5273], [0.6797], [0.5195], [0.6953], [0.4355], [0.6133], [0.4395], [0.4453], [0.6680], [0.4316], [0.2363], [0.2246]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8555], [0.8008], [0.6016], [0.6680], [0.7500], [0.3750], [0.6680], [0.4668], [0.6016], [0.4004], [0.3340], [0.5000], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00250244140625 loss: 0.00182342529296875loss: 0.001983642578125 loss: 0.00133514404296875 predicted value: tensor([[1.0625], [0.8828], [0.7812], [0.7734], [1.0469], [0.8477], [0.5664], [1.0312], [0.2832], [0.5234], [0.5625], [0.3730], [0.7148], [0.5352], [0.2100], [0.2148]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.8008], [0.8008], [1.0000], [0.8008], [0.6016], [1.0000], [0.2500], [0.5000], [0.8008], [0.2500], [0.3340], [0.4668], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0038909912109375 loss: 0.00274658203125 loss: 0.000919342041015625 loss: 0.00372314453125 64%|██████▍ | 317/492 [2:52:33<1:31:38, 31.42s/it] {'loss': 0.0091, 'learning_rate': 1e-05, 'epoch': 0.64} 64%|██████▍ | 317/492 [2:52:33<1:31:38, 31.42s/it]predicted value: tensor([[0.5898], [1.0625], [1.0391], [0.7539], [1.0391], [0.7656], [0.6875], [0.6094], [0.2832], [0.7539], [0.6211], [0.6250], [0.5078], [0.2295], [0.2334], [0.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [1.0000], [1.0000], [0.6680], [1.0000], [0.8008], [0.6016], [0.4668], [0.2500], [0.8008], [0.6016], [0.7500], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00084686279296875 loss: 0.00131988525390625 loss: 0.0027313232421875 loss: 0.004730224609375 predicted value: tensor([[0.2363], [0.3281], [0.3008], [0.3809], [0.8516], [0.4883], [0.3398], [0.6055], [1.0625], [0.6797], [0.3105], [0.4766], [0.4062], [0.3945], [0.2471], [0.2314]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.2500], [0.2002], [0.3340], [0.8320], [0.4668], [0.2500], [0.6016], [1.0000], [0.6016], [0.3340], [0.6016], [0.4004], [0.5000], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00286865234375 loss: 0.000942230224609375 loss: 0.0011749267578125 loss: 0.002685546875 predicted value: tensor([[1.0547], [0.5898], [0.6211], [1.0391], [0.5078], [0.5000], [0.5352], [0.7109], [0.7227], [0.6719], [0.5430], [0.4492], [0.4141], [0.2617], [0.2520], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.5547], [1.0000], [0.3750], [0.4668], [0.5000], [0.6016], [0.6680], [0.6016], [0.6016], 
[0.3340], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00193023681640625 loss: 0.001373291015625 loss: 0.0010223388671875 loss: 0.00182342529296875 predicted value: tensor([[0.6914], [0.4941], [0.8438], [0.4629], [0.5039], [0.4844], [0.6914], [0.5938], [0.7305], [0.4492], [0.5469], [0.5195], [0.4980], [0.4688], [0.2617], [0.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.4668], [0.8320], [0.4668], [0.4668], [0.4668], [0.8008], [0.5547], [0.8008], [0.4004], [0.5000], [0.6016], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000614166259765625 loss: 0.000713348388671875 loss: 0.00110626220703125 loss: 0.0006561279296875 65%|██████▍ | 318/492 [2:53:04<1:30:54, 31.35s/it] {'loss': 0.0066, 'learning_rate': 1e-05, 'epoch': 0.65} 65%|██████▍ | 318/492 [2:53:04<1:30:54, 31.35s/it]predicted value: tensor([[0.5781], [0.9844], [0.5156], [0.2373], [0.4590], [0.4160], [0.2275], [0.5625], [0.1719], [0.4883], [0.3594], [0.3770], [0.4316], [0.3730], [0.1309], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [1.0000], [0.5547], [0.2500], [0.4668], [0.4668], [0.2500], [0.5547], [0.2500], [0.4668], [0.2852], [0.5000], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003326416015625 loss: 0.000804901123046875loss: 0.0010986328125 loss: 0.002716064453125 predicted value: tensor([[0.5195], [0.1533], [0.9844], [0.2520], [0.4863], [0.6094], [0.3691], [0.9766], [0.9805], [0.3750], [0.6953], [0.3105], [0.3555], [0.6016], [0.1387], [0.1582]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [1.0000], [0.3340], [0.4668], [0.7500], [0.4668], [1.0000], [1.0000], [0.4004], [0.8008], [0.4004], [0.4004], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00072479248046875 loss: 0.00139617919921875 loss: 0.00118255615234375 
loss: 0.001678466796875 predicted value: tensor([[0.9844], [0.4121], [0.4141], [0.6875], [0.7070], [1.0234], [0.9961], [0.9922], [0.2656], [0.6602], [0.5586], [0.2051], [0.9727], [0.1523], [0.1826], [0.1455]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3750], [0.8008], [0.8008], [1.0000], [1.0000], [1.0000], [0.2002], [0.6680], [0.6016], [0.2500], [1.0000], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00119781494140625 loss: 0.000675201416015625 loss: 0.002471923828125 loss: 0.00186920166015625 predicted value: tensor([[0.7578], [0.5586], [0.3809], [0.1270], [0.3926], [0.7539], [0.1738], [0.4238], [0.6094], [0.5859], [0.3398], [0.1787], [0.3789], [0.3066], [0.1729], [0.1416]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6172], [0.3750], [0.0625], [0.4668], [0.8320], [0.2002], [0.6016], [0.6016], [0.7500], [0.2500], [0.5000], [0.5000], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00176239013671875 loss: 0.0035552978515625 loss: 0.0035858154296875 loss: 0.001495361328125 65%|██████▍ | 319/492 [2:53:37<1:31:31, 31.74s/it] {'loss': 0.0074, 'learning_rate': 1e-05, 'epoch': 0.65} 65%|██████▍ | 319/492 [2:53:37<1:31:31, 31.74s/it]predicted value: tensor([[0.5195], [0.5586], [0.7188], [0.6484], [0.4082], [0.5508], [0.2773], [0.7539], [0.2070], [0.3594], [0.5430], [0.2295], [0.3242], [0.3496], [0.2041], [0.1426]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.8008], [0.6680], [0.4668], [0.3340], [0.2500], [0.8008], [0.3340], [0.3340], [0.6016], [0.2002], [0.4004], [0.3340], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0017242431640625 loss: 0.001434326171875loss: 0.00115966796875 loss: 0.0020904541015625 predicted value: tensor([[0.4062], [0.5156], [0.5273], [0.2051], [0.7383], [0.3477], [0.9844], [0.6758], [0.3164], [0.2617], [0.6016], [0.6133], 
[0.0310], [0.2949], [0.3809], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.5547], [0.2500], [0.8008], [0.3145], [1.0000], [0.8008], [0.3340], [0.3340], [0.5000], [0.7500], [0.0400], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00238037109375 loss: 0.00118255615234375 loss: 0.00099945068359375 loss: 0.0028533935546875 predicted value: tensor([[0.2891], [0.7656], [0.4355], [0.5117], [0.3516], [0.4883], [0.9961], [0.2207], [0.6562], [1.0000], [0.1836], [0.1104], [0.4727], [0.2070], [0.1719], [0.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.8320], [0.5547], [0.4668], [0.2500], [0.5000], [1.0000], [0.3340], [0.8008], [1.0000], [0.2500], [0.0278], [0.6016], [0.2500], [0.0400], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023193359375 loss: 0.00189971923828125loss: 0.004852294921875 loss: 0.000823974609375 predicted value: tensor([[0.4121], [0.4141], [0.9844], [0.3984], [0.5156], [1.0078], [0.4395], [0.9961], [0.4648], [0.5547], [0.5508], [0.3516], [0.3086], [0.2080], [0.1777], [0.1484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [1.0000], [0.4668], [0.8008], [1.0000], [0.3750], [1.0000], [0.3750], [0.4668], [0.6016], [0.4004], [0.4004], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002105712890625 loss: 0.00060272216796875 loss: 0.0028076171875 loss: 0.0015411376953125 65%|██████▌ | 320/492 [2:54:09<1:31:20, 31.87s/it] {'loss': 0.0077, 'learning_rate': 1e-05, 'epoch': 0.65} 65%|██████▌ | 320/492 [2:54:09<1:31:20, 31.87s/it]predicted value: tensor([[0.6172], [0.2949], [0.2676], [0.5117], [0.4746], [0.7227], [1.0391], [1.0469], [0.6133], [0.6133], [0.5547], [0.3770], [0.4219], [0.2695], [0.4336], [0.2441]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.2500], [0.3750], [0.4668], [0.8008], [1.0000], 
[1.0000], [0.5000], [0.3340], [0.4668], [0.4004], [0.3340], [0.2500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027008056640625 loss: 0.00193023681640625 loss: 0.0022735595703125 loss: 0.00081634521484375 predicted value: tensor([[0.8555], [0.6328], [0.2695], [0.4648], [0.5469], [1.0625], [0.6797], [0.4824], [0.5938], [0.4863], [0.6836], [0.6172], [0.3633], [0.4590], [0.4941], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.3340], [0.4668], [0.4668], [1.0000], [0.3750], [0.4668], [0.6016], [0.3145], [0.7500], [0.5000], [0.6016], [0.4004], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000926971435546875 loss: 0.00347900390625 loss: 0.000904083251953125 loss: 0.004608154296875 predicted value: tensor([[0.9023], [0.3164], [0.8594], [0.8438], [1.0469], [0.4980], [1.0547], [0.3301], [0.5078], [0.6445], [0.2852], [0.4199], [0.4277], [0.4238], [0.5039], [0.2246]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [0.8320], [0.8008], [1.0000], [0.4668], [1.0000], [0.2500], [0.4668], [0.6016], [0.2002], [0.4004], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016632080078125 loss: 0.00058746337890625 loss: 0.0019989013671875 loss: 0.0008697509765625 predicted value: tensor([[0.5234], [0.4883], [0.7930], [1.0234], [0.4863], [0.4707], [1.0469], [0.6562], [0.6094], [0.6797], [0.4902], [0.7031], [0.0447], [0.1719], [0.2334], [0.2285]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [1.0000], [0.3750], [0.3750], [1.0000], [0.6016], [0.6016], [0.6016], [0.5000], [0.7500], [0.0278], [0.0204], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001190185546875 loss: 0.0012359619140625 loss: 0.00138092041015625 loss: 0.00168609619140625 65%|██████▌ | 321/492 [2:54:41<1:31:25, 32.08s/it] {'loss': 0.0071, 'learning_rate': 1e-05, 
'epoch': 0.65} 65%|██████▌ | 321/492 [2:54:41<1:31:25, 32.08s/it]predicted value: tensor([[0.6289], [1.0234], [0.4629], [0.5117], [0.7852], [0.7734], [0.6797], [0.6445], [1.0312], [0.6133], [0.5781], [0.1670], [0.4473], [0.2236], [0.5469], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.4668], [0.8008], [0.8008], [0.6680], [0.7500], [1.0000], [0.8008], [0.5000], [0.0625], [0.5000], [0.1426], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00154876708984375 loss: 0.0016326904296875 loss: 0.00083160400390625 loss: 0.0016021728515625 predicted value: tensor([[0.8672], [0.7930], [0.5039], [1.0234], [0.4805], [0.3301], [0.5273], [0.6406], [1.0156], [0.6484], [0.6641], [0.5586], [0.3340], [0.2676], [0.4805], [0.2305]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.4668], [1.0000], [0.4668], [0.2500], [0.6016], [0.7500], [1.0000], [0.7500], [0.6016], [0.7500], [0.4004], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00213623046875 loss: 0.00153350830078125 loss: 0.00421142578125 loss: 0.00173187255859375 predicted value: tensor([[0.7031], [1.0312], [0.4629], [1.0234], [1.0156], [0.6914], [0.4727], [0.6094], [0.2871], [0.4961], [0.6406], [0.4551], [0.2373], [0.2578], [0.2539], [0.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [1.0000], [0.4668], [1.0000], [1.0000], [0.7500], [0.4668], [0.7500], [0.2500], [0.5000], [0.5000], [0.4004], [0.2002], [0.2002], [0.2002], [0.0400]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00084686279296875 loss: 0.00128173828125loss: 0.00055694580078125 loss: 0.001953125 predicted value: tensor([[0.3145], [0.4883], [0.2930], [0.2949], [0.5039], [0.4902], [0.6641], [0.7148], [1.0312], [1.0312], [0.4316], [0.4531], [0.4043], [0.2285], [0.2129], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], 
[0.4668], [0.2500], [0.2500], [0.4668], [0.4668], [0.6016], [0.7500], [1.0000], [1.0000], [0.4004], [0.4004], [0.3340], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00139617919921875 loss: 0.0004405975341796875 loss: 0.0020599365234375 loss: 0.0005645751953125 65%|██████▌ | 322/492 [2:55:13<1:30:54, 32.08s/it] {'loss': 0.0061, 'learning_rate': 1e-05, 'epoch': 0.65} 65%|██████▌ | 322/492 [2:55:13<1:30:54, 32.08s/it]predicted value: tensor([[0.7617], [0.5508], [0.2432], [0.5508], [0.4492], [0.2314], [0.2871], [0.9727], [0.3672], [0.4043], [0.5273], [0.5742], [0.3145], [0.3301], [0.1807], [0.1777]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.3340], [0.5000], [0.3750], [0.2500], [0.3340], [1.0000], [0.4004], [0.5000], [0.6016], [0.6016], [0.3340], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0004825592041015625 loss: 0.00152587890625 loss: 0.0012359619140625 loss: 0.00135040283203125 predicted value: tensor([[0.4434], [0.9492], [0.9492], [0.9531], [0.2324], [0.4199], [0.4258], [0.4746], [0.5234], [0.4238], [0.6641], [0.3613], [0.3789], [0.1465], [0.1992], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [1.0000], [0.2500], [0.4668], [0.4668], [0.3750], [0.5000], [0.5000], [0.7500], [0.3340], [0.4004], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000762939453125 loss: 0.000812530517578125 loss: 0.000843048095703125 loss: 0.000812530517578125 predicted value: tensor([[0.9414], [0.3809], [0.4082], [0.3711], [0.7344], [0.4062], [0.2734], [0.5586], [0.3516], [0.5156], [0.5430], [0.0304], [0.3770], [0.2266], [0.1885], [0.1445]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3750], [0.3145], [0.8008], [0.4668], [0.2002], [0.5547], [0.5000], [0.5000], [0.1670], [0.0400], [0.3340], [0.1670], [0.2500], [0.1113]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.000644683837890625 loss: 0.0031585693359375 loss: 0.00104522705078125 loss: 0.0010833740234375 predicted value: tensor([[ 0.5469], [ 0.4180], [ 0.7773], [ 0.2930], [ 0.4062], [ 0.6172], [ 0.9414], [ 0.4785], [ 0.5938], [ 0.6055], [ 0.5234], [ 0.3242], [ 0.4785], [ 0.5703], [-0.0155], [ 0.1572]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3145], [0.8320], [0.3340], [0.4668], [0.8320], [1.0000], [0.6016], [0.7500], [0.6016], [0.6016], [0.3340], [0.6016], [0.7500], [0.0400], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003692626953125 loss: 0.000949859619140625 loss: 0.0026092529296875 loss: 0.0019989013671875 66%|██████▌ | 323/492 [2:55:47<1:31:19, 32.42s/it] {'loss': 0.0058, 'learning_rate': 1e-05, 'epoch': 0.66} 66%|██████▌ | 323/492 [2:55:47<1:31:19, 32.42s/it]predicted value: tensor([[0.9609], [0.3984], [0.5000], [0.4082], [0.9375], [0.3926], [0.4570], [0.4980], [0.9609], [0.5039], [0.5312], [0.5469], [0.4160], [0.1660], [0.2168], [0.1973]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.5547], [0.4668], [1.0000], [0.4668], [0.3750], [0.5000], [1.0000], [0.6016], [0.6016], [0.5000], [0.3340], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020294189453125 loss: 0.000965118408203125 loss: 0.00090789794921875 loss: 0.00128173828125 predicted value: tensor([[0.4043], [0.4277], [0.8164], [0.8164], [0.4277], [0.4648], [0.7070], [0.3945], [0.7812], [0.4570], [0.4258], [0.4980], [0.2471], [0.4199], [0.2031], [0.1777]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8320], [0.8320], [0.4668], [0.5000], [0.8008], [0.4668], [0.8008], [0.4004], [0.4668], [0.4668], [0.2002], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00060272216796875 loss: 0.00128173828125loss: 0.0005950927734375 loss: 0.000782012939453125 predicted 
value: tensor([[0.4219], [0.3965], [0.4238], [0.3848], [0.2461], [0.6875], [0.1963], [0.6133], [0.7227], [0.2373], [0.2002], [0.5820], [0.9453], [0.3184], [0.1650], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.4668], [0.4668], [0.2500], [0.8008], [0.2500], [0.7500], [0.8008], [0.2500], [0.2002], [0.5000], [1.0000], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000885009765625 loss: 0.00109100341796875 loss: 0.000762939453125 loss: 0.0013885498046875 predicted value: tensor([[0.3926], [0.4121], [0.7383], [0.4023], [0.7305], [0.4102], [0.6484], [0.9336], [0.6016], [0.5898], [0.2852], [0.3770], [0.5117], [0.3594], [0.4219], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8320], [0.4668], [0.6680], [0.4668], [0.8320], [1.0000], [0.8008], [0.6016], [0.3340], [0.4004], [0.5000], [0.4004], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00189208984375 loss: 0.0010986328125 loss: 0.00130462646484375 loss: 0.0019073486328125 66%|██████▌ | 324/492 [2:56:19<1:31:02, 32.52s/it] {'loss': 0.0047, 'learning_rate': 1e-05, 'epoch': 0.66} 66%|██████▌ | 324/492 [2:56:19<1:31:02, 32.52s/it]predicted value: tensor([[0.5156], [0.4609], [0.5234], [0.7109], [0.7031], [1.0547], [0.6953], [0.3203], [0.8477], [0.3281], [0.8008], [0.4473], [0.1074], [0.2012], [0.2559], [0.2139]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.6680], [0.8008], [1.0000], [0.6016], [0.2500], [0.8008], [0.2500], [0.8008], [0.3340], [0.0625], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010528564453125 loss: 0.00092315673828125loss: 0.0015716552734375 loss: 0.00127410888671875 predicted value: tensor([[0.5703], [0.4805], [0.5820], [1.0312], [1.0391], [0.4785], [1.0156], [0.7969], [0.5820], [0.6289], [0.3340], [0.5547], [0.5195], [0.4766], [0.2461], [0.2354]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [1.0000], [1.0000], [0.4668], [1.0000], [0.6680], [0.5000], [0.6016], [0.2002], [0.4004], [0.5000], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000659942626953125 loss: 0.0014190673828125 loss: 0.00142669677734375 loss: 0.000949859619140625 predicted value: tensor([[0.4922], [0.6055], [0.4570], [0.3105], [0.7891], [0.5508], [0.9023], [0.4746], [0.3086], [1.0312], [0.4883], [1.0234], [0.6641], [0.2793], [0.2559], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.4668], [0.2500], [0.8008], [0.4668], [0.8320], [0.4668], [0.2500], [1.0000], [0.3340], [1.0000], [0.6016], [0.1426], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00133514404296875 loss: 0.00133514404296875loss: 0.0037994384765625 loss: 0.0016021728515625 predicted value: tensor([[0.8555], [0.9844], [0.3398], [0.6016], [0.5273], [0.7734], [0.6289], [0.8438], [1.0000], [1.0469], [0.3320], [0.6289], [0.3340], [0.4922], [0.2178], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.3340], [0.5547], [0.3750], [0.8008], [0.6016], [0.8008], [1.0000], [1.0000], [0.3340], [0.7500], [0.2500], [0.4004], [0.2002], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024871826171875 loss: 0.0010223388671875 loss: 0.0012969970703125 loss: 0.00104522705078125 66%|██████▌ | 325/492 [2:56:51<1:29:54, 32.30s/it] {'loss': 0.0058, 'learning_rate': 1e-05, 'epoch': 0.66} 66%|██████▌ | 325/492 [2:56:51<1:29:54, 32.30s/it]predicted value: tensor([[0.9219], [0.4727], [0.8633], [0.4824], [0.8008], [0.6406], [0.2891], [0.8438], [1.0312], [0.6562], [0.3184], [0.5234], [0.4336], [0.2119], [0.4570], [0.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.8320], [0.4668], [0.8008], [0.4668], [0.2002], [0.8008], [1.0000], 
[0.8008], [0.3340], [0.4004], [0.3340], [0.2002], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000751495361328125 loss: 0.001129150390625 loss: 0.001708984375 loss: 0.000919342041015625 predicted value: tensor([[0.5195], [0.4941], [0.8047], [1.0078], [0.4863], [0.3164], [0.4688], [0.7070], [0.5273], [1.0156], [0.3965], [0.5234], [0.3613], [1.0156], [0.2520], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8320], [1.0000], [0.4668], [0.3340], [0.3145], [0.8008], [0.4277], [1.0000], [0.4004], [0.5000], [0.2002], [1.0000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00145721435546875 loss: 0.0012054443359375loss: 0.00213623046875 loss: 0.000537872314453125 predicted value: tensor([[0.4531], [0.7539], [0.8242], [0.6289], [1.0156], [0.5078], [1.0078], [0.5859], [0.8320], [0.7812], [0.4863], [0.4961], [0.4258], [0.3945], [0.4648], [0.2148]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.6250], [0.5547], [1.0000], [0.4668], [1.0000], [0.5547], [0.8008], [0.5703], [0.5000], [0.4004], [0.4004], [0.4004], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0003509521484375 loss: 0.0020599365234375 loss: 0.0005340576171875 loss: 0.000640869140625 predicted value: tensor([[1.0000], [0.4844], [0.8398], [0.4629], [1.0312], [0.7930], [0.2969], [0.6680], [0.3223], [0.2734], [0.5430], [0.3945], [1.0625], [0.2988], [0.4883], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8320], [0.4668], [1.0000], [0.8008], [0.2500], [0.7500], [0.2500], [0.2500], [0.4004], [0.4004], [1.0000], [0.2500], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027008056640625 loss: 0.00176239013671875 loss: 0.00072479248046875 loss: 0.00084686279296875 66%|██████▋ | 326/492 [2:57:23<1:28:34, 32.01s/it] {'loss': 0.0049, 'learning_rate': 1e-05, 'epoch': 0.66} 66%|██████▋ | 
326/492 [2:57:23<1:28:34, 32.01s/it]predicted value: tensor([[0.5586], [0.7930], [0.9453], [0.8164], [0.5977], [0.1816], [0.6641], [0.5352], [0.9766], [0.2236], [0.7695], [0.9727], [0.6758], [0.1914], [0.1396], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.8320], [1.0000], [0.8320], [0.6016], [0.2002], [0.7500], [0.5000], [1.0000], [0.3340], [0.8008], [1.0000], [0.6016], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013427734375 loss: 0.000720977783203125 loss: 0.00139617919921875 loss: 0.00099945068359375 predicted value: tensor([[0.4082], [0.3965], [0.6562], [0.2754], [0.4238], [0.7578], [0.3535], [0.7031], [0.4160], [0.2266], [0.0557], [0.3242], [0.3926], [0.6406], [0.1533], [0.1484]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.5547], [0.3340], [0.5000], [0.8008], [0.3340], [0.7500], [0.5000], [0.2500], [0.0625], [0.4004], [0.5000], [0.7500], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00102996826171875 loss: 0.00109100341796875 loss: 0.001983642578125 loss: 0.0020599365234375 predicted value: tensor([[0.7578], [0.7188], [0.9570], [0.4082], [0.7734], [0.9648], [0.4141], [0.4531], [0.7266], [0.3672], [0.8008], [0.6445], [0.6406], [0.2002], [0.1611], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [1.0000], [0.8008], [0.8008], [1.0000], [0.4668], [0.4668], [0.8008], [0.4004], [0.8008], [0.8008], [0.6016], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016632080078125 loss: 0.00156402587890625 loss: 0.0032958984375 loss: 0.00107574462890625 predicted value: tensor([[0.5547], [0.3516], [0.4258], [0.2910], [0.9258], [0.4336], [0.7109], [0.4727], [0.5547], [0.5234], [0.3711], [0.2930], [0.3906], [0.3613], [0.1523], [0.1826]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], 
[0.2500], [1.0000], [0.4668], [0.8008], [0.8008], [0.6172], [0.5000], [0.4004], [0.5000], [0.4004], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001953125 loss: 0.002777099609375 loss: 0.0009918212890625 loss: 0.00152587890625 66%|██████▋ | 327/492 [2:57:54<1:27:52, 31.95s/it] {'loss': 0.0064, 'learning_rate': 1e-05, 'epoch': 0.66} 66%|██████▋ | 327/492 [2:57:54<1:27:52, 31.95s/it]predicted value: tensor([[0.4258], [0.9414], [0.4141], [0.4238], [0.9766], [0.5742], [0.6758], [0.6055], [0.7578], [0.2344], [0.9766], [0.7383], [0.4492], [0.0457], [0.2852], [0.1621]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [0.4668], [1.0000], [0.5000], [0.6680], [0.7500], [0.8008], [0.2500], [1.0000], [0.8008], [0.5000], [0.0625], [0.2852], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000949859619140625 loss: 0.0027923583984375 loss: 0.00128173828125 loss: 0.000743865966796875 predicted value: tensor([[0.4277], [0.4277], [0.3926], [0.9492], [0.2217], [0.5469], [0.4531], [0.4141], [0.3887], [0.6094], [0.3730], [0.4473], [0.3770], [0.6094], [0.1572], [0.1621]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [1.0000], [0.2500], [0.6016], [0.4668], [0.4668], [0.3750], [0.6016], [0.3750], [0.4668], [0.4004], [0.7500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023040771484375 loss: 0.000675201416015625 loss: 0.003265380859375 loss: 0.000484466552734375 predicted value: tensor([[0.3848], [0.3594], [0.7734], [0.7188], [0.4414], [0.5430], [0.4316], [0.2188], [0.4492], [0.2041], [0.4004], [0.3359], [0.6250], [0.6602], [0.1719], [0.1553]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [0.8008], [0.4668], [0.4668], [0.5000], [0.2500], [0.3750], [0.2500], [0.4668], [0.5000], [0.5000], [0.6016], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.00136566162109375 loss: 0.0014495849609375loss: 0.0010223388671875 loss: 0.00051116943359375 predicted value: tensor([[0.4023], [0.5273], [0.9688], [0.4336], [0.7969], [0.8398], [0.4961], [0.5625], [0.5430], [0.9609], [0.6641], [0.5508], [0.5039], [0.3613], [0.1582], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [1.0000], [0.4668], [0.8008], [0.8320], [0.6016], [0.6016], [0.3750], [1.0000], [0.7500], [0.6016], [0.5000], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0005645751953125 loss: 0.003814697265625 loss: 0.0009765625 loss: 0.0013885498046875 67%|██████▋ | 328/492 [2:58:26<1:26:48, 31.76s/it] {'loss': 0.0059, 'learning_rate': 1e-05, 'epoch': 0.67} 67%|██████▋ | 328/492 [2:58:26<1:26:48, 31.76s/it]predicted value: tensor([[0.8477], [0.7930], [0.2891], [1.0391], [0.3496], [1.0312], [0.2871], [1.0391], [0.6680], [0.5898], [0.4258], [1.0234], [0.4824], [0.4766], [0.2148], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.8008], [0.2500], [1.0000], [0.3340], [1.0000], [0.3340], [1.0000], [0.6016], [0.6016], [0.4004], [1.0000], [0.4004], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00135040283203125 loss: 0.000881195068359375 loss: 0.0004558563232421875 loss: 0.0010986328125 predicted value: tensor([[0.5859], [0.5156], [0.6914], [0.7852], [0.6133], [0.3203], [0.6445], [0.5078], [0.4668], [0.4824], [1.0547], [0.3535], [0.4336], [0.3965], [0.2520], [0.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.6680], [0.8008], [0.4668], [0.2500], [0.5547], [0.4668], [0.3750], [0.7500], [1.0000], [0.2500], [0.3340], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00140380859375 loss: 0.0023193359375loss: 0.0010223388671875 loss: 0.00084686279296875 predicted value: tensor([[0.5820], [0.3438], [0.2988], [1.0156], [0.2930], [0.5938], 
[1.0625], [0.3809], [0.7812], [0.6211], [0.6172], [0.4688], [0.4883], [0.3828], [0.2373], [0.1992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [0.2500], [1.0000], [0.2500], [0.4668], [1.0000], [0.3340], [0.8008], [0.5000], [0.5000], [0.4004], [0.5000], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00122833251953125 loss: 0.00102996826171875 loss: 0.0023040771484375 loss: 0.00145721435546875 predicted value: tensor([[0.4922], [0.5898], [1.0469], [0.7891], [1.0391], [1.0859], [1.0547], [0.6445], [0.6367], [1.0234], [0.7539], [0.4590], [0.6094], [0.4062], [0.1992], [0.2373]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4648], [1.0000], [0.8008], [1.0000], [1.0000], [1.0000], [0.3145], [0.7500], [1.0000], [0.6680], [0.4004], [0.5000], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00238037109375 loss: 0.002838134765625 loss: 0.0018768310546875 loss: 0.0019683837890625 67%|██████▋ | 329/492 [2:58:57<1:26:11, 31.73s/it] {'loss': 0.0061, 'learning_rate': 1e-05, 'epoch': 0.67} 67%|██████▋ | 329/492 [2:58:57<1:26:11, 31.73s/it]predicted value: tensor([[0.5742], [0.5117], [0.4824], [0.3203], [0.3125], [0.8320], [0.8242], [1.0391], [0.6758], [1.0547], [1.0391], [1.0547], [0.5078], [0.2490], [0.2285], [0.2334]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.3340], [0.3340], [0.8008], [0.8008], [1.0000], [0.5000], [1.0000], [1.0000], [1.0000], [0.5000], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00110626220703125 loss: 0.000774383544921875loss: 0.003204345703125 loss: 0.0009918212890625 predicted value: tensor([[0.5703], [0.6055], [0.5195], [1.0234], [0.7031], [0.3730], [0.5469], [0.6680], [1.0547], [0.6914], [0.7695], [0.5820], [0.6914], [0.4473], [0.2266], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.5547], [0.5547], [0.4668], [1.0000], [0.7500], [0.3340], [0.3750], [0.7500], [1.0000], [0.5000], [0.8008], [0.7148], [0.5000], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00141143798828125 loss: 0.0023040771484375loss: 0.000789642333984375 loss: 0.001953125 predicted value: tensor([[0.4961], [0.6875], [0.3223], [0.5977], [0.6250], [0.7734], [0.4805], [0.5273], [0.8477], [0.4199], [0.6602], [0.6133], [0.4004], [0.3984], [0.2412], [0.2070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [0.3340], [0.4668], [0.6016], [0.8008], [0.4668], [0.8008], [0.8320], [0.4004], [0.6016], [0.5000], [0.6016], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0038299560546875 loss: 0.0024871826171875loss: 0.00160980224609375 loss: 0.0011749267578125 predicted value: tensor([[0.6055], [0.3184], [1.0234], [0.5039], [1.0469], [0.7773], [0.3164], [0.6992], [0.5234], [0.8281], [0.1855], [0.4141], [0.4766], [0.4355], [0.2305], [0.2207]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [1.0000], [0.4668], [1.0000], [0.8008], [0.2500], [0.7500], [0.4668], [0.5703], [0.0625], [0.3340], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00124359130859375 loss: 0.0027618408203125 loss: 0.0007781982421875 loss: 0.00185394287109375 67%|██████▋ | 330/492 [2:59:30<1:26:16, 31.96s/it] {'loss': 0.0071, 'learning_rate': 1e-05, 'epoch': 0.67} 67%|██████▋ | 330/492 [2:59:30<1:26:16, 31.96s/it]predicted value: tensor([[0.3945], [0.2471], [0.9688], [0.7070], [0.3770], [0.7656], [0.9844], [0.2314], [0.7500], [0.5352], [0.4023], [0.4199], [0.6250], [0.1533], [0.2012], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.2500], [1.0000], [0.6680], [0.3750], [0.8008], [1.0000], [0.1670], [0.8008], [0.6016], [0.5000], [0.5000], [0.6016], [0.2500], [0.2500], [0.2002]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.000804901123046875 loss: 0.00159454345703125 loss: 0.000698089599609375 loss: 0.000720977783203125 predicted value: tensor([[ 0.8086], [ 0.4141], [ 0.7852], [ 0.7734], [ 0.3125], [ 0.5156], [ 0.4629], [ 0.9883], [ 0.5156], [ 0.5391], [ 0.5469], [-0.0356], [ 0.4551], [ 0.6523], [ 0.9062], [ 0.1826]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.8008], [0.8008], [0.3340], [0.5547], [0.6016], [1.0000], [0.5000], [0.5000], [0.6016], [0.0400], [0.5000], [0.6016], [1.0000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000965118408203125 loss: 0.000827789306640625 loss: 0.0006866455078125 loss: 0.0013580322265625 predicted value: tensor([[0.4297], [0.3926], [0.9531], [0.9688], [0.4551], [0.4355], [0.4492], [0.4141], [0.2451], [0.6250], [0.1787], [0.3340], [0.3711], [0.1934], [0.1699], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [1.0000], [1.0000], [0.4668], [0.3750], [0.4668], [0.2500], [0.2500], [0.7500], [0.2002], [0.4004], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0031280517578125loss: 0.001953125 loss: 0.000701904296875 loss: 0.00104522705078125 predicted value: tensor([[0.7070], [0.4062], [0.9492], [0.4062], [0.4238], [0.8164], [0.2197], [0.2295], [0.9844], [0.7539], [0.2910], [0.3789], [0.6172], [0.3379], [0.1641], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.3145], [1.0000], [0.4668], [0.3750], [0.8555], [0.2500], [0.2500], [1.0000], [0.8008], [0.4004], [0.3340], [0.4668], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00147247314453125 loss: 0.0016021728515625 loss: 0.00113677978515625 loss: 0.00107574462890625 67%|██████▋ | 331/492 [3:00:02<1:26:17, 32.16s/it] {'loss': 0.0049, 'learning_rate': 1e-05, 'epoch': 0.67} 67%|██████▋ | 331/492 [3:00:02<1:26:17, 32.16s/it]predicted 
value: tensor([[0.2598], [1.0000], [0.4277], [0.9844], [0.6953], [0.5742], [0.9727], [0.2129], [0.4355], [0.9727], [0.9727], [0.9648], [0.5898], [0.3652], [0.3496], [0.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [1.0000], [0.4668], [1.0000], [0.8008], [0.6016], [1.0000], [0.2002], [0.4668], [1.0000], [1.0000], [1.0000], [0.6016], [0.5000], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0014495849609375 loss: 0.0009765625 loss: 0.00069427490234375 loss: 0.000606536865234375 predicted value: tensor([[0.6094], [0.4199], [0.3789], [0.9961], [0.6523], [0.9766], [0.7344], [0.9844], [0.7148], [0.4160], [0.7930], [0.9844], [0.4258], [0.3379], [0.3867], [0.1826]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5312], [0.3750], [0.3750], [1.0000], [0.8320], [1.0000], [0.5703], [1.0000], [0.8008], [0.3750], [0.8320], [1.0000], [0.5000], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0022430419921875 loss: 0.001739501953125 loss: 0.000888824462890625 loss: 0.00040435791015625 predicted value: tensor([[0.4668], [0.9531], [0.2500], [0.4863], [0.5938], [0.7773], [0.4941], [0.3828], [0.4199], [0.3984], [0.1973], [0.6406], [0.3906], [0.1650], [0.1768], [0.1982]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.2002], [0.4668], [0.5547], [0.8008], [0.5000], [0.4668], [0.3750], [0.3750], [0.4004], [0.5000], [0.5000], [0.2500], [0.2500], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016632080078125 loss: 0.00188446044921875 loss: 0.002105712890625 loss: 0.000732421875 predicted value: tensor([[0.4258], [0.4434], [0.7930], [0.7070], [0.9609], [0.6719], [0.4297], [0.6523], [0.5664], [0.4395], [0.4434], [0.2363], [0.4238], [0.1973], [0.1729], [0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8320], [0.6680], [1.0000], [0.6680], [0.4668], [0.4668], [0.6016], 
[0.4004], [0.5000], [0.3340], [0.4668], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003143310546875 loss: 0.0010528564453125 loss: 0.0011444091796875 loss: 0.0026092529296875 67%|██████▋ | 332/492 [3:00:35<1:25:47, 32.17s/it] {'loss': 0.0058, 'learning_rate': 1e-05, 'epoch': 0.67} 67%|██████▋ | 332/492 [3:00:35<1:25:47, 32.17s/it]predicted value: tensor([[0.5820], [1.0469], [0.6133], [1.0312], [0.7695], [0.7773], [0.6133], [0.4395], [0.6094], [0.6445], [0.7148], [0.5469], [0.6680], [0.4121], [0.2656], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [1.0000], [0.6680], [0.8008], [0.6016], [0.3750], [0.6016], [0.5000], [0.6016], [0.5000], [0.7500], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004241943359375 loss: 0.001556396484375 loss: 0.004119873046875 loss: 0.00112152099609375 predicted value: tensor([[0.4961], [0.7344], [0.6719], [0.7188], [0.7695], [0.4746], [1.0547], [1.0469], [0.5859], [0.4375], [1.0547], [0.7344], [0.3965], [0.4355], [0.2969], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.7500], [0.6016], [0.8008], [0.6680], [0.3750], [1.0000], [1.0000], [0.5000], [0.2852], [1.0000], [0.7500], [0.3340], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002532958984375 loss: 0.00095367431640625 loss: 0.001312255859375 loss: 0.0010833740234375 predicted value: tensor([[0.6016], [1.0469], [0.6172], [0.7109], [0.4453], [1.0391], [0.8086], [0.2812], [0.3574], [0.6914], [0.2793], [0.4355], [0.5156], [0.4902], [0.2832], [0.2383]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.5547], [0.4668], [0.3750], [1.0000], [0.8008], [0.2002], [0.3340], [0.6016], [0.2500], [0.4004], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024871826171875 loss: 0.0016937255859375loss: 
0.00128173828125 loss: 0.00135040283203125 predicted value: tensor([[0.6172], [0.5391], [1.0312], [1.0547], [1.0391], [0.7812], [0.6328], [0.7773], [0.6133], [0.3887], [0.5000], [0.5234], [0.5156], [0.5820], [0.2656], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [1.0000], [1.0000], [1.0000], [0.8008], [0.5000], [0.7500], [0.6016], [0.2500], [0.5000], [0.4004], [0.5000], [0.4668], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00159454345703125 loss: 0.00130462646484375 loss: 0.0006103515625 loss: 0.0010986328125 68%|██████▊ | 333/492 [3:01:08<1:25:56, 32.43s/it] {'loss': 0.0071, 'learning_rate': 1e-05, 'epoch': 0.68} 68%|██████▊ | 333/492 [3:01:08<1:25:56, 32.43s/it]predicted value: tensor([[1.0703], [0.6367], [0.7617], [1.0312], [0.3125], [0.3652], [1.0469], [0.7227], [0.4648], [0.4043], [1.0312], [0.4297], [0.4375], [0.2578], [0.2988], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.8008], [1.0000], [0.2500], [0.3340], [1.0000], [0.6680], [0.3750], [0.3340], [1.0000], [0.4004], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00078582763671875 loss: 0.00177764892578125 loss: 0.0010223388671875 loss: 0.00115203857421875 predicted value: tensor([[0.6016], [0.8711], [0.3301], [0.4922], [0.2891], [1.0547], [0.6523], [0.5000], [1.0547], [0.2715], [0.5078], [1.0312], [0.4023], [0.2520], [0.2451], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.2500], [0.3750], [0.2500], [1.0000], [0.6016], [0.2500], [1.0000], [0.2002], [0.4004], [1.0000], [0.3340], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00109100341796875 loss: 0.002105712890625 loss: 0.0010833740234375 loss: 0.00060272216796875 predicted value: tensor([[1.0234], [0.3457], [0.5117], [0.4766], [1.0234], [0.6992], [0.8359], [0.7695], [0.3398], [0.4531], 
[0.5938], [0.4922], [0.4121], [0.4746], [0.4023], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [0.4668], [0.4668], [1.0000], [0.7500], [0.8008], [0.4668], [0.3340], [0.4004], [0.6016], [0.4668], [0.4004], [0.4004], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00119781494140625 loss: 0.00201416015625loss: 0.00113677978515625 loss: 0.0020294189453125 predicted value: tensor([[1.0234], [0.4727], [1.0391], [0.5664], [1.0391], [0.6719], [0.5352], [0.6602], [0.6445], [0.5273], [0.5938], [0.4961], [0.4141], [0.2715], [0.4473], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [1.0000], [0.4668], [1.0000], [0.7500], [0.6016], [0.6016], [0.7500], [0.4668], [0.3750], [0.5000], [0.4004], [0.2500], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020904541015625 loss: 0.0016021728515625 loss: 0.0012359619140625 loss: 0.000690460205078125 68%|██████▊ | 334/492 [3:01:41<1:26:03, 32.68s/it] {'loss': 0.0054, 'learning_rate': 1e-05, 'epoch': 0.68} 68%|██████▊ | 334/492 [3:01:41<1:26:03, 32.68s/it]predicted value: tensor([[0.3906], [0.4043], [0.4531], [0.2539], [0.6562], [0.3887], [0.9727], [0.2402], [0.5000], [0.0859], [0.3711], [0.5156], [0.9375], [0.1924], [0.1934], [0.2100]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.2500], [0.6680], [0.4668], [1.0000], [0.2500], [0.8008], [0.0625], [0.4004], [0.5000], [1.0000], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00098419189453125 loss: 0.00188446044921875loss: 0.000957489013671875 loss: 0.0011749267578125 predicted value: tensor([[0.6992], [0.3672], [0.7266], [0.4883], [0.9766], [0.7070], [0.3965], [0.5820], [0.4043], [0.2070], [0.2119], [0.3496], [0.5859], [0.4473], [0.1943], [0.2402]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.3750], [0.8008], [0.5547], 
[1.0000], [0.7148], [0.4668], [0.6016], [0.4668], [0.2500], [0.2500], [0.4004], [0.7500], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00101470947265625 loss: 0.000881195068359375 loss: 0.00074005126953125 loss: 0.002960205078125 predicted value: tensor([[0.4355], [0.7070], [0.9727], [0.4082], [0.2598], [0.6055], [0.7695], [0.2500], [0.5859], [0.5273], [0.3574], [0.5430], [0.3672], [0.1953], [0.4688], [0.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [1.0000], [0.4668], [0.2500], [0.6016], [0.8008], [0.2500], [0.6016], [0.6016], [0.4004], [0.5000], [0.5000], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00142669677734375 loss: 0.00078582763671875loss: 0.000789642333984375 loss: 0.000881195068359375 predicted value: tensor([[0.9375], [0.3848], [0.4082], [0.2100], [0.9883], [0.0776], [0.5234], [0.7188], [0.5938], [0.2148], [0.2754], [0.6602], [0.5469], [0.1973], [0.2031], [0.2090]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.4668], [0.2500], [1.0000], [0.0400], [0.6016], [0.8320], [0.6016], [0.3340], [0.2852], [0.7500], [0.6016], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001190185546875 loss: 0.00115966796875 loss: 0.00054931640625 loss: 0.00093841552734375 68%|██████▊ | 335/492 [3:02:14<1:25:29, 32.67s/it] {'loss': 0.0046, 'learning_rate': 1e-05, 'epoch': 0.68} 68%|██████▊ | 335/492 [3:02:14<1:25:29, 32.67s/it]predicted value: tensor([[ 0.9688], [ 0.7852], [ 0.7383], [ 0.4727], [ 0.9805], [ 0.6250], [ 0.7109], [ 0.3008], [ 0.4590], [-0.0244], [ 0.7305], [ 0.4336], [ 0.5234], [ 0.2070], [ 0.1855], [ 0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.8320], [0.5547], [1.0000], [0.6680], [0.8008], [0.6016], [0.4277], [0.0400], [0.8008], [0.5000], [0.6016], [0.2500], [0.1670], [0.1670]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.00139617919921875 loss: 0.00133514404296875 loss: 0.002227783203125 loss: 0.0032501220703125 predicted value: tensor([[0.2773], [0.6211], [0.4238], [0.3965], [0.9688], [0.7109], [0.2559], [0.9766], [0.6719], [0.5039], [0.3789], [0.9883], [0.9727], [0.1953], [0.4688], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.8008], [0.4668], [0.4668], [1.0000], [0.8008], [0.2500], [1.0000], [0.7500], [0.4668], [0.4004], [1.0000], [1.0000], [0.2002], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000823974609375 loss: 0.00098419189453125 loss: 0.0007781982421875 loss: 0.0029754638671875 predicted value: tensor([[0.7891], [0.4219], [0.3867], [0.3828], [0.9531], [0.6250], [0.9844], [0.9648], [0.3867], [0.4199], [0.2090], [0.4023], [0.3438], [0.2207], [0.1768], [0.2158]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.4668], [0.4668], [1.0000], [0.5000], [1.0000], [1.0000], [0.4004], [0.5000], [0.3340], [0.4004], [0.4004], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00089263916015625 loss: 0.0009918212890625 loss: 0.00086212158203125 loss: 0.0016021728515625 predicted value: tensor([[0.1885], [0.4180], [0.2598], [0.5039], [0.6992], [0.4414], [0.6328], [0.4883], [0.9844], [0.9805], [0.3848], [0.4609], [0.4082], [0.1924], [0.2314], [0.2080]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.3750], [0.3340], [0.5547], [0.8008], [0.4668], [0.6680], [0.5000], [1.0000], [1.0000], [0.4004], [0.5000], [0.5000], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000759124755859375 loss: 0.00144195556640625 loss: 0.000583648681640625loss: 0.002410888671875 68%|██████▊ | 336/492 [3:02:46<1:24:45, 32.60s/it] {'loss': 0.0058, 'learning_rate': 1e-05, 'epoch': 0.68} 68%|██████▊ | 336/492 [3:02:46<1:24:45, 32.60s/it]predicted value: tensor([[0.4766], [0.4629], 
[0.8125], [0.5664], [0.3164], [1.0469], [0.4023], [0.7930], [0.4551], [1.0547], [0.4238], [0.4902], [0.4395], [0.4531], [0.2617], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.8008], [0.5547], [0.3340], [1.0000], [0.2500], [0.8008], [0.4004], [1.0000], [0.3340], [0.4004], [0.3340], [0.3340], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00170135498046875 loss: 0.00156402587890625loss: 0.001617431640625 loss: 0.000774383544921875 predicted value: tensor([[1.0469], [1.0625], [1.0234], [1.0234], [0.7773], [0.6055], [0.4648], [1.0547], [1.0469], [0.7695], [0.6172], [0.4043], [0.4082], [0.2656], [0.4844], [0.2930]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [1.0000], [1.0000], [0.8008], [0.6016], [0.3750], [1.0000], [1.0000], [0.8008], [0.6016], [0.2002], [0.4004], [0.2002], [0.2852], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010833740234375 loss: 0.00171661376953125 loss: 0.0010223388671875 loss: 0.0011138916015625 predicted value: tensor([[0.8984], [0.6875], [0.3477], [0.5625], [1.0391], [0.5586], [1.0469], [0.4980], [0.6172], [0.4629], [0.7148], [0.5977], [0.4434], [0.4902], [0.2773], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4648], [0.2500], [0.4668], [1.0000], [0.4668], [1.0000], [0.4668], [0.6016], [0.5000], [0.6016], [0.6016], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00139617919921875 loss: 0.0019073486328125 loss: 0.00250244140625 loss: 0.000888824462890625 predicted value: tensor([[0.5586], [1.0469], [0.4648], [0.8008], [0.5586], [0.6055], [1.0547], [0.6367], [0.5977], [0.5273], [0.6680], [0.4434], [0.4453], [0.2754], [0.2539], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.8008], [0.5547], [0.6016], [1.0000], [0.6016], [0.6016], [0.5000], [0.6016], 
[0.4004], [0.4004], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00054168701171875loss: 0.00131988525390625 loss: 0.000701904296875 loss: 0.000732421875 68%|██████▊ | 337/492 [3:03:18<1:24:03, 32.54s/it] {'loss': 0.0051, 'learning_rate': 1e-05, 'epoch': 0.68} 68%|██████▊ | 337/492 [3:03:18<1:24:03, 32.54s/it]predicted value: tensor([[0.6836], [0.8359], [0.8125], [0.7500], [1.0469], [0.7773], [1.0469], [0.8008], [0.6289], [0.7344], [0.7422], [0.4570], [0.4902], [0.4355], [0.4688], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.8555], [0.8008], [0.8008], [1.0000], [0.8008], [1.0000], [0.8008], [0.6016], [0.7500], [0.7500], [0.4004], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000934600830078125 loss: 0.000659942626953125 loss: 0.000682830810546875 loss: 0.00084686279296875 predicted value: tensor([[0.8633], [0.4863], [0.5195], [0.4746], [0.7852], [0.5234], [1.0547], [0.7461], [0.4570], [0.3340], [0.4238], [0.8047], [0.4922], [0.2461], [0.2715], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.4668], [0.4668], [0.8008], [0.4668], [1.0000], [0.8008], [0.3340], [0.2002], [0.4004], [0.6680], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012664794921875loss: 0.00052642822265625 loss: 0.0019989013671875 loss: 0.0019989013671875 predicted value: tensor([[0.4824], [0.6602], [0.5000], [0.4707], [0.4336], [1.0547], [0.2930], [1.0234], [0.6289], [1.0391], [0.3594], [0.5664], [1.0469], [0.5117], [0.2617], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.5547], [0.4668], [0.4668], [0.4668], [1.0000], [0.2500], [1.0000], [0.5000], [1.0000], [0.2500], [0.6016], [1.0000], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00159454345703125 loss: 0.001251220703125 loss: 0.000904083251953125 
loss: 0.0020294189453125 predicted value: tensor([[1.0469], [0.5195], [0.4980], [0.8008], [1.0234], [0.4805], [0.1416], [0.2969], [0.2930], [0.5234], [0.6914], [0.2910], [0.5117], [0.5391], [0.4531], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.4668], [0.6680], [1.0000], [0.4668], [0.0400], [0.2500], [0.2500], [0.4668], [0.6680], [0.2500], [0.7500], [0.5000], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00193023681640625loss: 0.0029449462890625 loss: 0.000591278076171875 loss: 0.001251220703125 69%|██████▊ | 338/492 [3:03:51<1:23:53, 32.69s/it] {'loss': 0.0054, 'learning_rate': 1e-05, 'epoch': 0.69} 69%|██████▊ | 338/492 [3:03:51<1:23:53, 32.69s/it]predicted value: tensor([[0.5156], [0.4219], [0.7188], [0.9883], [0.4102], [0.9492], [0.5312], [0.5352], [0.6367], [0.2217], [0.3789], [0.3789], [0.3594], [0.0728], [0.1748], [0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8008], [1.0000], [0.4668], [1.0000], [0.6016], [0.6016], [0.6016], [0.2500], [0.4004], [0.5000], [0.4004], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011138916015625 loss: 0.00072479248046875loss: 0.00128173828125 loss: 0.0010986328125 predicted value: tensor([[0.4824], [0.4297], [0.6172], [0.9883], [0.9766], [0.4922], [0.8008], [0.4297], [0.4355], [0.5391], [0.4922], [0.4629], [0.3828], [0.3848], [0.1846], [0.2207]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.6016], [1.0000], [1.0000], [0.5547], [0.8320], [0.4668], [0.4668], [0.5000], [0.3750], [0.5000], [0.4004], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0008087158203125 loss: 0.000553131103515625 loss: 0.000522613525390625 loss: 0.000568389892578125 predicted value: tensor([[0.4043], [0.4062], [0.9531], [0.4160], [0.2578], [0.4062], [0.4180], [0.4824], [0.7500], [0.6172], [0.5234], [0.3770], 
[0.2598], [0.3242], [0.2080], [0.1777]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.4668], [0.3340], [0.3750], [0.4668], [0.4668], [0.8008], [0.7500], [0.5000], [0.4668], [0.2500], [0.3340], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000797271728515625 loss: 0.000820159912109375 loss: 0.001220703125 loss: 0.000904083251953125 predicted value: tensor([[ 0.8047], [ 0.7266], [ 0.7461], [ 0.7305], [ 0.9805], [ 0.4023], [ 0.9766], [ 0.7266], [ 0.4297], [ 0.6992], [ 0.4062], [ 0.3965], [ 0.9531], [-0.0273], [ 0.1689], [ 0.2012]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8320], [0.8008], [0.8008], [1.0000], [0.3750], [1.0000], [0.8320], [0.3750], [0.7148], [0.5000], [0.4004], [1.0000], [0.0400], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00048828125 loss: 0.0012359619140625 loss: 0.00135040283203125 loss: 0.0008544921875 69%|██████▉ | 339/492 [3:04:24<1:23:30, 32.75s/it] {'loss': 0.0036, 'learning_rate': 1e-05, 'epoch': 0.69} 69%|██████▉ | 339/492 [3:04:24<1:23:30, 32.75s/it]predicted value: tensor([[0.9688], [0.9648], [0.9688], [0.5039], [0.6367], [0.7148], [0.1904], [0.4258], [0.9766], [0.9609], [0.4590], [0.3770], [0.3848], [0.4531], [0.1787], [0.1543]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [1.0000], [0.5547], [0.6680], [0.8008], [0.1670], [0.4668], [1.0000], [1.0000], [0.4668], [0.4004], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002532958984375 loss: 0.000385284423828125 loss: 0.003173828125 loss: 0.000782012939453125 predicted value: tensor([[0.4336], [0.7812], [0.4238], [0.4023], [0.2197], [0.5820], [0.2373], [0.9805], [0.5508], [0.2295], [0.3730], [0.6523], [0.1729], [0.4961], [0.1484], [0.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.4668], [0.4668], [0.2500], 
[0.6016], [0.2500], [1.0000], [0.6016], [0.2500], [0.4004], [0.7500], [0.4004], [0.7500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00060272216796875 loss: 0.002288818359375loss: 0.00099945068359375 loss: 0.00118255615234375 predicted value: tensor([[0.2256], [0.5156], [0.4375], [0.7305], [0.2344], [0.7148], [0.6133], [0.9766], [0.9727], [0.8281], [0.6484], [0.3574], [0.5742], [0.3965], [0.1592], [0.3750]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.5547], [0.4668], [0.8320], [0.2002], [0.6680], [0.6680], [1.0000], [1.0000], [0.8008], [0.6016], [0.4004], [0.6016], [0.4004], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001434326171875 loss: 0.00045013427734375 loss: 0.001800537109375 loss: 0.000499725341796875 predicted value: tensor([[0.9688], [0.9688], [0.7305], [0.7578], [0.6641], [0.1924], [0.4297], [0.3945], [0.5508], [0.9961], [0.5586], [0.4043], [0.2109], [0.3867], [0.1113], [0.1807]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.8320], [0.8008], [0.6680], [0.2500], [0.4668], [0.4668], [0.5000], [1.0000], [0.6016], [0.5000], [0.5000], [0.5000], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002197265625 loss: 0.003387451171875 loss: 0.0034027099609375loss: 0.0013427734375 69%|██████▉ | 340/492 [3:04:57<1:22:43, 32.66s/it] {'loss': 0.0066, 'learning_rate': 1e-05, 'epoch': 0.69} 69%|██████▉ | 340/492 [3:04:57<1:22:43, 32.66s/it]predicted value: tensor([[0.5625], [0.6523], [0.5195], [0.7656], [1.0547], [0.5078], [1.0469], [0.3203], [0.6914], [0.3105], [0.6367], [0.6523], [0.4668], [0.2285], [0.2520], [0.2451]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.4668], [0.8008], [1.0000], [0.4668], [1.0000], [0.2500], [0.5000], [0.2500], [0.5547], [0.6016], [0.5000], [0.2002], [0.2500], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115966796875 loss: 
0.00171661376953125loss: 0.00130462646484375 loss: 0.00191497802734375 predicted value: tensor([[1.0469], [0.8828], [1.0391], [0.4941], [0.4961], [0.5078], [0.2891], [1.0469], [1.0234], [1.0703], [0.4102], [0.7852], [0.6602], [0.2412], [0.2559], [0.2383]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [1.0000], [0.2500], [0.4668], [0.3750], [0.2500], [1.0000], [1.0000], [1.0000], [0.5000], [0.7500], [0.6016], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002166748046875 loss: 0.0017547607421875loss: 0.0028533935546875 loss: 0.0014190673828125 predicted value: tensor([[0.2949], [0.5273], [0.4980], [0.4902], [0.3242], [0.8516], [0.3535], [0.8555], [1.0547], [1.0234], [0.5938], [0.5117], [0.4355], [0.4727], [0.2227], [0.2227]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.4668], [0.4668], [0.4668], [0.2002], [0.8008], [0.6016], [0.8008], [1.0000], [1.0000], [0.6016], [0.4004], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001068115234375 loss: 0.00180816650390625loss: 0.0033416748046875 loss: 0.0025177001953125 predicted value: tensor([[0.6367], [0.5078], [1.0391], [0.8047], [0.8945], [0.5508], [1.0547], [1.0703], [0.8594], [0.6562], [0.6875], [0.4316], [0.7305], [0.2471], [0.2178], [0.2158]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.4668], [1.0000], [0.8008], [0.8008], [0.8008], [1.0000], [1.0000], [0.8008], [0.6016], [0.7500], [0.4004], [0.6016], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001129150390625 loss: 0.00177001953125loss: 0.0012054443359375 69%|██████▉ | 341/492 [3:05:29<1:22:06, 32.62s/it]loss: 0.000885009765625 {'loss': 0.007, 'learning_rate': 1e-05, 'epoch': 0.69} 69%|██████▉ | 341/492 [3:05:29<1:22:06, 32.62s/it]predicted value: tensor([[1.0312], [0.5039], [0.3262], [0.5078], [0.9297], [0.2793], [0.5117], [0.7266], 
[0.3652], [0.4863], [0.8555], [0.3379], [0.4395], [0.2393], [0.0260], [0.2334]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.2500], [0.3750], [0.8555], [0.2002], [0.3340], [0.6016], [0.3340], [0.4004], [0.6680], [0.2500], [0.5000], [0.2002], [0.0278], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00124359130859375 loss: 0.0022735595703125 loss: 0.000759124755859375 loss: 0.00136566162109375 predicted value: tensor([[0.5430], [0.7852], [0.7891], [0.4531], [0.5195], [1.0078], [0.5156], [0.6250], [0.3418], [0.8203], [0.6758], [0.8398], [0.4902], [0.4766], [0.1816], [0.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.8008], [0.4668], [0.4668], [1.0000], [0.3750], [0.6016], [0.2500], [0.6680], [0.6016], [0.8008], [0.5000], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0005645751953125 loss: 0.00124359130859375 loss: 0.0033721923828125 loss: 0.00121307373046875 predicted value: tensor([[0.8711], [1.0312], [1.0312], [0.4727], [0.8242], [0.3066], [0.4297], [0.2773], [0.7109], [0.5039], [0.6445], [0.3945], [0.4434], [0.4648], [0.2695], [0.2402]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [1.0000], [0.4668], [0.8008], [0.2500], [0.5000], [0.2002], [0.6016], [0.5000], [0.7500], [0.2852], [0.4004], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115966796875 loss: 0.00154876708984375 loss: 0.0026092529296875 loss: 0.0006866455078125 predicted value: tensor([[0.5977], [1.0312], [0.3633], [1.0156], [0.2734], [1.0312], [0.5078], [0.5195], [1.0391], [0.5000], [0.6250], [0.5430], [0.5195], [0.2891], [0.2129], [0.2061]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3340], [1.0000], [0.2002], [1.0000], [0.4668], [0.4668], [1.0000], [0.3750], [0.6016], [0.3750], [0.4668], [0.2500], [0.1670], [0.1426]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.0015716552734375 loss: 0.0015411376953125 loss: 0.0011138916015625 loss: 0.00086212158203125 70%|██████▉ | 342/492 [3:06:02<1:21:39, 32.66s/it] {'loss': 0.0058, 'learning_rate': 1e-05, 'epoch': 0.7} 70%|██████▉ | 342/492 [3:06:02<1:21:39, 32.66s/it]predicted value: tensor([[0.4922], [0.4492], [0.4883], [0.4277], [0.9648], [0.7695], [0.3984], [0.5469], [0.2754], [0.9531], [0.7188], [0.4102], [0.2012], [0.1445], [0.1680], [0.1650]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.5547], [0.3750], [1.0000], [0.8008], [0.4668], [0.3750], [0.3340], [1.0000], [0.7500], [0.5000], [0.2002], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000946044921875 loss: 0.000461578369140625 loss: 0.0011444091796875 loss: 0.00194549560546875 predicted value: tensor([[0.9609], [0.6797], [0.7969], [0.3477], [0.4219], [0.7656], [0.2139], [0.7383], [0.7227], [0.9570], [0.5508], [0.2715], [0.4727], [0.3555], [0.1855], [0.1455]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.6680], [0.2002], [0.4668], [0.8008], [0.2500], [0.6680], [0.7500], [1.0000], [0.6016], [0.2500], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00079345703125 loss: 0.001251220703125loss: 0.003082275390625 loss: 0.00109100341796875 predicted value: tensor([[0.4980], [0.9688], [0.4453], [0.7070], [0.9609], [0.2500], [0.6094], [0.2578], [0.9492], [0.9570], [0.1865], [0.3223], [0.4043], [0.5781], [0.1865], [0.1455]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.8008], [1.0000], [0.3340], [0.6016], [0.2500], [1.0000], [1.0000], [0.2002], [0.3340], [0.5000], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000621795654296875 loss: 0.00150299072265625 loss: 0.000759124755859375 loss: 0.0025482177734375 predicted value: 
tensor([[0.6992], [0.4004], [0.5352], [0.1768], [0.2314], [0.4238], [0.9648], [0.6641], [0.2910], [0.3164], [0.2471], [0.5742], [0.4297], [0.1816], [0.1611], [0.1367]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.4668], [0.5000], [0.1670], [0.2500], [0.4668], [1.0000], [0.6680], [0.6016], [0.4004], [0.2500], [0.6016], [0.5000], [0.2002], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00093841552734375 loss: 0.0025634765625 loss: 0.00142669677734375 loss: 0.00188446044921875 70%|██████▉ | 343/492 [3:06:35<1:21:09, 32.68s/it] {'loss': 0.0057, 'learning_rate': 1e-05, 'epoch': 0.7} 70%|██████▉ | 343/492 [3:06:35<1:21:09, 32.68s/it]predicted value: tensor([[0.9844], [0.5117], [0.5312], [0.5430], [0.5820], [0.3301], [0.2539], [0.5625], [0.2578], [0.9727], [0.4688], [0.3652], [0.5469], [0.5781], [0.3574], [0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.5547], [0.6016], [0.6016], [0.2500], [0.2500], [0.6016], [0.2500], [1.0000], [0.4668], [0.4004], [0.6016], [0.6016], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000461578369140625 loss: 0.0003414154052734375 loss: 0.0012664794921875 loss: 0.0015411376953125 predicted value: tensor([[0.7656], [0.9492], [0.4102], [0.2158], [0.4316], [0.2432], [0.2197], [0.6719], [0.7656], [0.2383], [0.4043], [0.2373], [0.3789], [0.1621], [0.1230], [0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.4668], [0.2500], [0.4668], [0.3340], [0.2002], [0.6680], [0.8008], [0.2500], [0.5000], [0.2500], [0.5000], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00131988525390625 loss: 0.0019683837890625loss: 0.00157928466796875 loss: 0.0007171630859375 predicted value: tensor([[0.4102], [0.8555], [0.7227], [0.9648], [0.0452], [0.9531], [0.4453], [0.6953], [0.9570], [0.5938], [0.4199], [0.6758], [0.3613], [0.1689], [0.1689], [0.1846]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.6250], [1.0000], [0.0400], [1.0000], [0.4668], [0.7500], [1.0000], [0.7500], [0.5000], [0.7500], [0.4004], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000698089599609375 loss: 0.00077056884765625 loss: 0.00092315673828125 loss: 0.00153350830078125 predicted value: tensor([[0.3984], [0.2432], [0.6797], [0.6992], [0.2949], [0.9492], [0.6680], [0.6836], [0.2158], [0.3320], [0.6953], [0.3379], [0.3789], [0.1963], [0.1709], [0.1553]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.6680], [0.6680], [0.3340], [1.0000], [0.6016], [0.6680], [0.5000], [0.4004], [0.6680], [0.4004], [0.4004], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00168609619140625 loss: 0.000514984130859375 loss: 0.0002994537353515625 loss: 0.000438690185546875 70%|██████▉ | 344/492 [3:07:08<1:20:41, 32.71s/it] {'loss': 0.004, 'learning_rate': 1e-05, 'epoch': 0.7} 70%|██████▉ | 344/492 [3:07:08<1:20:41, 32.71s/it]predicted value: tensor([[1.0391], [1.0469], [0.4727], [0.7617], [0.5117], [0.3418], [0.7031], [0.6484], [0.5391], [0.6992], [0.6719], [0.3340], [0.4258], [0.4434], [0.2246], [0.2217]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.7500], [0.4668], [0.2500], [0.6016], [0.6016], [0.4668], [0.6016], [0.5000], [0.3340], [0.3340], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013275146484375 loss: 0.00194549560546875loss: 0.0013275146484375 loss: 0.0009918212890625 predicted value: tensor([[0.4941], [0.4961], [0.3047], [0.6484], [0.6758], [0.7422], [1.0625], [0.7539], [0.7344], [0.5156], [0.6914], [0.4727], [0.4531], [0.4941], [0.5195], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.2500], [0.6016], [0.5000], [0.7500], [1.0000], [0.7500], [0.7500], 
[0.3750], [0.8008], [0.4004], [0.4004], [0.5000], [0.6680], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003204345703125 loss: 0.00171661376953125loss: 0.00201416015625 loss: 0.00141143798828125 predicted value: tensor([[0.4941], [1.0391], [0.5039], [0.8438], [0.2812], [0.5117], [0.4746], [1.0391], [0.7148], [1.0469], [0.3320], [0.3340], [0.4453], [0.6328], [0.4590], [0.2383]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [0.8008], [0.2002], [0.4668], [0.3750], [1.0000], [0.7500], [1.0000], [0.2500], [0.2500], [0.4004], [0.6016], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010528564453125 loss: 0.00106048583984375loss: 0.00091552734375 loss: 0.0016021728515625 predicted value: tensor([[0.4492], [0.5117], [1.0625], [0.5781], [1.0391], [1.0312], [1.0234], [0.4805], [0.7109], [0.5039], [0.7227], [0.2090], [0.4414], [0.4199], [0.4805], [0.2295]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [1.0000], [0.6680], [1.0000], [1.0000], [1.0000], [0.4668], [0.7500], [0.5000], [0.7500], [0.4004], [0.4004], [0.3340], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00124359130859375 loss: 0.002655029296875 loss: 0.00067138671875 loss: 0.00112152099609375 70%|███████ | 345/492 [3:07:40<1:20:14, 32.75s/it] {'loss': 0.0061, 'learning_rate': 1e-05, 'epoch': 0.7} 70%|███████ | 345/492 [3:07:40<1:20:14, 32.75s/it]predicted value: tensor([[1.0547], [0.4980], [0.2812], [1.0234], [0.8906], [0.6797], [1.0391], [0.6523], [1.0469], [0.6992], [1.0391], [0.7656], [0.4727], [0.2637], [0.2363], [0.2197]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.2002], [1.0000], [0.8320], [0.4668], [1.0000], [0.6016], [1.0000], [0.6016], [1.0000], [0.6680], [0.5000], [0.2500], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001251220703125 loss: 0.00110626220703125 loss: 
0.00148773193359375 loss: 0.00067138671875 predicted value: tensor([[0.6914], [0.5117], [0.5703], [1.0312], [0.4688], [0.3359], [0.3262], [0.6406], [0.5898], [0.2891], [0.3770], [0.4434], [0.4199], [0.2451], [0.2246], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.4668], [0.5547], [1.0000], [0.3750], [0.3340], [0.2500], [0.6016], [0.5547], [0.2500], [0.2500], [0.4004], [0.3340], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00104522705078125 loss: 0.000865936279296875 loss: 0.00063323974609375 loss: 0.000904083251953125 predicted value: tensor([[1.0469], [0.4863], [0.5625], [1.0547], [0.2773], [0.4688], [0.4180], [0.8438], [0.6367], [1.0391], [0.5664], [0.5977], [0.4824], [0.2676], [0.4336], [0.5039]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.5547], [1.0000], [0.2500], [0.3750], [0.3750], [0.8008], [0.6016], [1.0000], [0.6016], [0.5000], [0.4004], [0.2500], [0.4004], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00159454345703125 loss: 0.0010833740234375loss: 0.002349853515625 loss: 0.00176239013671875 predicted value: tensor([[0.4629], [1.0469], [0.8086], [0.4414], [1.0391], [0.5430], [0.7539], [0.7969], [0.6367], [0.3262], [0.4590], [0.2910], [0.7188], [0.2520], [0.2559], [0.2363]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8008], [0.3750], [1.0000], [0.6680], [0.7500], [0.8008], [0.6016], [0.2500], [0.4004], [0.2500], [0.7500], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023956298828125 loss: 0.00063323974609375 loss: 0.0009765625 loss: 0.000762939453125 70%|███████ | 346/492 [3:08:14<1:20:00, 32.88s/it] {'loss': 0.0049, 'learning_rate': 1e-05, 'epoch': 0.7} 70%|███████ | 346/492 [3:08:14<1:20:00, 32.88s/it]predicted value: tensor([[0.3672], [0.4121], [0.4746], [0.1348], [0.2617], [0.4199], [0.9727], [0.7852], [0.9531], [0.9727], 
[0.2891], [0.3945], [0.2734], [0.2217], [0.1885], [0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8008], [0.0625], [0.3340], [0.4668], [1.0000], [0.8320], [1.0000], [1.0000], [0.2500], [0.3340], [0.2500], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002044677734375 loss: 0.0022735595703125loss: 0.000820159912109375 loss: 0.00102996826171875 predicted value: tensor([[0.9766], [0.7422], [0.5273], [0.4180], [0.7031], [0.3477], [0.3066], [0.7578], [0.9727], [0.5391], [0.5430], [0.2695], [0.3730], [0.3281], [0.0630], [0.2354]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.5547], [0.4668], [0.6680], [0.3750], [0.3340], [0.8008], [1.0000], [0.5000], [0.6016], [0.2500], [0.4004], [0.3340], [0.0278], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000835418701171875 loss: 0.00032806396484375 loss: 0.00106048583984375 loss: 0.001373291015625 predicted value: tensor([[0.2695], [0.2080], [1.0000], [0.2148], [0.5508], [0.2539], [0.9531], [0.3867], [0.8242], [0.3379], [0.5742], [0.4492], [0.5586], [0.1992], [0.1514], [0.1836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.2500], [1.0000], [0.1670], [0.6016], [0.2500], [1.0000], [0.4668], [0.8320], [0.3340], [0.6016], [0.3750], [0.7500], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0004863739013671875 loss: 0.000965118408203125loss: 0.0005645751953125 loss: 0.00093841552734375 predicted value: tensor([[0.4004], [0.3926], [0.2441], [0.4062], [0.3945], [0.4902], [0.4121], [0.6797], [0.9609], [0.3926], [0.2695], [0.3184], [0.4258], [0.1953], [0.1875], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.2002], [0.4668], [0.4668], [0.3750], [0.4668], [0.7500], [1.0000], [0.5000], [0.2500], [0.3340], [0.4004], [0.1670], [0.2500], [0.1670]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.000885009765625 loss: 0.000728607177734375 loss: 0.000335693359375 loss: 0.00095367431640625 71%|███████ | 347/492 [3:08:46<1:19:25, 32.86s/it] {'loss': 0.0039, 'learning_rate': 1e-05, 'epoch': 0.71} 71%|███████ | 347/492 [3:08:46<1:19:25, 32.86s/it]predicted value: tensor([[0.4043], [0.4395], [0.4023], [0.4180], [0.6055], [0.9961], [0.9844], [0.9766], [0.2871], [0.9688], [1.0000], [0.9727], [0.4512], [0.4219], [0.2100], [0.3730]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.4668], [0.6680], [1.0000], [1.0000], [1.0000], [0.2500], [1.0000], [1.0000], [1.0000], [0.5000], [0.5000], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00070953369140625 loss: 0.0004425048828125 loss: 0.00102996826171875 loss: 0.002593994140625 predicted value: tensor([[0.4102], [0.9883], [0.7500], [0.6133], [0.3906], [0.2441], [0.9805], [0.6797], [0.7500], [0.2207], [0.5508], [0.3711], [0.3613], [0.3711], [0.4082], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8008], [0.6680], [0.4668], [0.2002], [1.0000], [0.7500], [0.8008], [0.2500], [0.7500], [0.4004], [0.5000], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015106201171875 loss: 0.00142669677734375loss: 0.00058746337890625 loss: 0.0011444091796875 predicted value: tensor([[0.3828], [0.9531], [0.4355], [0.9766], [0.2461], [0.6719], [0.5781], [0.2393], [0.6445], [0.9570], [0.9766], [0.5312], [0.3789], [0.1865], [0.3926], [0.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.4668], [1.0000], [0.3340], [0.8008], [0.6016], [0.2500], [0.5547], [1.0000], [1.0000], [0.6016], [0.4004], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0003910064697265625 loss: 0.000881195068359375loss: 0.000881195068359375 loss: 0.00106048583984375 predicted value: tensor([[0.7109], 
[0.6289], [0.5000], [0.4551], [0.2100], [0.6875], [0.4590], [0.2031], [0.7070], [0.4121], [0.9766], [0.3379], [0.4551], [0.3340], [0.2002], [0.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.5547], [0.8008], [0.4668], [0.2500], [0.8008], [0.5000], [0.2500], [0.8008], [0.4004], [1.0000], [0.2500], [0.5000], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010528564453125 loss: 0.0021209716796875 loss: 0.002166748046875 loss: 0.00167083740234375 71%|███████ | 348/492 [3:09:18<1:17:57, 32.48s/it] {'loss': 0.0049, 'learning_rate': 1e-05, 'epoch': 0.71} 71%|███████ | 348/492 [3:09:18<1:17:57, 32.48s/it]predicted value: tensor([[0.5781], [0.3008], [0.8828], [0.4727], [0.7383], [0.2598], [0.6133], [0.4902], [1.0703], [0.6797], [0.6172], [0.4648], [0.3105], [0.2539], [0.5508], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.2500], [0.8320], [0.4668], [0.7500], [0.2500], [0.5000], [0.4668], [1.0000], [0.6016], [0.4668], [0.4004], [0.0400], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00154876708984375 loss: 0.000782012939453125 loss: 0.00238037109375 loss: 0.0010223388671875 predicted value: tensor([[0.6016], [0.8438], [0.3398], [1.0469], [0.5000], [0.2793], [1.0234], [0.6328], [1.0234], [0.8008], [0.6133], [0.6484], [0.5195], [0.4375], [0.2148], [0.2402]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.3340], [1.0000], [0.4668], [0.1670], [1.0000], [0.6016], [1.0000], [0.6680], [0.6016], [0.6016], [0.4004], [0.4004], [0.0625], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002105712890625 loss: 0.00072479248046875 loss: 0.0012664794921875 loss: 0.0015106201171875 predicted value: tensor([[0.5586], [0.5586], [0.3750], [0.4922], [1.0625], [0.6602], [0.5586], [0.7656], [0.5781], [0.6523], [1.0312], [0.4336], [0.0957], [0.7930], [0.2617], [0.4102]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.3340], [0.4668], [1.0000], [0.6680], [0.8008], [0.8008], [0.5000], [0.7500], [1.0000], [0.4004], [0.0400], [0.8008], [0.0400], [0.2852]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009307861328125 loss: 0.0023651123046875loss: 0.000946044921875 loss: 0.0024261474609375 predicted value: tensor([[1.0547], [0.6445], [0.4648], [0.4492], [0.4746], [1.0547], [1.0391], [0.3652], [0.4883], [0.3438], [0.4844], [0.3438], [1.0547], [0.5078], [0.2617], [0.2441]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6172], [0.4668], [0.3750], [0.4668], [1.0000], [1.0000], [0.2500], [0.4004], [0.2002], [0.4004], [0.2500], [1.0000], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00148773193359375 loss: 0.00095367431640625 loss: 0.001373291015625 loss: 0.0014801025390625 71%|███████ | 349/492 [3:09:50<1:17:10, 32.38s/it] {'loss': 0.0058, 'learning_rate': 1e-05, 'epoch': 0.71} 71%|███████ | 349/492 [3:09:50<1:17:10, 32.38s/it]predicted value: tensor([[0.5781], [0.3281], [0.7969], [0.4707], [0.4668], [0.7617], [0.5742], [1.0547], [1.0469], [0.3027], [0.5625], [0.5820], [0.4219], [0.4453], [0.2285], [0.2480]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [0.8008], [0.3750], [0.4668], [0.8008], [0.5547], [1.0000], [1.0000], [0.2500], [0.5000], [0.6016], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000873565673828125 loss: 0.00171661376953125 loss: 0.0004253387451171875 loss: 0.00139617919921875 predicted value: tensor([[0.4648], [0.4434], [0.5664], [1.0625], [1.0234], [0.7539], [1.0547], [0.8281], [0.6875], [0.6719], [0.7109], [0.4395], [0.4219], [0.2305], [0.2451], [0.2314]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.5547], [1.0000], [1.0000], [0.8008], [1.0000], [0.8320], [0.7500], [0.6016], [0.6680], 
[0.4004], [0.3340], [0.1670], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024566650390625 loss: 0.000728607177734375loss: 0.00118255615234375 loss: 0.0016021728515625 predicted value: tensor([[0.4512], [0.5586], [0.8984], [0.5898], [0.7812], [1.0625], [1.0391], [0.7070], [0.5352], [0.6992], [0.3105], [0.5391], [1.0312], [0.2539], [0.2451], [0.2119]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4648], [0.8320], [0.5547], [0.8008], [1.0000], [1.0000], [0.5000], [0.5000], [0.6680], [0.2002], [0.6016], [1.0000], [0.2002], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00125885009765625 loss: 0.00091552734375 loss: 0.00146484375 loss: 0.00070953369140625 predicted value: tensor([[0.4766], [0.8164], [0.8945], [0.8008], [0.3438], [0.2949], [0.4570], [0.6055], [0.8242], [0.6172], [0.7227], [0.5195], [0.4648], [0.2256], [0.2305], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.8320], [0.8008], [0.3340], [0.2002], [0.3145], [0.6016], [0.8008], [0.5000], [0.8008], [0.5000], [0.5000], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00091552734375 loss: 0.000762939453125 loss: 0.0020599365234375 loss: 0.00089263916015625 71%|███████ | 350/492 [3:10:22<1:16:24, 32.28s/it] {'loss': 0.0048, 'learning_rate': 1e-05, 'epoch': 0.71} 71%|███████ | 350/492 [3:10:22<1:16:24, 32.28s/it]predicted value: tensor([[0.4180], [0.5977], [0.7188], [0.4355], [0.4199], [0.9805], [0.2793], [0.2617], [0.4785], [0.9570], [0.6758], [0.4082], [0.2812], [0.1621], [0.1816], [0.1729]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.6680], [0.8008], [0.4668], [0.4668], [1.0000], [0.2002], [0.2500], [0.5000], [1.0000], [0.7500], [0.4004], [0.2500], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001068115234375 loss: 0.000751495361328125 loss: 0.000720977783203125 loss: 
0.00086212158203125 predicted value: tensor([[0.4199], [0.7539], [0.6992], [0.7656], [0.4355], [0.6211], [0.4297], [0.2266], [0.5469], [0.9727], [0.2637], [0.3008], [0.3789], [0.4258], [0.1924], [0.1680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.6680], [0.8008], [0.4668], [0.8008], [0.4668], [0.2500], [0.6016], [1.0000], [0.3340], [0.3340], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015869140625 loss: 0.0009765625 loss: 0.00148773193359375 loss: 0.001312255859375 predicted value: tensor([[0.3672], [0.7500], [0.4180], [0.3887], [0.9844], [0.3906], [0.5781], [0.7109], [0.4902], [0.2432], [0.2578], [0.2891], [0.4180], [0.4102], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.4668], [0.4668], [1.0000], [0.3750], [0.6016], [0.8008], [0.5000], [0.2002], [0.2500], [0.2500], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00128936767578125 loss: 0.0003948211669921875loss: 0.00139617919921875 loss: 0.0004215240478515625 predicted value: tensor([[0.4043], [0.9727], [1.0000], [0.2178], [0.7383], [0.3750], [0.2217], [0.6641], [0.9883], [0.2139], [0.3125], [0.3457], [0.1689], [0.1709], [0.1855], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [1.0000], [0.2500], [0.8008], [0.3750], [0.2500], [0.8008], [1.0000], [0.2500], [0.2500], [0.3340], [0.2002], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002044677734375 loss: 0.0032958984375 loss: 0.000652313232421875 loss: 0.0005340576171875 71%|███████▏ | 351/492 [3:10:54<1:15:18, 32.05s/it] {'loss': 0.0047, 'learning_rate': 1e-05, 'epoch': 0.71} 71%|███████▏ | 351/492 [3:10:54<1:15:18, 32.05s/it]predicted value: tensor([[0.4277], [0.8125], [0.4375], [0.6836], [0.5508], [0.7500], [0.2871], [0.5586], [0.2773], [0.4551], [0.9570], [0.4277], [0.4785], 
[0.3730], [0.3477], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.4668], [0.7500], [0.4648], [0.8008], [0.3340], [0.7500], [0.2500], [0.5000], [1.0000], [0.4004], [0.5000], [0.4004], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010528564453125 loss: 0.0013427734375loss: 0.000537872314453125 loss: 0.00109100341796875 predicted value: tensor([[0.5156], [0.5078], [0.7578], [0.4062], [0.2178], [0.3105], [0.3848], [0.9609], [0.9766], [0.4883], [0.3926], [0.5742], [0.3164], [0.3809], [0.1748], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.8008], [0.4668], [0.2002], [0.2002], [0.3750], [1.0000], [1.0000], [0.6016], [0.4004], [0.7500], [0.3340], [0.4004], [0.2500], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000797271728515625 loss: 0.00052642822265625 loss: 0.003448486328125 loss: 0.002838134765625 predicted value: tensor([[1.0000], [0.9922], [0.2402], [0.2129], [0.4004], [0.9844], [0.3086], [0.4766], [1.0000], [0.5312], [0.2432], [0.7031], [0.4395], [0.3457], [0.1660], [0.1836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.2002], [0.2002], [0.3750], [1.0000], [0.3340], [0.4668], [1.0000], [0.6016], [0.3340], [0.7500], [0.5000], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010528564453125 loss: 0.0004405975341796875loss: 0.0013427734375 loss: 0.000850677490234375 predicted value: tensor([[0.5312], [0.5078], [0.6758], [0.7070], [0.9922], [0.9961], [1.0078], [0.4375], [0.6914], [0.4062], [0.2305], [0.2617], [0.5898], [0.1582], [0.3184], [0.1836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.6680], [0.6680], [1.0000], [1.0000], [1.0000], [0.6016], [0.8008], [0.4668], [0.2002], [0.2500], [0.5000], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000701904296875 loss: 
0.000484466552734375 loss: 0.0010528564453125 loss: 0.00101470947265625 72%|███████▏ | 352/492 [3:11:25<1:14:23, 31.89s/it] {'loss': 0.0046, 'learning_rate': 1e-05, 'epoch': 0.72} 72%|███████▏ | 352/492 [3:11:25<1:14:23, 31.89s/it]predicted value: tensor([[0.5898], [1.0547], [0.5117], [0.4746], [1.0703], [0.6602], [0.8008], [0.8164], [0.5625], [0.3594], [0.8594], [0.3203], [0.4238], [0.4102], [0.2412], [0.5391]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.4668], [1.0000], [0.6016], [0.4668], [0.4668], [0.4668], [0.3340], [0.8008], [0.2002], [0.4004], [0.3340], [0.2002], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0032958984375 loss: 0.00445556640625 loss: 0.00164031982421875 loss: 0.00592041015625 predicted value: tensor([[0.3320], [0.3027], [0.6211], [0.5078], [1.0547], [0.7031], [0.4395], [0.6562], [0.4785], [0.2773], [1.0469], [0.3516], [1.0312], [0.1689], [0.2227], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.3340], [0.4668], [0.4668], [1.0000], [0.6016], [0.3750], [0.6016], [0.3750], [0.2500], [1.0000], [0.3340], [1.0000], [0.0400], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009918212890625 loss: 0.0013427734375loss: 0.0026397705078125 loss: 0.0016632080078125 predicted value: tensor([[0.8867], [0.8789], [1.0547], [0.4746], [0.6797], [0.4844], [0.4570], [0.6680], [0.4180], [0.8242], [0.4668], [0.4629], [0.4766], [0.4531], [0.2578], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [1.0000], [0.4668], [0.6016], [0.3750], [0.3750], [0.6016], [0.3340], [0.8008], [0.5000], [0.5000], [0.5000], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016326904296875 loss: 0.0009918212890625 loss: 0.002288818359375 loss: 0.00063323974609375 predicted value: tensor([[0.4941], [0.4922], [0.4961], [0.4844], [0.5781], [0.5664], [0.5312], [0.6602], 
[0.4512], [0.7500], [0.6484], [0.5273], [0.4434], [0.4492], [0.2275], [0.2354]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.4668], [0.6016], [0.5000], [0.5000], [0.5000], [0.6016], [0.6016], [0.6016], [0.2500], [0.4004], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0004673004150390625 loss: 0.000606536865234375 loss: 0.002716064453125 loss: 0.0012969970703125 72%|███████▏ | 353/492 [3:11:57<1:13:26, 31.70s/it] {'loss': 0.0081, 'learning_rate': 1e-05, 'epoch': 0.72} 72%|███████▏ | 353/492 [3:11:57<1:13:26, 31.70s/it]predicted value: tensor([[0.6250], [1.0391], [0.6055], [0.3164], [0.4727], [0.5039], [0.6680], [0.5469], [0.3438], [0.6055], [0.5859], [0.7031], [0.4668], [0.2500], [0.4785], [0.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [1.0000], [0.3750], [0.2500], [0.4668], [0.4668], [0.4668], [0.4668], [0.2500], [0.6016], [0.6016], [0.6016], [0.4004], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00110626220703125 loss: 0.002197265625 loss: 0.001678466796875 loss: 0.0020599365234375 predicted value: tensor([[0.4531], [0.7070], [0.8672], [0.7773], [1.0391], [0.8398], [0.1523], [0.6680], [0.7148], [0.4922], [0.3730], [0.4238], [0.2793], [0.2324], [0.2715], [0.2334]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.6016], [0.8320], [0.7500], [1.0000], [0.8008], [0.0278], [0.6016], [0.5000], [0.4004], [0.2500], [0.5000], [0.2002], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00150299072265625 loss: 0.001800537109375 loss: 0.0020294189453125 loss: 0.00194549560546875 predicted value: tensor([[0.5820], [1.0391], [1.0312], [0.3203], [1.0625], [0.5508], [0.3125], [0.7695], [0.2871], [0.5273], [0.3633], [0.3965], [0.4629], [1.0391], [0.2656], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], 
[1.0000], [0.2500], [1.0000], [0.3750], [0.3340], [0.8008], [0.2500], [0.5000], [0.3340], [0.3340], [0.4004], [1.0000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115966796875 loss: 0.000823974609375loss: 0.00095367431640625 loss: 0.0022125244140625 predicted value: tensor([[0.5820], [0.5000], [0.4688], [0.4707], [0.8125], [0.2041], [0.7266], [0.2910], [1.0781], [0.2930], [0.6367], [0.4668], [0.2988], [0.2305], [0.0742], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.4668], [0.4668], [0.2002], [0.6680], [0.2500], [1.0000], [0.2002], [0.6016], [0.3340], [0.0625], [0.2002], [0.0625], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000782012939453125 loss: 0.000934600830078125 loss: 0.00165557861328125 loss: 0.003387451171875 72%|███████▏ | 354/492 [3:12:28<1:12:48, 31.65s/it] {'loss': 0.0066, 'learning_rate': 1e-05, 'epoch': 0.72} 72%|███████▏ | 354/492 [3:12:28<1:12:48, 31.65s/it]predicted value: tensor([[0.5469], [0.3848], [0.9688], [0.3008], [0.7109], [0.4082], [0.5430], [0.6641], [0.6875], [0.6875], [0.5312], [0.6094], [0.3965], [0.1699], [0.1768], [0.1826]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [1.0000], [0.2500], [0.6680], [0.4668], [0.5547], [0.7500], [0.6016], [0.8008], [0.6016], [0.6016], [0.3340], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00130462646484375 loss: 0.0008087158203125 loss: 0.00130462646484375 loss: 0.00084686279296875 predicted value: tensor([[0.9766], [0.2266], [0.4355], [0.4199], [0.7031], [0.4414], [0.6562], [0.9883], [0.9688], [0.9766], [0.5547], [0.4238], [0.4570], [0.4004], [0.1846], [0.1621]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.4668], [0.4668], [0.8008], [0.4668], [0.6016], [1.0000], [1.0000], [1.0000], [0.5000], [0.5000], [0.5000], [0.5000], [0.2002], [0.2002]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.00150299072265625 loss: 0.000644683837890625loss: 0.0029296875 loss: 0.000629425048828125 predicted value: tensor([[0.4883], [0.4355], [0.4219], [0.7930], [0.4062], [0.8164], [0.7227], [0.9766], [0.4707], [0.1680], [0.4922], [0.3535], [0.3809], [0.3477], [0.4023], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2832], [0.4668], [0.4668], [0.8008], [0.3750], [0.8320], [0.8008], [1.0000], [0.4668], [0.2500], [0.5000], [0.3340], [0.5000], [0.3340], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00092315673828125 loss: 0.0013580322265625 loss: 0.0004596710205078125 loss: 0.0023956298828125 predicted value: tensor([[0.2617], [0.4258], [0.4297], [0.8008], [0.8008], [0.3750], [0.2188], [0.2715], [0.2578], [0.3438], [0.9844], [0.3145], [0.3691], [0.3906], [0.1846], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.3750], [0.4668], [0.8008], [0.8320], [0.4668], [0.3340], [0.2002], [0.3340], [0.4668], [1.0000], [0.3340], [0.4004], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0001697540283203125 loss: 0.00128936767578125 loss: 0.000946044921875 loss: 0.00079345703125 72%|███████▏ | 355/492 [3:12:59<1:11:51, 31.47s/it] {'loss': 0.0046, 'learning_rate': 1e-05, 'epoch': 0.72} 72%|███████▏ | 355/492 [3:12:59<1:11:51, 31.47s/it]predicted value: tensor([[1.0000], [0.4336], [0.4883], [0.4043], [0.6016], [0.7695], [0.7344], [0.6758], [0.7031], [0.3828], [0.9570], [0.4160], [0.2021], [0.1904], [0.1738], [0.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4004], [0.3750], [0.4668], [0.8008], [0.7500], [0.6680], [0.8008], [0.2500], [1.0000], [0.4004], [0.2500], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00054931640625 loss: 0.00101470947265625 loss: 0.0004215240478515625 loss: 0.0012054443359375 predicted value: tensor([[0.3887], [0.4492], 
[0.1914], [0.2227], [0.9570], [0.2471], [0.4785], [0.3574], [0.6016], [0.7227], [0.4453], [0.4258], [0.4102], [0.4336], [0.1943], [0.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.3340], [0.2500], [1.0000], [0.2500], [0.8008], [0.4668], [0.5000], [0.7500], [0.5000], [0.4004], [0.4004], [0.4004], [0.2500], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00189971923828125 loss: 0.000453948974609375 loss: 0.0025177001953125 loss: 0.0017242431640625 predicted value: tensor([[0.9766], [0.4414], [0.6641], [0.9766], [0.9609], [0.6523], [0.9492], [0.4512], [0.4316], [0.3594], [0.4355], [0.4648], [0.3535], [0.1729], [0.1885], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.7500], [1.0000], [1.0000], [0.5000], [1.0000], [0.4668], [0.4668], [0.3340], [0.5000], [0.5000], [0.4004], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026702880859375 loss: 0.00072479248046875 loss: 0.0013427734375 loss: 0.0013885498046875 predicted value: tensor([[0.9727], [0.1973], [0.7344], [0.4238], [0.1602], [0.4199], [0.5703], [0.6328], [0.7305], [0.6016], [0.2207], [0.7266], [0.3359], [0.2021], [0.1875], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.8008], [0.3750], [0.2500], [0.4668], [0.6016], [0.6016], [0.8008], [0.5000], [0.3340], [0.7500], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013580322265625 loss: 0.00173187255859375 loss: 0.00091552734375loss: 0.000835418701171875 72%|███████▏ | 356/492 [3:13:31<1:11:31, 31.55s/it] {'loss': 0.0052, 'learning_rate': 1e-05, 'epoch': 0.72} 72%|███████▏ | 356/492 [3:13:31<1:11:31, 31.55s/it]predicted value: tensor([[0.6133], [0.2969], [1.0312], [0.5273], [0.4707], [0.5039], [0.7891], [0.6719], [0.4707], [1.0391], [0.5898], [0.2598], [0.4238], [0.7422], [0.2500], [0.2295]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [1.0000], [0.4668], [0.4668], [0.4668], [0.8008], [0.7500], [0.3750], [1.0000], [0.6016], [0.0625], [0.3340], [0.7500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021209716796875 loss: 0.00121307373046875loss: 0.00131988525390625 loss: 0.0033416748046875 predicted value: tensor([[0.6445], [0.4766], [0.4746], [0.8672], [0.5352], [1.0469], [1.0625], [0.6055], [0.7695], [0.6719], [0.1455], [0.4727], [0.4648], [0.4648], [0.3828], [0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.3750], [0.3750], [0.8008], [0.4668], [1.0000], [1.0000], [0.6016], [0.7500], [0.6016], [0.0625], [0.3340], [0.4004], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00106048583984375 loss: 0.00125885009765625 loss: 0.00170135498046875 loss: 0.0018310546875 predicted value: tensor([[0.5195], [0.3867], [0.8164], [0.2471], [0.4863], [0.3047], [0.3086], [0.6562], [0.2539], [0.7852], [1.0391], [0.6250], [0.4023], [0.2656], [0.2441], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3340], [0.6680], [0.2500], [0.3750], [0.3340], [0.2500], [0.6016], [0.2500], [0.6680], [1.0000], [0.5000], [0.3340], [0.1670], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001220703125 loss: 0.00180816650390625 loss: 0.00153350830078125 loss: 0.002288818359375 predicted value: tensor([[0.6367], [0.8594], [0.5117], [1.0469], [0.2715], [0.7734], [0.6328], [0.8320], [0.4805], [0.6055], [0.6289], [0.7617], [0.3125], [0.2832], [0.2676], [0.2393]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.8320], [0.3145], [1.0000], [0.2500], [0.6680], [0.6016], [0.8008], [0.4668], [0.6016], [0.6016], [0.7500], [0.2500], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003204345703125 loss: 0.00250244140625 loss: 0.001800537109375 loss: 
0.0009918212890625 73%|███████▎ | 357/492 [3:14:02<1:10:57, 31.54s/it] {'loss': 0.0073, 'learning_rate': 1e-05, 'epoch': 0.73} 73%|███████▎ | 357/492 [3:14:02<1:10:57, 31.54s/it]predicted value: tensor([[0.6172], [0.3047], [0.6094], [1.0234], [1.0156], [0.3027], [0.3105], [0.6250], [0.8047], [0.8281], [0.4238], [0.6875], [0.2480], [0.4688], [0.2275], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.3340], [0.5547], [1.0000], [1.0000], [0.3340], [0.2002], [0.5000], [0.8008], [0.8320], [0.4004], [0.7500], [0.2002], [0.4004], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00179290771484375 loss: 0.00089263916015625loss: 0.00092315673828125 loss: 0.00124359130859375 predicted value: tensor([[1.0391], [0.4551], [0.4863], [0.5469], [0.3457], [0.6406], [1.0000], [0.8086], [0.8789], [1.0547], [0.4512], [0.2373], [0.6562], [0.2617], [0.2617], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.3145], [0.3750], [0.3340], [0.6016], [1.0000], [0.3750], [0.8320], [1.0000], [0.5000], [0.2002], [0.7500], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00083160400390625 loss: 0.0045166015625loss: 0.000843048095703125 loss: 0.000946044921875 predicted value: tensor([[0.8711], [0.2949], [1.0312], [1.0312], [1.0391], [0.2891], [0.5352], [0.6406], [0.7461], [0.8242], [0.6875], [0.4375], [0.4824], [0.2754], [0.2539], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [1.0000], [1.0000], [1.0000], [0.2500], [0.3750], [0.6016], [0.6680], [0.8008], [0.8008], [0.5000], [0.5000], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0017852783203125 loss: 0.00116729736328125loss: 0.0022735595703125 loss: 0.000789642333984375 predicted value: tensor([[0.8281], [0.4688], [0.6289], [0.7500], [0.2598], [1.0234], [0.3125], [1.0234], [0.7617], [0.6641], [0.6562], [0.5078], 
[0.4648], [0.4551], [0.2656], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.5547], [0.7500], [0.1670], [1.0000], [0.3340], [1.0000], [0.6016], [0.6016], [0.6016], [0.5000], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000827789306640625 loss: 0.00157928466796875 loss: 0.00139617919921875 loss: 0.00107574462890625 73%|███████▎ | 358/492 [3:14:34<1:10:13, 31.44s/it] {'loss': 0.0057, 'learning_rate': 1e-05, 'epoch': 0.73} 73%|███████▎ | 358/492 [3:14:34<1:10:13, 31.44s/it]predicted value: tensor([[0.5625], [0.9609], [0.9805], [0.2412], [0.4727], [0.5078], [0.9570], [0.4336], [0.4043], [0.9648], [0.3516], [0.4434], [0.5898], [0.2227], [0.1904], [0.2148]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7344], [1.0000], [1.0000], [0.2500], [0.6016], [0.6016], [1.0000], [0.4668], [0.4004], [1.0000], [0.4004], [0.4004], [0.7500], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000537872314453125 loss: 0.0014495849609375 loss: 0.00072479248046875 loss: 0.00109100341796875 predicted value: tensor([[0.4902], [0.9648], [0.2324], [0.2432], [0.9688], [0.2373], [0.9766], [0.5625], [0.1543], [0.3926], [0.4355], [0.3750], [0.3379], [0.1807], [0.4707], [0.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.2500], [0.3340], [1.0000], [0.2500], [1.0000], [0.5000], [0.2002], [0.4004], [0.4004], [0.3340], [0.4004], [0.1670], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00045013427734375 loss: 0.00052642822265625 loss: 0.0005035400390625 loss: 0.00048828125 predicted value: tensor([[0.5430], [0.7539], [0.1328], [0.7812], [0.9766], [0.9531], [0.7656], [0.4473], [0.5195], [0.6680], [0.3516], [0.4102], [0.9688], [0.1973], [0.1953], [0.2002]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.2002], [0.8320], [1.0000], 
[1.0000], [0.8008], [0.4668], [0.5000], [0.8008], [0.4004], [0.5000], [1.0000], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00124359130859375 loss: 0.000797271728515625loss: 0.0037841796875 loss: 0.0011138916015625 predicted value: tensor([[0.5117], [0.4258], [0.4258], [0.9570], [0.4180], [0.2051], [0.9609], [0.4668], [0.6602], [0.5078], [0.9688], [0.4082], [0.3750], [0.4668], [0.2148], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [1.0000], [0.3750], [0.2002], [1.0000], [0.5000], [0.7500], [0.6016], [1.0000], [0.5000], [0.4004], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003997802734375 loss: 0.000682830810546875 loss: 0.000579833984375 loss: 0.001739501953125 73%|███████▎ | 359/492 [3:15:06<1:10:18, 31.72s/it] {'loss': 0.0049, 'learning_rate': 1e-05, 'epoch': 0.73} 73%|███████▎ | 359/492 [3:15:06<1:10:18, 31.72s/it]predicted value: tensor([[0.7617], [0.2539], [0.9570], [0.4238], [0.7734], [0.7422], [0.2383], [0.7344], [0.5195], [0.2637], [0.4707], [0.3633], [0.3516], [0.2090], [0.2119], [0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2002], [1.0000], [0.4668], [0.8008], [0.8008], [0.2002], [0.8008], [0.5547], [0.2500], [0.6016], [0.4004], [0.3340], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00183868408203125 loss: 0.000835418701171875 loss: 0.0006866455078125 loss: 0.0003032684326171875 predicted value: tensor([[0.4297], [0.7500], [0.9805], [0.7969], [0.5273], [0.2432], [0.5234], [0.4609], [0.2236], [0.3457], [0.0820], [0.4297], [0.4688], [0.3613], [0.2246], [0.3242]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.7773], [1.0000], [0.8008], [0.5547], [0.2002], [0.5547], [0.4668], [0.2500], [0.4004], [0.0625], [0.3750], [0.6016], [0.4004], [0.1670], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.0019989013671875 loss: 0.00103759765625loss: 0.000804901123046875 loss: 0.00106048583984375 predicted value: tensor([[0.5352], [0.2676], [0.9805], [0.7500], [0.5508], [0.6367], [0.2012], [0.9805], [0.6016], [0.4336], [0.5273], [0.2754], [0.5938], [0.2070], [0.1914], [0.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [1.0000], [0.8008], [0.5547], [0.8008], [0.2002], [1.0000], [0.7500], [0.4668], [0.6016], [0.3340], [0.6016], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001190185546875 loss: 0.00102996826171875 loss: 0.00360107421875 loss: 0.000885009765625 predicted value: tensor([[0.7969], [0.4219], [0.9688], [0.6992], [0.7461], [0.7656], [0.9648], [0.5469], [0.5508], [0.5625], [0.6797], [0.7891], [0.5703], [0.2178], [0.3926], [0.1982]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [1.0000], [0.7148], [0.8008], [0.8008], [1.0000], [0.6016], [0.5000], [0.5000], [0.7500], [0.8008], [0.6016], [0.2002], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00125885009765625 loss: 0.00052642822265625 loss: 0.000446319580078125 loss: 0.002197265625 73%|███████▎ | 360/492 [3:15:37<1:09:28, 31.58s/it] {'loss': 0.0049, 'learning_rate': 1e-05, 'epoch': 0.73} 73%|███████▎ | 360/492 [3:15:37<1:09:28, 31.58s/it]predicted value: tensor([[0.6211], [0.4902], [0.5117], [1.0469], [1.0391], [0.5352], [1.0547], [0.5391], [1.0312], [0.5977], [0.4609], [0.4766], [0.4570], [0.4824], [0.2617], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [1.0000], [1.0000], [0.4668], [1.0000], [0.4668], [1.0000], [0.6016], [0.3340], [0.5000], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012664794921875 loss: 0.00086212158203125 loss: 0.0020751953125 loss: 0.0012664794921875 predicted value: tensor([[0.8750], [0.5195], [0.5391], [0.5352], [0.6328], 
[1.0391], [0.2969], [1.0547], [0.6758], [0.8125], [0.4355], [1.0625], [0.3594], [0.1592], [0.2734], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.4668], [0.4668], [0.5547], [1.0000], [0.2500], [1.0000], [0.7500], [0.8008], [0.2715], [1.0000], [0.3340], [0.0625], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004364013671875 loss: 0.00140380859375 loss: 0.00201416015625 loss: 0.00099945068359375 predicted value: tensor([[0.8438], [0.4629], [1.0469], [0.7148], [0.3672], [0.3672], [1.0391], [0.6484], [0.4551], [1.0391], [0.2852], [0.1455], [0.4668], [0.3594], [0.4453], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.3145], [1.0000], [0.6016], [0.2002], [0.2500], [1.0000], [0.7500], [0.7500], [1.0000], [0.2500], [0.0625], [0.5000], [0.2500], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00164031982421875 loss: 0.003448486328125loss: 0.00146484375 loss: 0.001312255859375 predicted value: tensor([[0.7969], [0.3105], [0.8555], [0.6250], [1.0312], [0.7148], [0.7812], [0.6328], [0.6445], [0.6328], [0.3262], [0.3555], [0.4414], [0.3711], [0.4492], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.2002], [0.8320], [0.5547], [1.0000], [0.6680], [0.8008], [0.6016], [0.7500], [0.6016], [0.2500], [0.2002], [0.4004], [0.2500], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001983642578125 loss: 0.002166748046875 loss: 0.00157928466796875 loss: 0.00151824951171875 73%|███████▎ | 361/492 [3:16:09<1:09:11, 31.69s/it] {'loss': 0.0073, 'learning_rate': 1e-05, 'epoch': 0.73} 73%|███████▎ | 361/492 [3:16:09<1:09:11, 31.69s/it]predicted value: tensor([[0.6445], [0.4863], [0.2695], [0.7773], [0.8320], [0.8359], [0.4883], [1.0234], [0.5820], [0.4219], [0.4980], [0.2910], [0.4727], [0.4688], [0.2471], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.6172], [0.3750], [0.2002], [0.6680], [0.8008], [0.8320], [0.4668], [1.0000], [0.6016], [0.3340], [0.5000], [0.4004], [0.5000], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010986328125 loss: 0.00115966796875loss: 0.0015411376953125 loss: 0.0014801025390625 predicted value: tensor([[0.8516], [0.7578], [1.0469], [1.0312], [0.7734], [0.6562], [0.5000], [0.6367], [0.5703], [0.2217], [0.2129], [0.3926], [0.3984], [0.2812], [0.2793], [0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6680], [1.0000], [1.0000], [0.6680], [0.6016], [0.4668], [0.6680], [0.5000], [0.0400], [0.2002], [0.4004], [0.4004], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000972747802734375 loss: 0.0012359619140625 loss: 0.001190185546875 loss: 0.0025482177734375 predicted value: tensor([[0.2969], [1.0469], [0.8359], [0.5000], [1.0234], [0.8164], [0.4883], [0.4629], [1.0234], [0.5430], [0.5195], [0.4316], [0.4648], [0.5039], [0.2734], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.1670], [1.0000], [0.8008], [0.4668], [1.0000], [0.8008], [0.3750], [0.3750], [1.0000], [0.5000], [0.4668], [0.5000], [0.3340], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00109100341796875 loss: 0.00145721435546875 loss: 0.00164031982421875 loss: 0.00135040283203125 predicted value: tensor([[0.7734], [0.5312], [0.3809], [1.0781], [0.8008], [1.0469], [0.5352], [0.6367], [0.6172], [0.1768], [0.6953], [0.4121], [0.4258], [0.3438], [0.2451], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.4668], [0.2500], [1.0000], [0.7500], [1.0000], [0.4668], [0.6016], [0.5000], [0.0625], [0.6016], [0.3340], [0.5000], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00183868408203125 loss: 0.00164794921875 loss: 0.000858306884765625 loss: 0.00179290771484375 74%|███████▎ | 362/492 
[3:16:40<1:08:20, 31.54s/it] {'loss': 0.0057, 'learning_rate': 1e-05, 'epoch': 0.74} 74%|███████▎ | 362/492 [3:16:40<1:08:20, 31.54s/it]predicted value: tensor([[0.4863], [0.4492], [0.4297], [0.4395], [0.6133], [0.4043], [0.5977], [0.9727], [0.5898], [0.7266], [0.4941], [0.3730], [0.3164], [0.3750], [0.4102], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.4668], [0.6016], [0.4668], [0.6016], [1.0000], [0.6016], [0.8008], [0.7500], [0.4004], [0.4004], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0003376007080078125 loss: 0.00136566162109375 loss: 0.002410888671875 loss: 0.000537872314453125 predicted value: tensor([[0.9922], [0.9844], [0.9727], [0.4160], [0.7266], [0.4219], [0.4141], [0.6836], [0.6055], [0.7227], [0.4238], [0.2910], [0.4141], [0.0703], [0.2148], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [1.0000], [0.4668], [0.6680], [0.4668], [0.4668], [0.5703], [0.6016], [0.8008], [0.3750], [0.2500], [0.4004], [0.0625], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000972747802734375 loss: 0.0003757476806640625 loss: 0.000553131103515625 loss: 0.00136566162109375 predicted value: tensor([[0.7578], [0.7969], [0.4062], [0.9727], [0.3691], [0.4062], [0.9648], [0.5391], [0.9844], [0.2891], [0.5508], [0.6641], [0.7109], [0.3691], [0.1318], [0.2051]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [0.3750], [1.0000], [0.4277], [0.4668], [1.0000], [0.5000], [1.0000], [0.6016], [0.7500], [0.6680], [0.7500], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000659942626953125 loss: 0.00457763671875loss: 0.001983642578125 loss: 0.00135040283203125 predicted value: tensor([[0.7773], [0.4355], [0.4551], [0.4258], [0.4180], [0.5742], [0.9648], [0.0806], [0.9961], [0.4355], [0.4492], [0.4453], [0.2100], [0.3320], [0.3750], 
[0.1914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3750], [0.4668], [0.3145], [0.4668], [0.7500], [1.0000], [0.0625], [1.0000], [0.4668], [0.4004], [0.4004], [0.1670], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00101470947265625 loss: 0.006988525390625 loss: 0.0013275146484375loss: 0.00213623046875 74%|███████▍ | 363/492 [3:17:13<1:08:17, 31.76s/it] {'loss': 0.007, 'learning_rate': 1e-05, 'epoch': 0.74} 74%|███████▍ | 363/492 [3:17:13<1:08:17, 31.76s/it]predicted value: tensor([[ 0.9609], [ 0.5508], [ 0.2500], [ 0.4258], [ 0.9805], [ 0.6445], [ 0.9688], [ 0.6562], [ 0.9805], [ 0.9688], [ 0.2617], [ 0.4082], [ 0.5234], [ 0.1738], [-0.0476], [ 0.2070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.2500], [0.4668], [1.0000], [0.6016], [1.0000], [0.7500], [1.0000], [1.0000], [0.2500], [0.4004], [0.6016], [0.1670], [0.0156], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015869140625 loss: 0.000812530517578125 loss: 0.00045013427734375 loss: 0.002349853515625 predicted value: tensor([[0.7930], [0.2695], [0.9727], [0.2773], [0.4023], [0.5547], [0.6680], [0.2656], [0.6641], [0.9883], [0.7266], [0.3828], [0.3691], [0.1875], [0.3828], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [1.0000], [0.3340], [0.4668], [0.5547], [0.6680], [0.2002], [0.7500], [1.0000], [0.8008], [0.4004], [0.4004], [0.2002], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00074005126953125 loss: 0.000499725341796875 loss: 0.00136566162109375 loss: 0.000568389892578125 predicted value: tensor([[0.5430], [0.5273], [0.7344], [0.9844], [0.9961], [0.9688], [0.3164], [0.6406], [0.9844], [0.4961], [0.3730], [0.4551], [0.9883], [0.2988], [0.3809], [0.2041]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.8320], [1.0000], [1.0000], [1.0000], [0.3340], 
[0.6016], [1.0000], [0.4668], [0.4004], [0.5000], [1.0000], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003021240234375 loss: 0.0004730224609375 loss: 0.001251220703125 loss: 0.00067138671875 predicted value: tensor([[0.4297], [0.2295], [0.7266], [0.9688], [0.7852], [0.9961], [0.2471], [0.3203], [0.9727], [0.4180], [0.2090], [0.4395], [0.3086], [0.2852], [0.1836], [0.1855]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2002], [0.8008], [1.0000], [0.8320], [1.0000], [0.2500], [0.3340], [1.0000], [0.4004], [0.2500], [0.5000], [0.2500], [0.2500], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00151824951171875 loss: 0.000637054443359375 loss: 0.0010528564453125 loss: 0.00041961669921875 74%|███████▍ | 364/492 [3:17:44<1:07:32, 31.66s/it] {'loss': 0.0044, 'learning_rate': 1e-05, 'epoch': 0.74} 74%|███████▍ | 364/492 [3:17:44<1:07:32, 31.66s/it]predicted value: tensor([[0.6055], [0.8672], [0.6055], [0.5000], [1.0469], [0.4531], [0.7969], [1.0469], [0.6016], [0.1396], [0.4961], [0.4902], [0.2178], [0.3848], [0.2676], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.5547], [0.3750], [1.0000], [0.3145], [0.6680], [1.0000], [0.5000], [0.0400], [0.4668], [0.3340], [0.2002], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028839111328125 loss: 0.0016632080078125 loss: 0.00185394287109375 loss: 0.00189971923828125 predicted value: tensor([[0.4961], [0.6250], [0.6016], [0.7461], [0.3066], [0.4961], [0.7109], [0.6289], [0.4648], [0.6641], [0.4492], [0.8594], [0.5117], [0.6406], [0.2334], [0.2207]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.5547], [0.5547], [0.3340], [0.4668], [0.6680], [0.5000], [0.5000], [0.6016], [0.4004], [0.8008], [0.4004], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033416748046875 loss: 
0.00176239013671875 loss: 0.0009002685546875 loss: 0.00104522705078125 predicted value: tensor([[1.0703], [0.5625], [0.7266], [0.2754], [0.6250], [1.0469], [0.7188], [1.0312], [0.5820], [0.4844], [0.3633], [0.5625], [0.4609], [0.2480], [0.2598], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5000], [0.6680], [0.2002], [0.5547], [1.0000], [0.6680], [1.0000], [0.5000], [0.5000], [0.3340], [0.5000], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011749267578125 loss: 0.000843048095703125 loss: 0.00183868408203125 loss: 0.000934600830078125 predicted value: tensor([[0.4609], [0.4746], [0.4746], [0.4746], [1.0312], [0.8555], [0.5312], [0.3262], [0.8125], [0.3359], [0.6602], [0.5312], [0.4180], [0.4766], [0.2383], [0.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.4668], [0.4668], [1.0000], [0.8008], [0.6680], [0.3340], [0.6680], [0.2500], [0.7500], [0.5000], [0.4004], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002166748046875 loss: 0.0011749267578125 loss: 0.00075531005859375 loss: 0.0028533935546875 74%|███████▍ | 365/492 [3:18:16<1:06:55, 31.61s/it] {'loss': 0.0068, 'learning_rate': 1e-05, 'epoch': 0.74} 74%|███████▍ | 365/492 [3:18:16<1:06:55, 31.61s/it]predicted value: tensor([[0.9297], [0.8398], [0.8750], [0.7773], [1.0469], [0.6758], [1.0469], [0.5469], [1.0469], [0.3105], [1.0469], [0.4590], [0.4043], [0.5039], [0.2559], [0.2490]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.8008], [0.8320], [0.8008], [1.0000], [0.4668], [1.0000], [0.6016], [1.0000], [0.1670], [1.0000], [0.4004], [0.2500], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000804901123046875 loss: 0.001922607421875loss: 0.0020599365234375 loss: 0.0029144287109375 predicted value: tensor([[0.6016], [0.3008], [1.0625], [1.0234], [0.8242], [0.5000], [0.5117], 
[1.0312], [1.0469], [0.4805], [1.0625], [0.1221], [0.5352], [0.2461], [0.2480], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [1.0000], [1.0000], [0.8008], [0.4648], [0.4668], [1.0000], [1.0000], [0.3145], [1.0000], [0.0400], [0.3340], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00102996826171875 loss: 0.00152587890625 loss: 0.00150299072265625 loss: 0.00238037109375 predicted value: tensor([[0.3496], [0.6172], [0.5117], [0.4922], [1.0625], [0.4844], [1.0547], [0.8242], [0.3066], [0.7617], [0.6953], [0.6055], [0.4375], [0.4766], [0.2598], [0.2490]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.5547], [0.4668], [0.3750], [1.0000], [0.3750], [1.0000], [0.8008], [0.3340], [0.8008], [0.7500], [0.6016], [0.4004], [0.5000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00173187255859375 loss: 0.00083160400390625loss: 0.00112152099609375 loss: 0.00148773193359375 predicted value: tensor([[0.8789], [0.3027], [0.3672], [0.4824], [0.4863], [0.7383], [0.3750], [0.8281], [0.5703], [0.3516], [1.0469], [0.4336], [0.3457], [0.2812], [0.2852], [0.2295]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [0.3340], [0.3750], [0.3750], [0.6680], [0.6016], [0.8008], [0.5000], [0.3340], [1.0000], [0.3340], [0.2500], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00151824951171875 loss: 0.000782012939453125 loss: 0.000682830810546875 loss: 0.0019073486328125 74%|███████▍ | 366/492 [3:18:48<1:07:04, 31.94s/it] {'loss': 0.0061, 'learning_rate': 1e-05, 'epoch': 0.74} 74%|███████▍ | 366/492 [3:18:48<1:07:04, 31.94s/it]predicted value: tensor([[ 0.5430], [ 0.5039], [ 0.9688], [ 0.7539], [ 0.6562], [ 0.7344], [ 0.3457], [ 0.7578], [ 0.9844], [ 0.6211], [-0.0486], [ 0.3672], [ 0.2109], [ 0.4199], [ 0.1895], [ 0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value 
label: tensor([[0.6680], [0.5547], [1.0000], [0.8320], [0.4668], [0.8320], [0.2500], [0.8008], [1.0000], [0.6016], [0.0278], [0.5000], [0.1670], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001251220703125 loss: 0.0016937255859375 loss: 0.0016937255859375 loss: 0.00119781494140625 predicted value: tensor([[0.2500], [0.5625], [0.9883], [0.4727], [0.6875], [0.2451], [0.4473], [0.6875], [0.4180], [0.3730], [0.9805], [0.3789], [0.1816], [0.3828], [0.3477], [0.1689]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.4688], [1.0000], [0.4668], [0.8008], [0.2500], [0.4668], [0.4668], [0.5000], [0.4004], [1.0000], [0.4004], [0.1670], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0008392333984375 loss: 0.00183868408203125 loss: 0.00041961669921875 loss: 0.000492095947265625 predicted value: tensor([[0.4043], [0.9766], [0.9922], [0.8516], [0.5117], [0.7773], [0.6289], [0.2656], [0.6484], [0.2559], [0.5820], [0.6406], [0.3809], [0.3633], [0.0347], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [0.8555], [0.4668], [0.8320], [0.6016], [0.2500], [0.7500], [0.3340], [0.6016], [0.5000], [0.4004], [0.4004], [0.0625], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000606536865234375 loss: 0.000774383544921875 loss: 0.000652313232421875 loss: 0.00136566162109375 predicted value: tensor([[0.4258], [0.4023], [0.4238], [0.4629], [0.2334], [0.4062], [0.9766], [0.2432], [0.1816], [0.1953], [0.3613], [0.4219], [0.1562], [0.4336], [0.1660], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.4668], [0.2002], [0.4668], [1.0000], [0.2002], [0.1426], [0.2002], [0.5000], [0.4004], [0.2002], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000644683837890625 loss: 0.00188446044921875 loss: 0.000698089599609375 loss: 
0.000537872314453125 75%|███████▍ | 367/492 [3:19:22<1:07:34, 32.43s/it] {'loss': 0.0041, 'learning_rate': 1e-05, 'epoch': 0.75} 75%|███████▍ | 367/492 [3:19:22<1:07:34, 32.43s/it]predicted value: tensor([[0.3906], [0.1357], [0.1572], [0.4160], [1.0078], [0.7930], [0.6250], [0.6484], [0.6055], [0.6484], [0.4141], [0.7305], [0.9844], [0.4199], [0.1631], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.0625], [0.2002], [0.4668], [1.0000], [0.8008], [0.6680], [0.6016], [0.6016], [0.6680], [0.5000], [0.7500], [1.0000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009918212890625 loss: 0.000469207763671875 loss: 0.00080108642578125 loss: 0.0012664794921875 predicted value: tensor([[0.5352], [0.7930], [0.5000], [0.9961], [0.9922], [0.5781], [0.1865], [1.0000], [0.7031], [0.4941], [0.9648], [0.4199], [0.2930], [0.2090], [0.1943], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.4668], [1.0000], [1.0000], [0.5000], [0.2002], [1.0000], [0.8008], [0.2002], [1.0000], [0.5000], [0.6016], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000514984130859375 loss: 0.00127410888671875 loss: 0.0032501220703125 loss: 0.000408172607421875 predicted value: tensor([[0.9961], [0.5508], [0.5508], [0.7422], [1.0000], [0.5117], [0.4043], [0.9805], [0.6445], [0.2207], [0.9805], [0.4316], [0.3379], [0.3633], [0.1914], [0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4648], [0.5547], [0.8008], [1.0000], [0.4668], [0.3750], [1.0000], [0.7500], [0.3340], [1.0000], [0.5000], [0.4004], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0017242431640625 loss: 0.00099945068359375loss: 0.0028076171875 loss: 0.00080108642578125 predicted value: tensor([[0.6250], [0.4180], [0.2598], [0.7383], [0.4473], [0.6953], [0.4082], [0.5195], [0.3008], [0.4180], [0.2207], [0.3730], 
[0.4258], [0.3906], [0.1816], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.2500], [0.6680], [0.4668], [0.8008], [0.4668], [0.4668], [0.6016], [0.3750], [0.3340], [0.4004], [0.4004], [0.3340], [0.2500], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00067901611328125 loss: 0.00066375732421875 loss: 0.004547119140625 loss: 0.0022735595703125 75%|███████▍ | 368/492 [3:19:55<1:07:43, 32.77s/it] {'loss': 0.0059, 'learning_rate': 1e-05, 'epoch': 0.75} 75%|███████▍ | 368/492 [3:19:55<1:07:43, 32.77s/it]predicted value: tensor([[1.0625], [1.0469], [0.8672], [1.0781], [0.4629], [0.7422], [1.0625], [1.0391], [1.0547], [0.2910], [0.6211], [0.4785], [0.4277], [0.2715], [0.2715], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.8555], [1.0000], [0.4668], [0.8008], [1.0000], [1.0000], [1.0000], [0.2002], [0.5000], [0.5000], [0.3340], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00396728515625 loss: 0.0009002685546875loss: 0.001678466796875 loss: 0.001007080078125 predicted value: tensor([[1.0703], [0.4160], [0.2891], [0.7539], [0.4961], [0.8438], [0.5117], [0.6875], [0.6719], [1.0391], [0.1875], [0.4648], [0.5742], [0.2832], [0.3164], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.2500], [0.7500], [0.4668], [0.8008], [0.4668], [0.7500], [0.6016], [1.0000], [0.0400], [0.4004], [0.3340], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0017547607421875 loss: 0.00201416015625loss: 0.00179290771484375 loss: 0.000759124755859375 predicted value: tensor([[0.4707], [0.8477], [0.8516], [0.8438], [0.5039], [0.7930], [1.0625], [0.7148], [0.4668], [0.5625], [0.4082], [0.4980], [0.5000], [0.4395], [0.2812], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.8008], [0.8008], [0.3750], [0.8008], 
[1.0000], [0.7500], [0.4668], [0.4277], [0.3340], [0.4004], [0.5000], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010986328125 loss: 0.001495361328125 loss: 0.0006866455078125 loss: 0.001007080078125 predicted value: tensor([[0.5859], [0.5117], [0.8438], [0.3027], [0.4883], [0.4629], [1.0469], [0.7305], [0.5508], [0.7344], [0.6094], [0.5547], [0.4512], [0.2676], [0.2578], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8320], [0.2500], [0.4668], [0.3750], [1.0000], [0.5703], [0.3145], [0.8008], [0.6016], [0.4004], [0.4004], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009765625 loss: 0.0021209716796875 loss: 0.00165557861328125 loss: 0.0021514892578125 75%|███████▌ | 369/492 [3:20:29<1:07:33, 32.96s/it] {'loss': 0.0063, 'learning_rate': 1e-05, 'epoch': 0.75} 75%|███████▌ | 369/492 [3:20:29<1:07:33, 32.96s/it]predicted value: tensor([[0.9102], [0.7344], [0.3906], [0.4707], [0.3047], [0.5430], [0.3086], [0.6914], [0.6406], [0.3184], [0.6758], [0.4570], [0.2715], [0.2480], [0.2695], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6680], [0.3340], [0.4668], [0.2500], [0.4668], [0.2500], [0.6016], [0.6016], [0.2500], [0.6016], [0.4004], [0.2002], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027313232421875 loss: 0.00099945068359375loss: 0.00144195556640625 loss: 0.002288818359375 predicted value: tensor([[1.0234], [0.5586], [0.2832], [0.2520], [0.4492], [1.0547], [1.0391], [0.6367], [0.4395], [0.5430], [0.4668], [0.5391], [0.3438], [0.2695], [0.4492], [0.5117]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.3340], [0.2002], [0.4668], [1.0000], [1.0000], [0.5000], [0.2852], [0.5000], [0.3340], [0.5000], [0.3340], [0.2002], [0.4004], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00112152099609375 loss: 
0.00146484375 loss: 0.00152587890625 loss: 0.003997802734375 predicted value: tensor([[0.3164], [0.3359], [0.3438], [0.7656], [0.7773], [0.4609], [0.2617], [0.6797], [0.3398], [0.8047], [0.5117], [0.4980], [0.5039], [0.7070], [0.2578], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.2500], [0.3340], [0.5703], [0.6680], [0.4668], [0.2002], [0.6016], [0.2002], [0.6680], [0.4668], [0.3340], [0.3340], [0.7500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0036773681640625 loss: 0.0028533935546875 loss: 0.0030517578125 loss: 0.0021209716796875 predicted value: tensor([[0.4180], [1.0469], [0.7656], [1.0469], [1.0469], [0.6875], [1.0547], [0.6016], [0.4453], [0.5039], [0.4219], [0.5508], [0.4414], [0.2891], [0.2832], [0.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [1.0000], [1.0000], [0.6016], [1.0000], [0.6016], [0.3340], [0.4004], [0.4004], [0.5000], [0.2500], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00110626220703125 loss: 0.002838134765625loss: 0.0015411376953125 loss: 0.000728607177734375 75%|███████▌ | 370/492 [3:21:01<1:06:36, 32.76s/it] {'loss': 0.0084, 'learning_rate': 1e-05, 'epoch': 0.75} 75%|███████▌ | 370/492 [3:21:01<1:06:36, 32.76s/it]predicted value: tensor([[0.6680], [0.3828], [0.7422], [0.7852], [0.6797], [0.9805], [0.3652], [0.5859], [0.4883], [0.5430], [0.5703], [0.4141], [0.3867], [0.1895], [0.2285], [0.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.8008], [0.8320], [0.8008], [1.0000], [0.5000], [0.3340], [0.5000], [0.5000], [0.6016], [0.5000], [0.3340], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00131988525390625 loss: 0.002471923828125loss: 0.0009613037109375 loss: 0.000461578369140625 predicted value: tensor([[0.9648], [0.6836], [0.3848], [0.3633], [0.2285], [0.9844], [0.5430], [0.3809], 
[0.2559], [0.4141], [0.5195], [0.6758], [0.1943], [0.4473], [0.2090], [0.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.4668], [0.3145], [0.2500], [1.0000], [0.6016], [0.4668], [0.2500], [0.4668], [0.7500], [0.6680], [0.2002], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000698089599609375 loss: 0.000942230224609375 loss: 0.00148773193359375 loss: 0.00128936767578125 predicted value: tensor([[0.9727], [0.8164], [0.9727], [0.7852], [0.6680], [0.9922], [0.4785], [0.2363], [0.2812], [0.3750], [0.3320], [0.3965], [0.2207], [0.3145], [0.2031], [0.2139]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [1.0000], [0.8320], [0.6680], [1.0000], [0.5000], [0.2500], [0.2500], [0.3340], [0.3340], [0.4004], [0.2002], [0.0278], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00299072265625 loss: 0.00156402587890625 loss: 0.00141143798828125 loss: 0.000667572021484375 predicted value: tensor([[0.4141], [0.7891], [0.7695], [0.9961], [0.9844], [0.5195], [0.3691], [0.7812], [0.7617], [0.5156], [0.4102], [0.3906], [0.2041], [0.2188], [0.2578], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.8008], [1.0000], [1.0000], [0.8008], [0.3750], [0.8008], [0.4668], [0.5000], [0.3340], [0.3340], [0.2002], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019073486328125 loss: 0.00067901611328125 loss: 0.0029296875 loss: 0.000629425048828125 75%|███████▌ | 371/492 [3:21:32<1:05:06, 32.28s/it] {'loss': 0.0056, 'learning_rate': 1e-05, 'epoch': 0.75} 75%|███████▌ | 371/492 [3:21:32<1:05:06, 32.28s/it]predicted value: tensor([[0.9922], [0.8242], [0.1592], [0.3613], [0.6875], [0.5781], [0.6211], [0.3945], [0.6680], [0.2578], [0.5117], [0.4004], [0.3477], [0.2002], [0.1934], [0.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], 
[0.8320], [0.2002], [0.3750], [0.6680], [0.6016], [0.4668], [0.3340], [0.7500], [0.2500], [0.5000], [0.4004], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000827789306640625 loss: 0.000591278076171875 loss: 0.00099945068359375 loss: 0.000865936279296875 predicted value: tensor([[0.6602], [0.9805], [0.2656], [0.2852], [0.3711], [0.2129], [0.2188], [0.6680], [0.2344], [1.0000], [0.2637], [0.4082], [0.4375], [0.3770], [0.1787], [0.1787]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [1.0000], [0.3340], [0.2500], [0.3340], [0.2500], [0.2500], [0.6680], [0.3340], [1.0000], [0.2500], [0.4004], [0.5000], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025177001953125 loss: 0.0003948211669921875loss: 0.0019683837890625 loss: 0.00194549560546875 predicted value: tensor([[0.6680], [0.7734], [0.3594], [0.9805], [0.3008], [0.7383], [0.5664], [0.6211], [0.4121], [0.2988], [0.5078], [0.4121], [0.1846], [0.2275], [0.2129], [0.2119]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.3750], [1.0000], [0.2500], [0.8008], [0.5000], [0.7500], [0.4668], [0.2500], [0.5000], [0.4004], [0.1426], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00052642822265625 loss: 0.00138092041015625loss: 0.002227783203125 loss: 0.0028533935546875 predicted value: tensor([[0.5977], [0.4004], [0.7188], [0.9922], [0.6680], [0.7695], [0.2158], [0.3848], [0.4551], [0.3184], [0.0977], [0.5312], [0.4082], [0.2422], [0.1973], [0.4727]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.4668], [0.8008], [1.0000], [0.6680], [0.8008], [0.3340], [0.2500], [0.5000], [0.2500], [0.0278], [0.6016], [0.4004], [0.2500], [0.1670], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000934600830078125 loss: 0.00238037109375 loss: 0.0010528564453125loss: 0.00064849853515625 76%|███████▌ | 372/492 
[3:22:04<1:04:08, 32.07s/it] {'loss': 0.0055, 'learning_rate': 1e-05, 'epoch': 0.76} 76%|███████▌ | 372/492 [3:22:04<1:04:08, 32.07s/it]predicted value: tensor([[0.8789], [0.6289], [1.0547], [0.8477], [0.7656], [0.4746], [0.6367], [0.7812], [0.7969], [0.5000], [0.7266], [0.7148], [0.4492], [0.5273], [0.2754], [0.3066]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [1.0000], [0.8008], [0.6680], [0.4668], [0.7500], [0.8008], [0.8008], [0.5000], [0.6016], [0.7500], [0.4004], [0.5000], [0.2500], [0.0625]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00250244140625 loss: 0.0021820068359375loss: 0.00098419189453125 loss: 0.002105712890625 predicted value: tensor([[0.7500], [1.0547], [1.0625], [0.3457], [0.7930], [0.4688], [1.0625], [0.7812], [0.4531], [0.6758], [0.2930], [0.6484], [0.8047], [0.4258], [0.3008], [0.2930]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.3340], [0.8008], [0.3750], [1.0000], [0.8008], [0.2500], [0.7500], [0.2500], [0.7500], [0.8320], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000743865966796875 loss: 0.00146484375 loss: 0.001922607421875 loss: 0.00115966796875 predicted value: tensor([[0.8477], [0.5000], [0.4688], [0.3281], [0.4570], [0.8789], [0.4844], [1.0625], [1.0625], [0.6211], [0.8281], [0.5117], [0.5273], [0.2832], [0.2812], [0.4434]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.4668], [0.2500], [0.4668], [0.8555], [0.5000], [1.0000], [1.0000], [0.5000], [0.8008], [0.5000], [0.4004], [0.2002], [0.2500], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00099945068359375 loss: 0.00164031982421875 loss: 0.00106048583984375 loss: 0.00119781494140625 predicted value: tensor([[0.4785], [0.4668], [0.6523], [0.8633], [1.0469], [0.4707], [0.5430], [0.6875], [0.5234], [0.6484], [0.6250], [0.2812], [0.1484], [1.0625], [0.2598], [0.2490]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.8320], [1.0000], [0.4668], [0.6016], [0.7500], [0.6016], [0.7500], [0.2500], [0.2500], [0.0625], [1.0000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00140380859375 loss: 0.0012664794921875 loss: 0.003509521484375 loss: 0.001739501953125 76%|███████▌ | 373/492 [3:22:35<1:03:06, 31.82s/it] {'loss': 0.0065, 'learning_rate': 1e-05, 'epoch': 0.76} 76%|███████▌ | 373/492 [3:22:35<1:03:06, 31.82s/it]predicted value: tensor([[0.6172], [1.0547], [1.0625], [0.3086], [1.0625], [0.7734], [1.0625], [0.6484], [0.5078], [0.6562], [0.6758], [0.6875], [0.6953], [0.2471], [0.2490], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.2500], [1.0000], [0.8008], [1.0000], [0.7500], [0.4668], [0.7500], [0.7500], [0.6016], [0.7500], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003204345703125 loss: 0.00133514404296875 loss: 0.001129150390625 loss: 0.00188446044921875 predicted value: tensor([[0.5000], [0.5117], [0.4922], [0.8008], [1.0391], [0.8672], [1.0469], [0.4512], [0.6406], [1.0469], [0.4805], [0.5586], [0.4922], [0.2656], [0.2578], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.8008], [1.0000], [0.8320], [1.0000], [0.4668], [0.6016], [1.0000], [0.4004], [0.5000], [0.4668], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00103759765625 loss: 0.0008087158203125loss: 0.00084686279296875 loss: 0.00058746337890625 predicted value: tensor([[0.9102], [0.7461], [0.8086], [0.7539], [0.4648], [1.0625], [0.5820], [0.3398], [0.6133], [0.3535], [1.0625], [0.3828], [0.4961], [0.4531], [0.2715], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3750], [0.8320], [0.8008], [0.3750], [1.0000], [0.8008], [0.3340], [0.6016], [0.2500], 
[1.0000], [0.4004], [0.5000], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023651123046875 loss: 0.003509521484375loss: 0.0010833740234375 loss: 0.00113677978515625 predicted value: tensor([[0.2871], [1.0469], [0.6133], [0.5898], [0.7969], [0.5859], [0.3203], [0.4902], [0.3867], [0.4434], [0.5195], [0.4922], [0.5078], [0.4746], [0.2373], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [1.0000], [0.5547], [0.5547], [0.8008], [0.5547], [0.2002], [0.3750], [0.2500], [0.3340], [0.3750], [0.4004], [0.5000], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001922607421875 loss: 0.00112152099609375 loss: 0.0023193359375 loss: 0.00144195556640625 76%|███████▌ | 374/492 [3:23:07<1:02:28, 31.77s/it] {'loss': 0.0064, 'learning_rate': 1e-05, 'epoch': 0.76} 76%|███████▌ | 374/492 [3:23:07<1:02:28, 31.77s/it]predicted value: tensor([[0.7305], [0.4414], [0.7188], [0.4551], [0.4355], [0.4414], [0.4062], [0.5352], [0.4922], [0.4258], [0.6953], [0.4258], [0.3633], [0.3379], [0.3887], [0.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3340], [0.8008], [0.4668], [0.4668], [0.3750], [0.4668], [0.7500], [0.5000], [0.4004], [0.6680], [0.4004], [0.4004], [0.4004], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00141143798828125 loss: 0.0004749298095703125 loss: 0.00153350830078125 loss: 0.00083160400390625 predicted value: tensor([[0.4688], [0.2207], [0.7344], [0.7539], [0.3301], [0.6953], [0.9883], [0.7305], [0.4336], [0.6758], [0.2617], [0.9766], [0.3516], [0.3633], [0.1963], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.5547], [0.4668], [0.3340], [0.6680], [1.0000], [0.8008], [0.4668], [0.8008], [0.3340], [1.0000], [0.3340], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002227783203125 loss: 0.0023193359375 loss: 
0.00070953369140625 loss: 0.0011749267578125 predicted value: tensor([[0.4121], [0.9961], [0.5508], [0.9531], [0.4668], [0.4160], [0.9961], [0.5859], [0.9844], [0.4141], [0.3281], [0.5859], [0.3398], [0.4160], [0.1641], [0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.5547], [1.0000], [0.4668], [0.3750], [1.0000], [0.5000], [1.0000], [0.4004], [0.0625], [0.6016], [0.3340], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013427734375 loss: 0.00146484375 loss: 0.0011444091796875 loss: 0.000415802001953125 predicted value: tensor([[0.7617], [0.9961], [0.5859], [0.5938], [0.2930], [0.2734], [0.9844], [0.7070], [0.5664], [0.4512], [0.6367], [0.5938], [0.3652], [0.2080], [0.2002], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.5547], [0.6016], [0.2500], [0.2500], [1.0000], [0.8008], [0.6016], [0.3145], [0.7500], [0.6680], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00089263916015625 loss: 0.000904083251953125 loss: 0.000888824462890625 loss: 0.001007080078125 76%|███████▌ | 375/492 [3:23:38<1:01:48, 31.69s/it] {'loss': 0.0047, 'learning_rate': 1e-05, 'epoch': 0.76} 76%|███████▌ | 375/492 [3:23:38<1:01:48, 31.69s/it]predicted value: tensor([[0.4766], [0.9961], [0.5469], [0.4766], [0.7695], [0.4238], [0.4160], [0.2295], [0.4277], [0.6055], [0.3887], [0.4023], [0.4121], [0.2168], [0.2002], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.5547], [0.4668], [0.8008], [0.6016], [0.4668], [0.2500], [0.4668], [0.7500], [0.5000], [0.4004], [0.4004], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00034332275390625 loss: 0.0011444091796875loss: 0.000179290771484375 loss: 0.00127410888671875 predicted value: tensor([[0.9766], [0.4629], [0.2178], [0.6055], [0.6172], [0.9922], [0.7695], [0.7422], [0.4238], [0.3496], 
[0.9609], [0.2080], [0.3965], [0.5820], [0.1582], [0.2139]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.2500], [0.6016], [0.6016], [1.0000], [0.8008], [0.8008], [0.3750], [0.2852], [1.0000], [0.1001], [0.4004], [0.6016], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000492095947265625 loss: 0.000461578369140625 loss: 0.000865936279296875 loss: 0.00038909912109375 predicted value: tensor([[0.4785], [0.4297], [0.7461], [0.5586], [0.4062], [0.9883], [0.9883], [0.5273], [0.5078], [0.5820], [1.0000], [0.3379], [0.2295], [0.2305], [0.2754], [0.1621]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.8008], [0.6016], [0.3750], [1.0000], [1.0000], [0.5000], [0.5000], [0.6016], [1.0000], [0.2002], [0.2500], [0.4004], [0.2852], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00176239013671875 loss: 0.00106048583984375loss: 0.001861572265625 loss: 0.0003414154052734375 predicted value: tensor([[0.5781], [0.8164], [0.7227], [0.4609], [0.7266], [0.9688], [0.7070], [0.3574], [0.2676], [0.2578], [0.2715], [0.9688], [0.4434], [0.4004], [0.1895], [0.2041]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.8320], [0.4668], [0.8008], [1.0000], [0.8008], [0.3340], [0.2002], [0.2500], [0.2002], [1.0000], [0.5000], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001617431640625 loss: 0.0024261474609375 loss: 0.00075531005859375 loss: 0.00077056884765625 76%|███████▋ | 376/492 [3:24:10<1:01:16, 31.69s/it] {'loss': 0.0039, 'learning_rate': 1e-05, 'epoch': 0.76} 76%|███████▋ | 376/492 [3:24:10<1:01:16, 31.69s/it]predicted value: tensor([[0.6641], [1.0625], [0.7305], [1.0703], [0.3535], [0.1572], [0.3848], [0.5742], [1.0703], [0.4219], [0.4297], [0.4707], [0.4492], [0.4512], [0.2471], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [1.0000], [0.7500], 
[1.0000], [0.2500], [0.0625], [0.3340], [0.3750], [1.0000], [0.3340], [0.4004], [0.5000], [0.4004], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00194549560546875 loss: 0.0018463134765625 loss: 0.00145721435546875 loss: 0.00133514404296875 predicted value: tensor([[1.0938], [0.5117], [0.5117], [0.5312], [0.8203], [0.3027], [0.3848], [0.7852], [0.3066], [0.1709], [0.1914], [0.6445], [0.2910], [0.7656], [0.4492], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.4668], [0.8008], [0.2500], [0.3340], [0.6680], [0.2002], [0.0400], [0.0400], [0.6016], [0.2500], [0.7500], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020751953125 loss: 0.00164031982421875 loss: 0.003082275390625 loss: 0.001312255859375 predicted value: tensor([[0.6055], [0.5234], [0.8867], [0.8516], [0.3750], [0.5234], [0.7969], [0.3203], [0.6250], [0.6914], [0.6992], [0.3398], [0.4668], [0.4902], [0.2393], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.8320], [0.8320], [0.3340], [0.4668], [0.8008], [0.2002], [0.6016], [0.7500], [0.6016], [0.3340], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010986328125 loss: 0.00107574462890625loss: 0.00127410888671875 loss: 0.00115203857421875 predicted value: tensor([[0.6445], [0.5234], [0.5078], [0.5273], [0.4473], [0.7227], [1.0781], [0.4883], [1.0469], [0.7109], [0.6445], [0.4629], [0.2344], [0.4180], [0.2275], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.3750], [0.4668], [0.3750], [0.8008], [1.0000], [0.4668], [1.0000], [0.6016], [0.6016], [0.4004], [0.1426], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030364990234375 loss: 0.00124359130859375 loss: 0.00145721435546875 loss: 0.0023345947265625 77%|███████▋ | 377/492 [3:24:42<1:01:14, 31.95s/it] {'loss': 
0.0068, 'learning_rate': 1e-05, 'epoch': 0.77} 77%|███████▋ | 377/492 [3:24:42<1:01:14, 31.95s/it]predicted value: tensor([[0.6836], [0.5898], [0.5000], [0.8594], [0.3398], [0.8555], [0.7031], [0.7539], [0.5977], [0.4902], [0.4980], [0.3633], [0.4707], [0.2344], [0.2334], [0.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.4004], [0.4668], [0.8008], [0.3340], [0.8008], [0.6016], [0.6016], [0.5703], [0.3340], [0.5000], [0.2852], [0.4004], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115966796875 loss: 0.00180816650390625loss: 0.001983642578125 loss: 0.000965118408203125 predicted value: tensor([[0.6797], [1.0625], [0.3301], [0.7969], [0.8359], [0.4805], [0.4375], [0.6406], [0.4824], [0.6172], [0.4727], [0.2988], [0.3164], [0.4453], [0.2197], [0.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3340], [0.6680], [0.8008], [0.4668], [0.1387], [0.6016], [0.3750], [0.5547], [0.4004], [0.2500], [0.2500], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000926971435546875 loss: 0.002777099609375 loss: 0.00250244140625loss: 0.0021820068359375 predicted value: tensor([[0.4766], [1.0703], [0.3691], [0.4902], [0.8008], [0.7500], [0.6445], [0.6523], [0.6172], [0.5273], [1.0781], [1.0547], [0.2734], [0.2471], [0.2314], [0.2207]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.2500], [0.4668], [0.8008], [0.8008], [0.5547], [0.3750], [0.6016], [0.3750], [1.0000], [1.0000], [0.2002], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002960205078125 loss: 0.0023651123046875loss: 0.002227783203125 loss: 0.000904083251953125 predicted value: tensor([[0.4902], [0.5859], [0.4883], [0.5078], [1.0859], [1.0547], [0.5547], [0.3223], [1.0547], [0.7031], [0.3320], [0.1797], [0.3086], [0.2363], [0.2256], [0.2021]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) 
value label: tensor([[0.3750], [0.6680], [0.4668], [0.4668], [1.0000], [1.0000], [0.5000], [0.2500], [1.0000], [0.4668], [0.3340], [0.4004], [0.2500], [0.1670], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000640869140625 loss: 0.001617431640625 loss: 0.00179290771484375 loss: 0.002471923828125 77%|███████▋ | 378/492 [3:25:14<1:00:45, 31.98s/it] {'loss': 0.0073, 'learning_rate': 1e-05, 'epoch': 0.77} 77%|███████▋ | 378/492 [3:25:14<1:00:45, 31.98s/it]predicted value: tensor([[0.5625], [0.8047], [1.0000], [0.2285], [0.6758], [0.1953], [0.7773], [0.2754], [1.0000], [0.9883], [0.5664], [0.4648], [0.4648], [0.3574], [0.1787], [0.1523]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.7148], [1.0000], [0.3340], [0.5703], [0.2500], [0.8008], [0.3340], [1.0000], [1.0000], [0.6016], [0.5000], [0.7500], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030059814453125 loss: 0.0021514892578125loss: 0.000370025634765625 loss: 0.004608154296875 predicted value: tensor([[0.7734], [1.0078], [0.4082], [0.8086], [0.6406], [0.2041], [1.0000], [0.3574], [0.4043], [0.6719], [0.6133], [0.2812], [0.4062], [0.1216], [0.1533], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [0.3750], [0.8320], [0.6016], [0.2002], [1.0000], [0.3750], [0.4668], [0.6680], [0.6016], [0.2500], [0.4004], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001251220703125 loss: 0.000225067138671875loss: 0.000591278076171875 loss: 0.0006561279296875 predicted value: tensor([[0.4414], [0.3652], [1.0078], [0.4980], [0.7891], [0.4219], [0.2871], [1.0000], [1.0078], [0.6953], [0.3926], [0.0962], [0.2480], [0.1553], [0.1836], [0.1494]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3750], [1.0000], [0.4668], [0.8320], [0.3750], [0.3340], [1.0000], [1.0000], [0.8008], [0.3340], [0.0400], [0.5000], [0.1670], [0.2500], 
[0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00058746337890625 loss: 0.00156402587890625 loss: 0.0003795623779296875 loss: 0.00037384033203125 predicted value: tensor([[0.4258], [0.2197], [0.7148], [0.6133], [1.0156], [0.5898], [0.2793], [0.6484], [0.1592], [0.7070], [0.5352], [0.3145], [0.4043], [0.1631], [0.1533], [0.1523]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.7500], [0.6680], [1.0000], [0.8320], [0.3340], [0.7500], [0.2002], [0.6016], [0.7500], [0.2500], [0.3340], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002593994140625 loss: 0.0008087158203125 loss: 0.000766754150390625 loss: 0.0009918212890625 77%|███████▋ | 379/492 [3:25:46<59:51, 31.79s/it] {'loss': 0.0052, 'learning_rate': 1e-05, 'epoch': 0.77} 77%|███████▋ | 379/492 [3:25:46<59:51, 31.79s/it]predicted value: tensor([[0.9805], [0.4258], [0.9844], [0.9961], [0.9844], [0.5742], [0.2637], [0.6133], [0.6875], [0.4180], [0.5391], [0.3984], [0.3984], [0.1904], [0.1992], [0.2061]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [1.0000], [1.0000], [0.7500], [0.3340], [0.6016], [0.7500], [0.5000], [0.6680], [0.4004], [0.5000], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000286102294921875 loss: 0.000732421875loss: 0.00124359130859375 loss: 0.002044677734375 predicted value: tensor([[0.7891], [1.0078], [1.0078], [0.5625], [1.0156], [0.5469], [1.0078], [0.2422], [0.2051], [0.5859], [0.7656], [0.4473], [0.4062], [0.4199], [0.3945], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [1.0000], [1.0000], [0.5547], [1.0000], [0.7148], [1.0000], [0.2500], [0.2500], [0.6016], [0.8008], [0.4668], [0.4004], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000667572021484375 loss: 0.000701904296875 loss: 0.00067901611328125 loss: 0.0002574920654296875 predicted value: 
tensor([[0.7969], [0.4297], [0.7812], [0.6719], [0.4160], [0.7930], [0.8438], [0.9805], [0.7500], [0.5820], [0.3418], [0.4199], [0.6836], [0.4531], [0.1641], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.8008], [0.4668], [0.4668], [0.8008], [0.8320], [1.0000], [0.6680], [0.6016], [0.4004], [0.4004], [0.7500], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000507354736328125 loss: 0.00113677978515625loss: 0.000850677490234375 loss: 0.000537872314453125 predicted value: tensor([[0.5898], [0.4199], [0.7578], [0.2109], [0.2246], [0.7812], [0.2158], [0.4062], [0.0645], [0.6328], [0.9805], [0.6914], [0.3965], [0.3691], [0.1621], [0.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.4668], [0.8008], [0.2500], [0.3340], [0.8008], [0.2500], [0.4668], [0.0400], [0.7500], [1.0000], [0.8008], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00170135498046875 loss: 0.001861572265625 loss: 0.0026092529296875 loss: 0.0010223388671875 77%|███████▋ | 380/492 [3:26:18<59:31, 31.89s/it] {'loss': 0.0042, 'learning_rate': 1e-05, 'epoch': 0.77} 77%|███████▋ | 380/492 [3:26:18<59:31, 31.89s/it]predicted value: tensor([[0.5625], [1.0547], [0.7812], [0.4512], [1.0312], [0.7656], [0.7539], [1.0312], [1.0234], [0.7812], [0.7656], [0.6719], [0.5234], [0.4883], [0.2520], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8555], [0.3750], [1.0000], [0.7500], [0.6680], [1.0000], [1.0000], [0.7500], [0.7500], [0.6016], [0.4004], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00194549560546875 loss: 0.000949859619140625 loss: 0.002044677734375 loss: 0.0014495849609375 predicted value: tensor([[0.4395], [0.4766], [0.5117], [1.0234], [0.6484], [0.7695], [0.6211], [0.7578], [0.3516], [0.3027], [0.4727], [0.4453], [0.6406], [0.2754], [0.2988], [0.2500]], 
device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.4668], [1.0000], [0.5000], [0.7500], [0.3750], [0.6016], [0.2500], [0.2002], [0.2500], [0.4004], [0.6016], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016021728515625 loss: 0.003021240234375 loss: 0.0057373046875 loss: 0.0024871826171875 predicted value: tensor([[0.4785], [0.5078], [0.8867], [0.5039], [0.7266], [0.5547], [1.0625], [0.8086], [0.7227], [0.4609], [0.3457], [0.4492], [0.4199], [0.2578], [0.2324], [0.2324]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8320], [0.4668], [0.6680], [0.3750], [1.0000], [0.7500], [0.6680], [0.3145], [0.3340], [0.3340], [0.4004], [0.2500], [0.1426], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004302978515625 loss: 0.00138092041015625 loss: 0.00156402587890625 loss: 0.00183868408203125 predicted value: tensor([[0.8438], [0.7930], [0.7656], [0.7070], [0.3262], [0.5820], [0.6133], [1.0391], [1.0312], [1.0312], [0.8594], [0.7383], [0.4883], [0.4727], [0.2197], [0.2441]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.8555], [0.6680], [0.8320], [0.2500], [0.5000], [0.5547], [1.0000], [1.0000], [1.0000], [0.8008], [0.7500], [0.5000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00099945068359375 loss: 0.001129150390625 loss: 0.00170135498046875 loss: 0.0036773681640625 77%|███████▋ | 381/492 [3:26:50<58:53, 31.83s/it] {'loss': 0.009, 'learning_rate': 1e-05, 'epoch': 0.77} 77%|███████▋ | 381/492 [3:26:50<58:53, 31.83s/it]predicted value: tensor([[0.5898], [0.8320], [0.8242], [0.4707], [0.8047], [1.0000], [0.3926], [0.3516], [0.3301], [0.8281], [0.5508], [0.4082], [0.4785], [0.4824], [0.2422], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5703], [0.8008], [0.3750], [0.6680], [1.0000], [0.2500], [0.3340], [0.3340], [0.8008], 
[0.5000], [0.5000], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00103759765625 loss: 0.00185394287109375 loss: 0.0023040771484375 loss: 0.00189208984375 predicted value: tensor([[0.6328], [0.8516], [0.4922], [0.8047], [0.7773], [0.7383], [0.3027], [0.7695], [0.9961], [0.6836], [0.7695], [0.3789], [0.5039], [0.4043], [0.2695], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.8320], [0.4668], [0.7500], [0.8008], [0.3145], [0.3340], [0.7500], [1.0000], [0.6016], [0.6016], [0.2500], [0.4004], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020904541015625 loss: 0.0015106201171875 loss: 0.003997802734375 loss: 0.00112152099609375 predicted value: tensor([[0.5039], [0.7617], [0.5273], [0.4980], [0.4707], [0.7695], [0.3477], [1.0156], [1.0156], [1.0078], [0.7500], [0.4551], [0.5234], [0.7734], [0.2930], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.6680], [0.4668], [0.4668], [0.2500], [0.8008], [0.3340], [1.0000], [1.0000], [1.0000], [0.6016], [0.2002], [0.4004], [0.7500], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001556396484375 loss: 0.0029754638671875loss: 0.00148773193359375 loss: 0.00164794921875 predicted value: tensor([[0.3574], [0.4863], [0.5664], [0.8359], [0.7148], [0.3848], [1.0312], [0.8359], [0.6250], [0.5586], [0.5234], [0.6445], [0.4316], [0.5000], [0.4277], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.3750], [0.4668], [0.8008], [0.7500], [0.3340], [1.0000], [0.8008], [0.6016], [0.5547], [0.4668], [0.5000], [0.4004], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000942230224609375 loss: 0.000705718994140625 loss: 0.0012969970703125 loss: 0.00109100341796875 78%|███████▊ | 382/492 [3:27:22<58:28, 31.89s/it] {'loss': 0.0069, 'learning_rate': 1e-05, 'epoch': 0.78} 78%|███████▊ | 382/492 
[3:27:22<58:28, 31.89s/it]predicted value: tensor([[0.4219], [0.9609], [0.2812], [0.7148], [0.2119], [0.4199], [0.3086], [0.9414], [0.2656], [0.9375], [0.3711], [0.4473], [0.4102], [0.0952], [0.1904], [0.1748]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.2500], [0.7500], [0.2002], [0.4668], [0.3340], [1.0000], [0.2500], [1.0000], [0.4004], [0.4668], [0.5000], [0.0625], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00096893310546875 loss: 0.0030670166015625 loss: 0.00041961669921875 loss: 0.000736236572265625 predicted value: tensor([[0.4570], [0.9570], [0.2109], [0.6016], [0.5703], [0.7578], [0.7109], [0.7617], [0.6328], [0.4570], [0.4316], [0.6836], [0.4297], [0.2188], [0.1846], [0.2002]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.2002], [0.4668], [0.6016], [0.8008], [0.6680], [0.8008], [0.6016], [0.4668], [0.4004], [0.5703], [0.4004], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0007171630859375loss: 0.00131988525390625 loss: 0.0016021728515625 loss: 0.00106048583984375 predicted value: tensor([[0.4180], [0.4102], [0.5547], [0.6875], [0.4121], [0.4590], [0.2402], [0.2559], [0.5742], [0.7031], [0.5078], [0.6289], [0.3359], [0.1914], [0.1943], [0.1826]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.5547], [0.6016], [0.3750], [0.7500], [0.2002], [0.2002], [0.6016], [0.6680], [0.5000], [0.6016], [0.3340], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00122833251953125 loss: 0.00171661376953125loss: 0.00157928466796875 loss: 0.00144195556640625 predicted value: tensor([[0.9336], [0.4902], [0.4141], [0.4941], [0.9297], [0.2910], [0.6367], [0.4961], [0.3750], [0.9453], [0.9297], [0.9375], [0.4238], [0.4590], [0.1768], [0.1904]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.5547], 
[1.0000], [0.2500], [0.6016], [0.4004], [0.4004], [1.0000], [1.0000], [1.0000], [0.4004], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00103759765625 loss: 0.00157928466796875 loss: 0.000675201416015625loss: 0.00147247314453125 78%|███████▊ | 383/492 [3:27:53<57:42, 31.77s/it] {'loss': 0.0052, 'learning_rate': 1e-05, 'epoch': 0.78} 78%|███████▊ | 383/492 [3:27:53<57:42, 31.77s/it]predicted value: tensor([[0.5586], [0.9688], [0.7656], [0.7188], [0.2578], [0.6641], [0.6016], [0.9570], [0.2393], [0.6719], [0.5352], [0.4121], [0.3770], [0.1992], [0.3105], [0.2129]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [1.0000], [0.8320], [0.8008], [0.3340], [0.4668], [0.6016], [1.0000], [0.2500], [0.6016], [0.4668], [0.4004], [0.4004], [0.2500], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0031585693359375 loss: 0.0027923583984375 loss: 0.00173187255859375 loss: 0.004608154296875 predicted value: tensor([[0.5391], [0.4434], [0.9492], [0.4473], [0.6094], [0.9609], [0.2578], [0.6523], [0.5938], [0.4199], [0.6992], [0.4141], [0.3945], [0.3418], [0.4180], [0.2012]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [1.0000], [0.4668], [0.6016], [1.0000], [0.2500], [0.6016], [0.6016], [0.4668], [0.7500], [0.4004], [0.3340], [0.3340], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00119781494140625 loss: 0.002166748046875 loss: 0.000408172607421875 loss: 0.00191497802734375 predicted value: tensor([[0.9609], [0.2275], [0.3691], [0.4688], [0.8008], [0.7617], [0.9414], [0.9609], [0.6641], [0.5820], [0.1406], [0.3633], [0.3965], [0.3516], [0.2080], [0.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.2500], [0.4668], [0.8320], [0.8008], [1.0000], [1.0000], [0.6016], [0.6016], [0.0278], [0.4004], [0.4004], [0.4004], [0.1426], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.0004520416259765625 loss: 0.002044677734375 loss: 0.000766754150390625 loss: 0.0010223388671875 predicted value: tensor([[0.4062], [0.8242], [0.9883], [0.5312], [0.3750], [0.1670], [0.7344], [0.6406], [0.3535], [0.5195], [0.6406], [0.4180], [0.2012], [0.4102], [0.4238], [0.1729]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [1.0000], [0.5000], [0.3750], [0.2002], [0.6680], [0.6016], [0.0625], [0.6016], [0.7500], [0.4004], [0.2500], [0.0400], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000823974609375 loss: 0.000865936279296875 loss: 0.0009307861328125 loss: 0.0040283203125 78%|███████▊ | 384/492 [3:28:25<57:09, 31.75s/it] {'loss': 0.0072, 'learning_rate': 1e-05, 'epoch': 0.78} 78%|███████▊ | 384/492 [3:28:25<57:09, 31.75s/it]predicted value: tensor([[0.5273], [0.7148], [0.6250], [0.3828], [0.2266], [0.1641], [0.3672], [0.5117], [0.5898], [0.3438], [0.9805], [0.6133], [0.3223], [0.1914], [0.1953], [0.2090]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.5547], [0.4668], [0.2002], [0.3340], [0.4668], [0.6016], [0.6016], [0.5000], [1.0000], [0.7500], [0.0625], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000667572021484375 loss: 0.000583648681640625 loss: 0.0028533935546875 loss: 0.0008544921875 predicted value: tensor([[0.3770], [1.0391], [0.6602], [0.4199], [0.4258], [0.3359], [0.9961], [0.7266], [0.3340], [0.5234], [0.7344], [0.4004], [0.3145], [0.3965], [0.1621], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.8008], [0.4668], [0.3750], [0.4668], [1.0000], [0.8008], [0.2500], [0.6016], [0.8008], [0.4004], [0.3340], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000946044921875 loss: 0.0011749267578125 loss: 0.00106048583984375 loss: 0.00095367431640625 predicted value: tensor([[1.0156], [0.6367], [1.0312], [0.7422], [0.4062], 
[0.5156], [0.2969], [0.7500], [0.3340], [0.6484], [0.1982], [0.4902], [0.5078], [0.1572], [0.1992], [0.2129]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [1.0000], [0.8008], [0.4668], [0.4277], [0.2500], [0.8008], [0.4004], [0.7500], [0.2500], [0.6016], [0.6016], [0.0625], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0003223419189453125 loss: 0.001129150390625loss: 0.0009307861328125 loss: 0.001434326171875 predicted value: tensor([[0.6875], [0.6211], [0.4160], [0.4043], [0.1592], [0.4199], [1.0078], [0.2852], [0.4336], [0.7070], [0.6445], [0.2656], [0.4395], [0.1738], [0.1924], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [0.4668], [0.4668], [0.2500], [0.4668], [1.0000], [0.2500], [0.5000], [0.8008], [0.7500], [0.3340], [0.5000], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0014801025390625 loss: 0.00177001953125 loss: 0.0018768310546875 loss: 0.000911712646484375 78%|███████▊ | 385/492 [3:28:57<56:41, 31.79s/it] {'loss': 0.0047, 'learning_rate': 1e-05, 'epoch': 0.78} 78%|███████▊ | 385/492 [3:28:57<56:41, 31.79s/it]predicted value: tensor([[0.5586], [0.2500], [0.4512], [0.7812], [0.7578], [1.1016], [0.7305], [0.3574], [0.0674], [0.6641], [0.4883], [0.3906], [0.5547], [0.4121], [0.2021], [0.2285]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.4668], [0.8008], [0.8008], [1.0000], [0.8320], [0.5000], [0.0278], [0.8008], [0.6016], [0.4004], [0.6016], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012664794921875 loss: 0.0015411376953125 loss: 0.00124359130859375 loss: 0.00176239013671875 predicted value: tensor([[0.5781], [0.8008], [0.7188], [0.7578], [0.4473], [0.3164], [0.2539], [0.6836], [0.7617], [0.6172], [0.2520], [0.5859], [0.3594], [0.2158], [0.2490], [0.2207]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.5547], [0.8320], [0.8008], [0.8008], [0.4668], [0.3340], [0.2500], [0.7500], [0.7500], [0.7500], [0.2002], [0.6016], [0.4004], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000896453857421875 loss: 0.00063323974609375loss: 0.001190185546875 loss: 0.0008697509765625 predicted value: tensor([[0.2793], [1.0938], [0.8398], [0.5586], [0.4668], [0.5859], [0.4668], [0.6797], [1.1016], [0.5547], [0.6641], [0.3066], [0.3594], [0.3613], [0.2412], [0.2451]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [1.0000], [0.8320], [0.5547], [0.4668], [0.6016], [0.4668], [0.7500], [1.0000], [0.5000], [0.7500], [0.3340], [0.4004], [0.4004], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018463134765625 loss: 0.000713348388671875loss: 0.00104522705078125 loss: 0.0028076171875 predicted value: tensor([[0.3965], [0.8125], [1.0703], [0.4961], [0.5742], [0.4688], [0.6172], [0.4863], [1.0703], [0.2812], [0.5703], [1.0469], [0.3926], [0.4355], [0.2334], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2715], [0.8320], [1.0000], [0.5547], [0.6016], [0.4668], [0.6680], [0.8008], [1.0000], [0.3340], [0.6016], [1.0000], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000598907470703125 loss: 0.002227783203125 loss: 0.0017852783203125 loss: 0.0031280517578125 78%|███████▊ | 386/492 [3:29:29<56:12, 31.81s/it] {'loss': 0.0059, 'learning_rate': 1e-05, 'epoch': 0.78} 78%|███████▊ | 386/492 [3:29:29<56:12, 31.81s/it]predicted value: tensor([[0.6250], [0.4941], [0.7539], [0.6055], [0.2871], [0.4961], [0.8047], [0.5312], [1.0469], [0.3379], [0.7539], [0.4023], [0.4238], [0.2695], [0.4590], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.5703], [0.6680], [0.2002], [0.4668], [0.8008], [0.4668], [1.0000], [0.2500], [0.7500], [0.4004], [0.4004], [0.1670], [0.4004], [0.1670]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.00124359130859375 loss: 0.001129150390625 loss: 0.00164794921875 loss: 0.0015411376953125 predicted value: tensor([[0.6016], [1.0703], [0.4746], [0.5586], [0.7969], [1.0625], [0.6523], [0.6445], [1.0703], [0.4609], [0.7656], [0.6406], [0.3242], [0.6172], [0.4336], [0.3066]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.4668], [0.8008], [1.0000], [0.6016], [0.4668], [1.0000], [0.5000], [0.8008], [0.6016], [0.2002], [0.7500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001434326171875 loss: 0.0016937255859375loss: 0.0019378662109375 loss: 0.002227783203125 predicted value: tensor([[0.5898], [0.7578], [0.6211], [0.7969], [0.5273], [0.6250], [0.6680], [0.5977], [0.3047], [0.3789], [0.7656], [0.3438], [0.4375], [0.2500], [0.2520], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.5547], [0.8008], [0.4668], [0.5547], [0.6016], [0.5000], [0.2002], [0.3340], [0.8008], [0.2500], [0.4004], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00099945068359375 loss: 0.00299072265625 loss: 0.0009918212890625 loss: 0.0019378662109375 predicted value: tensor([[0.8320], [0.5195], [0.7930], [0.5195], [0.6133], [0.4863], [0.5039], [0.3145], [1.0547], [0.3516], [0.3457], [0.4453], [0.1465], [0.2715], [0.2490], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.8008], [0.4668], [0.5547], [0.3750], [0.4668], [0.2500], [1.0000], [0.3340], [0.3340], [0.4004], [0.0278], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00112152099609375 loss: 0.0015716552734375 loss: 0.00152587890625 loss: 0.00093841552734375 79%|███████▊ | 387/492 [3:30:00<55:40, 31.82s/it] {'loss': 0.0062, 'learning_rate': 1e-05, 'epoch': 0.79} 79%|███████▊ | 387/492 [3:30:00<55:40, 31.82s/it]predicted value: tensor([[1.0078], 
[0.3594], [0.9961], [0.6289], [0.2578], [0.2812], [0.9492], [0.2930], [0.5703], [0.6992], [0.1152], [0.6875], [0.6719], [0.2578], [0.2598], [0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [1.0000], [0.6016], [0.2002], [0.2500], [1.0000], [0.2500], [0.7500], [0.7500], [0.1113], [0.7500], [0.6016], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002777099609375 loss: 0.000911712646484375 loss: 0.0006256103515625 loss: 0.00142669677734375 predicted value: tensor([[0.9766], [0.4824], [0.9531], [0.2637], [0.5000], [0.7422], [0.6367], [0.6992], [0.5430], [0.4551], [0.4609], [0.1719], [0.5547], [0.4160], [0.2559], [0.2598]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.2500], [0.4668], [0.8008], [0.6016], [0.5547], [0.4277], [0.3340], [0.5000], [0.5000], [0.6016], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010223388671875 loss: 0.000873565673828125 loss: 0.0027008056640625 loss: 0.001678466796875 predicted value: tensor([[0.9727], [0.4805], [0.8320], [1.0078], [0.9805], [0.3359], [0.6602], [0.3145], [0.4590], [0.4043], [0.4180], [0.1348], [0.2656], [0.2158], [0.2168], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.8320], [1.0000], [1.0000], [0.2500], [0.6680], [0.2500], [0.3145], [0.4004], [0.4004], [0.0625], [0.2002], [0.1670], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00168609619140625 loss: 0.002685546875 loss: 0.00093841552734375 loss: 0.000751495361328125 predicted value: tensor([[0.2988], [0.2949], [0.9609], [0.7500], [0.9961], [0.6250], [0.9805], [0.5469], [0.5898], [0.5898], [0.6328], [0.6172], [0.2539], [0.2334], [0.2324], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.2500], [1.0000], [0.8008], [1.0000], [0.8008], [1.0000], [0.3750], [0.7500], [0.6016], 
[0.7500], [0.6016], [0.2500], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019683837890625 loss: 0.00090789794921875loss: 0.000946044921875 loss: 0.0035552978515625 79%|███████▉ | 388/492 [3:30:33<55:22, 31.95s/it] {'loss': 0.0064, 'learning_rate': 1e-05, 'epoch': 0.79} 79%|███████▉ | 388/492 [3:30:33<55:22, 31.95s/it]predicted value: tensor([[0.3770], [0.2637], [0.1553], [0.1895], [0.6484], [0.6367], [0.4258], [0.2500], [0.6016], [0.5898], [0.5352], [0.3301], [0.2969], [0.4902], [0.3418], [0.1030]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3340], [0.2002], [0.2500], [0.8008], [0.8008], [0.4668], [0.3340], [0.7500], [0.6016], [0.6016], [0.4004], [0.4004], [0.6016], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00445556640625 loss: 0.00408935546875 loss: 0.0021209716796875 loss: 0.002838134765625 predicted value: tensor([[0.3750], [0.3555], [0.6055], [0.6016], [0.2109], [0.3691], [0.5781], [0.4492], [0.5781], [0.5547], [0.5430], [0.1357], [0.2285], [0.3262], [0.1436], [0.1260]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.8008], [0.8008], [0.2500], [0.4668], [0.7500], [0.4668], [0.5000], [0.6016], [0.6016], [0.4004], [0.2500], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025787353515625 loss: 0.0035552978515625 loss: 0.00640869140625 loss: 0.005279541015625 predicted value: tensor([[0.8242], [0.8672], [0.3535], [0.2266], [0.4258], [0.4121], [0.6289], [0.2676], [0.5938], [0.6445], [0.2451], [0.4414], [0.3906], [0.6094], [0.1719], [0.1592]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.2500], [0.4668], [0.4668], [0.7500], [0.2500], [0.6016], [0.8008], [0.2500], [0.2002], [0.3750], [0.7500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002105712890625 loss: 0.0062255859375 loss: 0.0030059814453125 loss: 
0.0031585693359375 predicted value: tensor([[0.3750], [0.3887], [0.5977], [0.2441], [0.2080], [0.1904], [0.5117], [0.0618], [0.3984], [0.3379], [0.8320], [0.3672], [0.4043], [0.3516], [0.3730], [0.3145]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3750], [0.7500], [0.3340], [0.2002], [0.1670], [0.7500], [0.0400], [0.5000], [0.4004], [1.0000], [0.4004], [0.5000], [0.5000], [0.4004], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033721923828125 loss: 0.0025787353515625 loss: 0.0026702880859375 loss: 0.0028533935546875 79%|███████▉ | 389/492 [3:31:04<54:43, 31.88s/it] {'loss': 0.0143, 'learning_rate': 1e-05, 'epoch': 0.79} 79%|███████▉ | 389/492 [3:31:04<54:43, 31.88s/it]predicted value: tensor([[0.4883], [0.4121], [0.8203], [0.3516], [0.3770], [0.3379], [0.4648], [0.2100], [0.5742], [0.3770], [0.8125], [0.4941], [0.4961], [0.3066], [0.1235], [0.1099]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.4668], [1.0000], [0.3750], [0.4668], [0.4668], [0.6016], [0.2500], [0.7500], [0.8008], [1.0000], [0.6016], [0.7500], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029144287109375 loss: 0.00714111328125loss: 0.002899169921875 loss: 0.003204345703125 predicted value: tensor([[0.3965], [0.5273], [0.4004], [0.3340], [0.1680], [0.8281], [0.8281], [0.3535], [0.1523], [0.3945], [0.3789], [0.6719], [0.3301], [0.3125], [0.1045], [0.1377]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6172], [0.4668], [0.3145], [0.3340], [1.0000], [1.0000], [0.4668], [0.2002], [0.4668], [0.4668], [0.6680], [0.5000], [0.3340], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00323486328125 loss: 0.0030975341796875loss: 0.0045166015625 loss: 0.00439453125 predicted value: tensor([[0.3262], [0.8086], [0.5078], [0.8164], [0.3809], [0.2490], [0.6055], [0.3652], [0.5039], [0.4863], [0.6172], [0.3906], [0.3672], [0.1348], 
[0.1514], [0.1523]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.5547], [1.0000], [0.4668], [0.3340], [0.6016], [0.4668], [0.7500], [0.6016], [0.7500], [0.6016], [0.5000], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025482177734375 loss: 0.004058837890625 loss: 0.006683349609375 loss: 0.003814697265625 predicted value: tensor([[0.8516], [0.4434], [0.8398], [0.2197], [0.6445], [0.6328], [0.4102], [0.6133], [0.5352], [0.4531], [0.4824], [0.2021], [0.3809], [0.3457], [0.1309], [0.0908]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [1.0000], [0.2500], [0.8320], [0.8008], [0.8008], [0.6680], [0.7500], [0.6016], [0.5000], [0.4004], [0.4004], [0.5000], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0031280517578125 loss: 0.00634765625 loss: 0.0037689208984375 loss: 0.006561279296875 79%|███████▉ | 390/492 [3:31:37<54:25, 32.02s/it] {'loss': 0.0171, 'learning_rate': 1e-05, 'epoch': 0.79} 79%|███████▉ | 390/492 [3:31:37<54:25, 32.02s/it]predicted value: tensor([[0.8828], [0.6680], [0.4531], [0.2002], [0.2910], [0.6094], [0.2598], [0.6758], [0.6523], [0.2930], [0.1709], [0.3359], [0.4922], [0.1631], [0.1650], [0.1377]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.4668], [0.2500], [0.2500], [0.6016], [0.3340], [0.8320], [0.6680], [0.2500], [0.4004], [0.3340], [0.5000], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00064849853515625 loss: 0.00133514404296875 loss: 0.0022735595703125 loss: 0.00109100341796875 predicted value: tensor([[0.5234], [0.4395], [0.4297], [0.2012], [0.7461], [0.8789], [0.7148], [0.2578], [0.4434], [0.2021], [0.3555], [0.6289], [0.3750], [0.1660], [0.1426], [0.1465]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], [0.2500], [0.8008], [1.0000], [0.6680], [0.2500], [0.4668], 
[0.3340], [0.4004], [0.6016], [0.4004], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001251220703125 loss: 0.0052490234375 loss: 0.00098419189453125 loss: 0.0017852783203125 predicted value: tensor([[0.8672], [0.7891], [0.4980], [0.8555], [0.2393], [0.6367], [0.3848], [0.2930], [0.2256], [0.4082], [0.8008], [0.3984], [0.2871], [0.1982], [0.1572], [0.1436]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.5547], [1.0000], [0.2500], [0.8008], [0.4668], [0.2500], [0.2002], [0.5000], [1.0000], [0.4004], [0.3340], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0042724609375 loss: 0.0020904541015625loss: 0.0017547607421875 loss: 0.002685546875 predicted value: tensor([[0.8711], [0.8594], [0.8516], [0.7266], [0.6406], [0.8789], [0.6523], [0.5703], [0.6133], [0.7109], [0.8555], [0.3965], [0.3770], [0.1855], [0.2812], [0.1465]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [1.0000], [0.8008], [0.7500], [1.0000], [0.5703], [0.6016], [0.6016], [0.8008], [1.0000], [0.4004], [0.4004], [0.2002], [0.2852], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00537109375 loss: 0.003387451171875 loss: 0.002166748046875 loss: 0.000644683837890625 79%|███████▉ | 391/492 [3:32:09<53:55, 32.03s/it] {'loss': 0.0092, 'learning_rate': 1e-05, 'epoch': 0.79} 79%|███████▉ | 391/492 [3:32:09<53:55, 32.03s/it]predicted value: tensor([[0.6289], [0.9570], [1.0156], [0.5039], [0.5391], [0.8945], [0.9688], [0.8047], [0.6914], [0.7344], [0.9219], [0.5312], [0.7930], [0.6172], [0.2832], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.4668], [0.4668], [0.8008], [1.0000], [0.6680], [0.6016], [0.6016], [1.0000], [0.4004], [0.7500], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0014495849609375 loss: 0.00112152099609375 loss: 0.0016937255859375 loss: 
0.0035552978515625 predicted value: tensor([[0.6367], [0.5273], [0.3652], [0.3867], [0.9531], [0.9648], [0.3359], [0.6914], [0.6875], [0.5547], [0.3301], [0.9141], [0.6914], [0.4727], [0.2539], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.3340], [0.3340], [1.0000], [1.0000], [0.2500], [0.8320], [0.6016], [0.4668], [0.2500], [1.0000], [0.6016], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001708984375 loss: 0.0030670166015625 loss: 0.00167083740234375 loss: 0.00262451171875 predicted value: tensor([[0.6172], [0.8594], [0.9961], [0.6133], [0.9727], [0.3711], [0.6914], [0.6016], [0.8203], [0.6367], [0.3359], [0.7305], [0.6719], [0.5312], [0.2500], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [1.0000], [0.6016], [1.0000], [0.3340], [0.6016], [0.6016], [0.8008], [0.4648], [0.2500], [0.6016], [0.8008], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00262451171875 loss: 0.0021209716796875 loss: 0.00164031982421875 loss: 0.0045166015625 predicted value: tensor([[0.3652], [0.9648], [0.6289], [0.9609], [0.5938], [1.0000], [0.9805], [0.7695], [0.9766], [0.3848], [0.6602], [0.5625], [0.7344], [0.5195], [0.2871], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [1.0000], [0.5547], [1.0000], [0.5000], [1.0000], [1.0000], [0.7500], [1.0000], [0.2002], [0.6016], [0.5000], [0.6016], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011444091796875 loss: 0.0034942626953125 loss: 0.0028228759765625 loss: 0.0021514892578125 80%|███████▉ | 392/492 [3:32:41<53:25, 32.05s/it] {'loss': 0.0094, 'learning_rate': 1e-05, 'epoch': 0.8} 80%|███████▉ | 392/492 [3:32:41<53:25, 32.05s/it]predicted value: tensor([[0.5625], [0.7031], [0.5078], [0.5781], [0.9805], [0.8477], [0.4082], [0.7109], [1.0156], [0.6523], [0.9766], [0.5312], [0.7266], [0.5703], 
[0.5508], [0.3027]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.3750], [0.4668], [0.8320], [0.8008], [0.2500], [0.6016], [1.0000], [0.4277], [1.0000], [0.4004], [0.5000], [0.3340], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0057373046875 loss: 0.00531005859375loss: 0.0038299560546875 loss: 0.00360107421875 predicted value: tensor([[0.5469], [0.5195], [0.5742], [0.9922], [0.3125], [1.0156], [0.9219], [0.9922], [0.7344], [0.7461], [0.8320], [0.4863], [0.6641], [0.2793], [0.5391], [0.2949]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [1.0000], [0.2002], [1.0000], [0.8008], [1.0000], [0.6016], [0.6016], [0.8008], [0.5000], [0.6016], [0.2002], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028076171875 loss: 0.00186920166015625loss: 0.002471923828125 loss: 0.004974365234375 predicted value: tensor([[1.0078], [0.8945], [0.7734], [0.8242], [0.9453], [0.4102], [0.9102], [0.8828], [0.4434], [0.6289], [0.7500], [0.4961], [0.4707], [0.2949], [0.4395], [0.2891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.7500], [0.5703], [0.7148], [0.2500], [0.8008], [0.8008], [0.2500], [0.3750], [0.6016], [0.3340], [0.3340], [0.1670], [0.0400], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0081787109375 loss: 0.0081787109375 loss: 0.004486083984375 loss: 0.00396728515625 predicted value: tensor([[1.0469], [0.6523], [0.6328], [0.8164], [0.8359], [0.7617], [0.6953], [0.8320], [0.7773], [0.7656], [0.6914], [0.4902], [0.3086], [0.5000], [0.5273], [0.3223]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.5547], [0.8008], [0.6680], [0.6016], [0.6016], [0.7500], [0.5000], [0.6016], [0.5000], [0.3340], [0.2002], [0.3340], [0.4004], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0054931640625 loss: 0.003753662109375 loss: 
0.00531005859375loss: 0.0030059814453125 80%|███████▉ | 393/492 [3:33:13<52:46, 31.99s/it] {'loss': 0.0182, 'learning_rate': 1e-05, 'epoch': 0.8} 80%|███████▉ | 393/492 [3:33:13<52:46, 31.99s/it]predicted value: tensor([[0.5508], [0.5156], [1.0156], [0.5273], [0.7773], [0.5195], [0.8867], [0.7266], [0.5391], [0.7852], [0.5703], [0.9805], [0.5234], [0.2969], [0.2344], [0.2402]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6094], [0.4668], [1.0000], [0.4668], [0.6680], [0.4668], [0.8008], [0.7500], [0.4668], [0.6680], [0.3145], [1.0000], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026397705078125 loss: 0.0021514892578125loss: 0.00121307373046875 loss: 0.0012969970703125 predicted value: tensor([[0.5938], [0.8281], [0.6484], [0.4961], [0.3633], [0.3184], [0.6523], [0.7695], [0.5781], [1.0078], [0.7344], [0.5117], [0.7109], [0.4609], [0.2393], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.5547], [0.4668], [0.3340], [0.2500], [0.6680], [0.6016], [0.4668], [1.0000], [0.8008], [0.5000], [0.7500], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019378662109375 loss: 0.00112152099609375 loss: 0.004791259765625 loss: 0.005096435546875 predicted value: tensor([[1.0391], [0.5391], [0.6328], [0.9570], [0.8906], [0.8086], [0.7578], [0.8477], [0.7227], [0.7617], [0.5820], [0.8203], [0.4609], [0.7305], [0.2773], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.5547], [1.0000], [0.8008], [0.7500], [0.6016], [0.8555], [0.6680], [0.4668], [0.5000], [0.8008], [0.3340], [0.6016], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00238037109375 loss: 0.00109100341796875 loss: 0.0030059814453125 loss: 0.0024261474609375 predicted value: tensor([[0.6250], [0.3613], [0.9570], [0.5156], [0.3398], [0.7383], [0.9844], [0.7305], [1.0234], [0.6133], [0.9609], 
[0.7148], [0.4414], [0.2891], [0.0645], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [1.0000], [0.4668], [0.3340], [0.4668], [1.0000], [0.7500], [1.0000], [0.5000], [1.0000], [0.6016], [0.3340], [0.2852], [0.0625], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001190185546875 loss: 0.0021209716796875 loss: 0.0027008056640625 loss: 0.002197265625 80%|████████ | 394/492 [3:33:46<52:37, 32.22s/it] {'loss': 0.0093, 'learning_rate': 1e-05, 'epoch': 0.8} 80%|████████ | 394/492 [3:33:46<52:37, 32.22s/it]predicted value: tensor([[0.4590], [0.3262], [0.7578], [0.1924], [0.5352], [0.4141], [0.4375], [0.6836], [0.6055], [0.2734], [0.6523], [0.4609], [0.4121], [0.1069], [0.1494], [0.1836]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.6680], [0.3340], [0.8008], [0.4668], [0.6016], [0.7500], [0.5000], [0.3340], [0.6680], [0.6680], [0.5000], [0.1670], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027008056640625 loss: 0.00482177734375 loss: 0.0034027099609375 loss: 0.00086212158203125 predicted value: tensor([[0.4922], [0.5234], [0.5234], [0.4160], [0.4531], [0.3770], [0.2383], [0.4805], [0.3887], [0.2656], [0.2812], [0.4004], [0.3906], [0.0166], [0.1162], [0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.5547], [0.4668], [0.8008], [0.3750], [0.2500], [0.4668], [0.4668], [0.3340], [0.6016], [0.3340], [0.4004], [0.0400], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002410888671875 loss: 0.00110626220703125 loss: 0.00408935546875 loss: 0.0020294189453125 predicted value: tensor([[0.3828], [0.3750], [0.4492], [0.5156], [0.3789], [0.5977], [0.3457], [0.8750], [0.2402], [0.6562], [0.8984], [0.5039], [0.4629], [0.6250], [0.1328], [0.1553]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.4668], [0.8008], [0.4668], [0.6016], 
[0.3750], [1.0000], [0.2500], [0.7500], [1.0000], [0.5000], [0.2500], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019989013671875 loss: 0.0029144287109375loss: 0.002166748046875 loss: 0.00107574462890625 predicted value: tensor([[ 0.8945], [ 0.3672], [ 0.8906], [ 0.4258], [ 0.3848], [ 0.7891], [ 0.4102], [ 0.6680], [ 0.6836], [ 0.7070], [-0.0464], [ 0.4043], [ 0.3418], [ 0.1216], [ 0.1553], [ 0.4375]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.4668], [0.4668], [0.8008], [0.4668], [0.8008], [0.7500], [0.7500], [0.0400], [0.3340], [0.4004], [0.1670], [0.2500], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00115966796875 loss: 0.001129150390625 loss: 0.0032196044921875 loss: 0.001556396484375 80%|████████ | 395/492 [3:34:17<51:55, 32.12s/it] {'loss': 0.0092, 'learning_rate': 1e-05, 'epoch': 0.8} 80%|████████ | 395/492 [3:34:17<51:55, 32.12s/it]predicted value: tensor([[0.9219], [0.9141], [0.3711], [0.4961], [0.3496], [0.3965], [0.5703], [0.7266], [0.1040], [0.7695], [0.3984], [0.5742], [0.2617], [0.3672], [0.1416], [0.1133]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.3750], [0.4648], [0.4668], [0.4668], [0.6016], [0.7500], [0.2500], [0.8008], [0.4004], [0.7500], [0.2500], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00180816650390625 loss: 0.00186920166015625loss: 0.0021820068359375 loss: 0.0019378662109375 predicted value: tensor([[0.9180], [0.6523], [0.4023], [0.4590], [0.8984], [0.8867], [0.3672], [0.7461], [0.2070], [0.3848], [0.5391], [0.5234], [0.3398], [0.3555], [0.1035], [0.1226]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [0.4668], [0.5547], [1.0000], [1.0000], [0.4668], [0.8008], [0.1670], [0.4668], [0.5000], [0.6016], [0.3340], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020751953125 
loss: 0.00173187255859375 loss: 0.00138092041015625 loss: 0.0037078857421875 predicted value: tensor([[0.7578], [0.3594], [0.3672], [0.3398], [0.1514], [0.6289], [0.9297], [0.8047], [0.6055], [0.8828], [0.2314], [0.3965], [0.5703], [0.2441], [0.3594], [0.1396]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.4668], [0.3750], [0.2002], [0.8008], [1.0000], [0.8320], [0.6016], [1.0000], [0.3340], [0.5000], [0.6016], [0.3340], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00110626220703125 loss: 0.00457763671875 loss: 0.00191497802734375 loss: 0.004547119140625 predicted value: tensor([[0.3457], [0.1758], [0.9102], [0.3984], [0.3242], [0.6992], [0.5742], [0.6367], [0.6055], [0.5430], [0.3105], [0.8672], [0.8633], [0.1299], [0.1338], [0.1226]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3340], [1.0000], [0.4668], [0.4668], [0.3750], [0.7500], [0.7500], [0.7500], [0.5000], [0.4004], [1.0000], [1.0000], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023651123046875 loss: 0.002197265625 loss: 0.0048828125 loss: 0.0020904541015625 80%|████████ | 396/492 [3:34:50<51:24, 32.13s/it] {'loss': 0.0101, 'learning_rate': 1e-05, 'epoch': 0.8} 80%|████████ | 396/492 [3:34:50<51:24, 32.13s/it]predicted value: tensor([[0.5352], [0.2500], [0.9297], [0.1895], [0.3711], [0.4551], [0.9375], [0.4258], [0.6172], [0.4102], [0.6758], [0.5000], [0.4375], [0.2969], [0.2080], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [1.0000], [0.2500], [0.4668], [0.4668], [1.0000], [0.5000], [0.7500], [0.5000], [0.4668], [0.6016], [0.4004], [0.0400], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018310546875 loss: 0.0028076171875 loss: 0.0025177001953125 loss: 0.0011749267578125 predicted value: tensor([[0.8594], [0.1846], [0.9609], [0.4004], [0.4219], [0.4883], [0.6602], [0.5977], [0.5469], 
[0.3105], [0.5742], [0.4531], [0.5820], [0.3828], [0.1699], [0.1680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2002], [1.0000], [0.4668], [0.3750], [0.4668], [0.6680], [0.7500], [0.6016], [0.2500], [0.6016], [0.4004], [0.5000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00104522705078125 loss: 0.000797271728515625loss: 0.000530242919921875 loss: 0.001953125 predicted value: tensor([[1.0078], [0.9531], [0.2080], [0.6953], [0.2207], [0.5938], [0.4375], [0.3887], [0.5742], [0.4395], [0.4473], [0.6211], [0.3984], [0.1533], [0.2109], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.2500], [0.6680], [0.2002], [0.5000], [0.4668], [0.4668], [0.6016], [0.3750], [0.4004], [0.3750], [0.5000], [0.1670], [0.2002], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028076171875 loss: 0.0012359619140625 loss: 0.0019378662109375 loss: 0.0009002685546875 predicted value: tensor([[0.5000], [0.9570], [0.6797], [0.4062], [0.7734], [0.5430], [0.8008], [0.9453], [0.9180], [0.6953], [0.5312], [0.6016], [0.3633], [0.4395], [0.1592], [0.1377]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.6680], [0.4668], [0.8008], [0.4668], [0.8008], [1.0000], [1.0000], [0.6016], [0.6016], [0.6016], [0.3340], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00067901611328125 loss: 0.00084686279296875 loss: 0.000652313232421875loss: 0.00109100341796875 81%|████████ | 397/492 [3:35:21<50:29, 31.89s/it] {'loss': 0.0057, 'learning_rate': 1e-05, 'epoch': 0.81} 81%|████████ | 397/492 [3:35:21<50:29, 31.89s/it]predicted value: tensor([[0.5625], [0.5312], [0.8750], [1.0781], [1.1016], [0.7656], [0.3438], [0.9023], [0.7383], [0.4863], [0.5547], [0.5977], [0.5156], [0.2832], [0.5117], [0.2754]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8320], 
[1.0000], [1.0000], [0.7500], [0.2500], [0.8008], [0.7500], [0.3340], [0.5000], [0.6016], [0.4004], [0.1670], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003326416015625 loss: 0.0018310546875 loss: 0.0030364990234375 loss: 0.002349853515625 predicted value: tensor([[0.5156], [0.9023], [0.5039], [0.3516], [0.9102], [0.6562], [0.3340], [1.0938], [0.3398], [0.8672], [0.6797], [0.5039], [0.6719], [1.0469], [0.2676], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.7500], [0.4668], [0.2500], [0.8008], [0.3750], [0.2002], [1.0000], [0.3340], [0.8008], [0.4668], [0.3340], [0.5000], [1.0000], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025634765625 loss: 0.0035400390625 loss: 0.0045166015625 loss: 0.00147247314453125 predicted value: tensor([[0.9609], [1.0781], [0.5352], [1.0938], [1.0625], [0.5430], [0.5391], [0.7383], [0.7148], [1.0703], [0.4434], [0.4766], [0.4961], [0.5195], [0.2441], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.4668], [1.0000], [1.0000], [0.4668], [0.4668], [0.6680], [0.8320], [1.0000], [0.2002], [0.4004], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0036468505859375 loss: 0.00238037109375 loss: 0.0025482177734375 loss: 0.0025482177734375 predicted value: tensor([[0.9258], [0.5312], [1.0781], [1.0859], [0.2930], [0.6641], [0.4316], [0.3516], [0.6875], [0.8125], [0.4980], [0.5156], [0.5898], [0.5195], [0.2871], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [1.0000], [1.0000], [0.1670], [0.4277], [0.2500], [0.2500], [0.6016], [0.6680], [0.4004], [0.4004], [0.4668], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025634765625 loss: 0.002288818359375 loss: 0.004486083984375 loss: 0.0034942626953125 81%|████████ | 398/492 [3:35:53<50:07, 31.99s/it] {'loss': 0.0116, 
'learning_rate': 1e-05, 'epoch': 0.81} 81%|████████ | 398/492 [3:35:53<50:07, 31.99s/it]predicted value: tensor([[1.1719], [0.3691], [0.9883], [0.6797], [0.8320], [0.7227], [0.9531], [0.6328], [0.9375], [1.1328], [1.0859], [0.5859], [0.3281], [0.5156], [0.1543], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [0.8320], [0.5547], [0.6680], [0.6016], [0.8008], [0.6016], [0.8008], [1.0000], [1.0000], [0.3750], [0.2002], [0.2852], [0.0400], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004302978515625 loss: 0.004486083984375 loss: 0.0048828125 loss: 0.004425048828125 predicted value: tensor([[0.4980], [0.5273], [0.8594], [0.4141], [0.6875], [0.9297], [0.9023], [0.7305], [0.3965], [0.5508], [0.8125], [0.5508], [0.9023], [0.3262], [0.2910], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.6680], [0.3340], [0.5547], [0.6680], [0.8008], [0.5547], [0.2500], [0.3750], [0.6680], [0.3340], [0.8008], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004486083984375 loss: 0.0035552978515625 loss: 0.0054931640625 loss: 0.0042724609375 predicted value: tensor([[0.5469], [1.0000], [0.6602], [0.6445], [0.5625], [0.9453], [1.1016], [1.1875], [0.5586], [1.1484], [0.7031], [0.5391], [0.1816], [0.4902], [0.2949], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.5547], [0.5547], [0.4668], [0.8008], [1.0000], [1.0000], [0.4668], [1.0000], [0.6016], [0.4004], [0.0400], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005706787109375 loss: 0.0038604736328125 loss: 0.004058837890625 loss: 0.00445556640625 predicted value: tensor([[0.6602], [0.5586], [0.5508], [1.0859], [0.9180], [0.3516], [0.8672], [0.6367], [0.5547], [1.0938], [1.1328], [1.1094], [1.1172], [0.4258], [0.3320], [0.3242]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.5547], [0.3750], [0.4668], [1.0000], [0.8008], [0.2002], [0.8008], [0.5000], [0.4668], [1.0000], [1.0000], [1.0000], [1.0000], [0.0278], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002685546875 loss: 0.0037689208984375 loss: 0.0036468505859375 loss: 0.005615234375 81%|████████ | 399/492 [3:36:25<49:44, 32.10s/it] {'loss': 0.0174, 'learning_rate': 1e-05, 'epoch': 0.81} 81%|████████ | 399/492 [3:36:25<49:44, 32.10s/it]predicted value: tensor([[0.6211], [0.5078], [0.8789], [0.8789], [1.0781], [0.3379], [0.5312], [0.6055], [0.6445], [0.6172], [0.6406], [0.4219], [0.5078], [0.4727], [0.2734], [0.2715]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.8008], [0.8008], [1.0000], [0.2002], [0.4668], [0.6016], [0.6016], [0.6016], [0.7500], [0.2852], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023651123046875 loss: 0.0021514892578125 loss: 0.00177764892578125 loss: 0.0028533935546875 predicted value: tensor([[0.5547], [0.7227], [0.5312], [0.5391], [0.7383], [0.8164], [1.0938], [0.6875], [0.6836], [1.0781], [0.7344], [0.9062], [0.4824], [0.7031], [0.4883], [0.2812]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.3750], [0.4668], [0.4668], [0.7500], [0.6680], [1.0000], [0.6016], [0.6016], [1.0000], [0.7500], [0.8008], [0.3340], [0.7500], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0024566650390625 loss: 0.0035400390625 loss: 0.0038909912109375 loss: 0.0026702880859375 predicted value: tensor([[0.7773], [0.8711], [0.5898], [0.6484], [1.1016], [0.5234], [0.6055], [0.6367], [0.7578], [1.1094], [0.6836], [0.6484], [0.4609], [0.4414], [0.2500], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.8320], [0.5547], [0.6016], [1.0000], [0.4668], [0.6016], [0.6016], [0.7500], [1.0000], [0.6016], [0.6016], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.0026702880859375 loss: 0.000911712646484375loss: 0.00244140625 loss: 0.0023956298828125 predicted value: tensor([[0.5312], [0.4727], [1.1172], [1.1016], [0.4805], [0.3145], [0.5703], [0.6875], [0.3750], [0.7148], [0.8594], [0.5000], [0.2520], [0.2754], [0.2432], [0.2559]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [1.0000], [1.0000], [0.3750], [0.2002], [0.3750], [0.7500], [0.2500], [0.6016], [0.8008], [0.4004], [0.2002], [0.2500], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001953125 loss: 0.0025482177734375loss: 0.00421142578125 loss: 0.00244140625 81%|████████▏ | 400/492 [3:36:58<49:11, 32.08s/it] {'loss': 0.0103, 'learning_rate': 1e-05, 'epoch': 0.81} 81%|████████▏ | 400/492 [3:36:58<49:11, 32.08s/it]Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41. Non-default generation parameters: {'max_length': 4096} /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn( /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. 
warnings.warn( /vol3/ctr/.conda/envs/llava-rlhf/lib/python3.10/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None warnings.warn( predicted value: tensor([[0.4727], [0.3848], [0.5117], [0.4297], [0.3652], [0.3633], [0.3047], [0.6602], [0.7031], [1.0234], [0.3359], [0.4434], [0.4004], [0.1777], [0.1846], [0.1533]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.4668], [0.3750], [0.2500], [0.2500], [0.7500], [0.8008], [1.0000], [0.3340], [0.2500], [0.3340], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00145721435546875 loss: 0.00191497802734375 loss: 0.00146484375 loss: 0.001068115234375 predicted value: tensor([[0.4297], [0.4961], [0.9492], [0.4199], [0.1904], [0.3984], [0.2197], [0.7070], [0.9336], [0.9609], [0.4004], [0.9609], [0.1611], [0.3906], [0.1748], [0.1621]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4648], [1.0000], [0.3750], [0.2002], [0.3145], [0.2500], [0.8008], [1.0000], [1.0000], [0.5000], [1.0000], [0.2002], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001708984375 loss: 0.0022430419921875 loss: 0.000865936279296875 loss: 0.00112152099609375 predicted value: tensor([[1.0469], [0.3516], [0.8125], [0.6602], [1.0234], [0.9883], [1.0156], [0.2461], [0.4590], [0.5703], [0.5781], [0.2471], [0.3945], [0.3535], [0.1953], [0.1992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.8320], [0.8008], [1.0000], [1.0000], [1.0000], [0.2500], [0.4668], [0.6016], [0.7500], [0.2500], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00046539306640625 loss: 0.00238037109375 loss: 0.00106048583984375 loss: 0.0020294189453125 predicted value: tensor([[0.4062], [0.8008], [0.7812], [0.6680], [0.6172], [0.9727], [0.5195], [0.7734], [0.5117], [0.4141], [0.4824], 
[0.3730], [0.5234], [0.1670], [0.1748], [0.1445]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.8320], [0.8008], [0.7500], [1.0000], [0.6016], [0.8008], [0.6016], [0.5000], [0.4668], [0.4004], [0.5000], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00439453125 loss: 0.0004177093505859375 loss: 0.0013275146484375 loss: 0.00104522705078125 82%|████████▏ | 401/492 [3:39:34<1:45:14, 69.39s/it] {'loss': 0.0062, 'learning_rate': 1e-05, 'epoch': 0.82} 82%|████████▏ | 401/492 [3:39:34<1:45:14, 69.39s/it]predicted value: tensor([[1.0156], [0.2197], [0.3828], [0.9805], [0.6367], [0.4258], [0.6406], [0.2285], [0.2949], [0.1504], [0.3281], [0.2432], [0.6055], [0.3262], [0.1348], [0.1572]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [0.3750], [1.0000], [0.8008], [0.4668], [0.7500], [0.3340], [0.3340], [0.2500], [0.4004], [0.2500], [0.7500], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0014495849609375 loss: 0.0021209716796875 loss: 0.002288818359375 loss: 0.0008544921875 predicted value: tensor([[0.3398], [1.0000], [1.0078], [0.9531], [0.2158], [0.4355], [0.9805], [0.5469], [0.6016], [0.3184], [0.3262], [0.5820], [0.5781], [0.3828], [0.3945], [0.1396]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [1.0000], [0.3340], [0.3750], [1.0000], [0.7500], [0.4668], [0.4004], [0.3340], [0.6016], [0.6016], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001556396484375 loss: 0.0027008056640625 loss: 0.0020599365234375 loss: 0.002105712890625 predicted value: tensor([[0.8008], [0.4648], [0.6953], [0.5664], [0.9727], [0.1973], [0.2158], [0.3672], [0.6328], [0.3711], [0.9453], [0.3750], [0.5664], [0.5117], [0.1387], [0.1582]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.5547], [0.8008], [0.5547], [1.0000], 
[0.2002], [0.2500], [0.4668], [0.8008], [0.3340], [1.0000], [0.4004], [0.3340], [0.7500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0035552978515625 loss: 0.0028228759765625loss: 0.00152587890625 loss: 0.00189971923828125 predicted value: tensor([[0.4590], [0.2236], [0.4258], [1.0000], [0.5938], [0.5312], [0.3750], [0.3418], [0.2188], [0.4648], [0.3594], [0.4199], [0.3340], [0.1543], [0.1514], [0.1396]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [0.4668], [1.0000], [0.8008], [0.4668], [0.2500], [0.3750], [0.2500], [0.5000], [0.4668], [0.5000], [0.3340], [0.2002], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00151824951171875 loss: 0.0026702880859375 loss: 0.0020294189453125 loss: 0.002044677734375 82%|████████▏ | 402/492 [3:40:07<1:27:44, 58.49s/it] {'loss': 0.0083, 'learning_rate': 1e-05, 'epoch': 0.82} 82%|████████▏ | 402/492 [3:40:07<1:27:44, 58.49s/it]predicted value: tensor([[0.7656], [0.5703], [0.6641], [0.7656], [1.0391], [0.4648], [0.4316], [0.4121], [1.0234], [0.3027], [0.3496], [0.3711], [0.3965], [0.3906], [0.1973], [0.1689]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.8008], [0.8008], [1.0000], [0.4668], [0.4668], [0.3340], [1.0000], [0.3340], [0.3340], [0.4004], [0.5000], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000553131103515625 loss: 0.00142669677734375 loss: 0.000934600830078125 loss: 0.000911712646484375 predicted value: tensor([[1.0078], [0.4434], [0.9961], [0.3945], [0.5234], [0.8594], [0.3809], [0.2793], [0.2188], [0.6367], [0.3848], [0.3770], [0.3457], [0.4258], [0.5156], [0.1865]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.3750], [0.5547], [0.8320], [0.2500], [0.2500], [0.2500], [0.7500], [0.5000], [0.4004], [0.2500], [0.4004], [0.5000], [0.1113]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.0012359619140625loss: 0.000690460205078125 loss: 0.00099945068359375 loss: 0.0010986328125 predicted value: tensor([[0.8711], [0.5195], [0.7773], [0.7734], [0.9688], [0.5195], [0.4004], [0.3945], [0.5508], [0.6406], [0.3828], [1.0078], [0.6016], [0.4355], [0.2061], [0.1611]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.8008], [0.8008], [1.0000], [0.6680], [0.4668], [0.7500], [0.5547], [0.4668], [0.4004], [1.0000], [0.7500], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013885498046875 loss: 0.0008392333984375 loss: 0.0033416748046875 loss: 0.000949859619140625 predicted value: tensor([[0.7344], [0.5664], [0.7695], [0.2812], [0.9844], [0.7422], [0.4258], [0.6914], [0.7148], [0.7500], [0.3887], [0.5742], [0.4941], [0.4980], [0.1895], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.5547], [0.8008], [0.2500], [1.0000], [0.8008], [0.3750], [0.8008], [0.8008], [0.8008], [0.4004], [0.5000], [0.5000], [0.5000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000576019287109375 loss: 0.000637054443359375 loss: 0.00023365020751953125 loss: 0.002777099609375 82%|████████▏ | 403/492 [3:40:40<1:15:29, 50.89s/it] {'loss': 0.0046, 'learning_rate': 1e-05, 'epoch': 0.82} 82%|████████▏ | 403/492 [3:40:40<1:15:29, 50.89s/it]predicted value: tensor([[1.1172], [0.7539], [0.5312], [0.5039], [1.0781], [0.5234], [0.5156], [0.8750], [0.5312], [0.3379], [0.4785], [0.3672], [0.5078], [0.6523], [0.3125], [0.3242]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [0.4668], [0.4668], [1.0000], [0.4668], [0.4668], [0.8008], [0.4004], [0.2002], [0.4004], [0.2002], [0.5000], [0.6016], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027618408203125 loss: 0.002105712890625loss: 0.0034942626953125 loss: 0.0037078857421875 predicted value: tensor([[0.5273], [0.8203], [0.5078], [0.9102], 
[0.5195], [0.4980], [0.7773], [0.7500], [0.4043], [0.6836], [0.4219], [1.0859], [0.5000], [0.7266], [0.3242], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [0.3750], [0.8008], [0.3750], [0.4668], [0.6680], [0.6016], [0.2002], [0.6016], [0.3340], [1.0000], [0.4004], [0.7500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003082275390625 loss: 0.0030975341796875loss: 0.00299072265625 loss: 0.00151824951171875 predicted value: tensor([[0.5547], [1.1328], [1.1172], [0.4863], [0.6133], [1.1094], [1.1484], [1.0859], [0.6016], [1.0703], [0.3828], [0.5078], [0.6680], [0.3125], [0.3145], [0.3262]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2832], [1.0000], [1.0000], [0.4668], [0.5547], [1.0000], [1.0000], [1.0000], [0.5000], [1.0000], [0.2500], [0.5000], [0.6016], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021514892578125 loss: 0.0032806396484375 loss: 0.003082275390625 loss: 0.0037841796875 predicted value: tensor([[0.8789], [0.4922], [0.4082], [0.3398], [1.0938], [0.5781], [0.8281], [0.8086], [0.3633], [0.6992], [0.5391], [0.5039], [0.4355], [0.7383], [0.3262], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.2500], [0.3340], [1.0000], [0.4668], [0.8008], [0.8320], [0.2500], [0.4668], [0.3750], [0.3340], [0.2500], [0.6016], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021514892578125 loss: 0.0038604736328125loss: 0.0038604736328125 loss: 0.0036163330078125 82%|████████▏ | 404/492 [3:41:13<1:06:35, 45.40s/it] {'loss': 0.0121, 'learning_rate': 1e-05, 'epoch': 0.82} 82%|████████▏ | 404/492 [3:41:13<1:06:35, 45.40s/it]predicted value: tensor([[0.6250], [0.8750], [0.6211], [0.9023], [0.3594], [0.6133], [0.8789], [0.7422], [0.5586], [0.6367], [0.6406], [0.7539], [0.5625], [0.5078], [0.3008], [0.3320]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value 
label: tensor([[0.5547], [0.8320], [0.8008], [0.8008], [0.2002], [0.5547], [0.8008], [0.7500], [0.3750], [0.6016], [0.5000], [0.6016], [0.5000], [0.3340], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0034942626953125 loss: 0.0032196044921875loss: 0.004608154296875 loss: 0.004638671875 predicted value: tensor([[0.5625], [0.8594], [1.1172], [0.5625], [0.8555], [0.9023], [1.1172], [0.6992], [0.9023], [1.1016], [0.5781], [0.5703], [0.5859], [0.7773], [0.3203], [0.3496]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [1.0000], [0.3750], [0.6016], [0.8008], [1.0000], [0.5000], [0.7500], [1.0000], [0.5000], [0.4668], [0.5000], [0.7500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00823974609375 loss: 0.00445556640625loss: 0.004730224609375 loss: 0.0030517578125 predicted value: tensor([[0.5156], [0.5586], [0.5586], [0.5703], [0.5586], [0.9492], [1.1016], [0.6992], [1.1328], [0.8945], [0.2500], [0.7266], [0.5312], [0.3281], [0.3262], [0.3164]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.4668], [0.4668], [0.4668], [0.8008], [1.0000], [0.6016], [1.0000], [0.8008], [0.0400], [0.6016], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007049560546875 loss: 0.00384521484375loss: 0.00396728515625 loss: 0.004547119140625 predicted value: tensor([[0.3594], [0.3984], [0.6211], [0.9766], [0.5898], [0.3730], [0.6406], [1.1328], [0.8086], [0.5898], [0.7266], [0.7148], [0.2432], [0.3398], [0.2949], [0.3262]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.3340], [0.5547], [0.8555], [0.3750], [0.2002], [0.5547], [1.0000], [0.8008], [0.5000], [0.7500], [0.6016], [0.5000], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00341796875 loss: 0.003814697265625loss: 0.0040283203125 loss: 0.004058837890625 82%|████████▏ | 405/492 [3:41:45<59:53, 
41.31s/it] {'loss': 0.0178, 'learning_rate': 1e-05, 'epoch': 0.82} 82%|████████▏ | 405/492 [3:41:45<59:53, 41.31s/it]predicted value: tensor([[0.6484], [0.5117], [0.4883], [1.1016], [0.5703], [0.5352], [0.8906], [0.5234], [1.0547], [0.5078], [0.5039], [0.4941], [0.3359], [0.2832], [0.2559], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5312], [0.4668], [0.4668], [1.0000], [0.4668], [0.4668], [0.8008], [0.4668], [1.0000], [0.4004], [0.4004], [0.3340], [0.4004], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002227783203125 loss: 0.0027618408203125 loss: 0.0019683837890625 loss: 0.002899169921875 predicted value: tensor([[0.5312], [0.5156], [0.5156], [0.3516], [0.5195], [0.5391], [0.3867], [0.6484], [0.5195], [0.6289], [0.7148], [0.5898], [0.4883], [0.2871], [0.2871], [0.2432]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.2500], [0.4668], [0.4668], [0.2500], [0.8008], [0.4668], [0.4668], [0.6016], [0.5000], [0.4004], [0.1670], [0.1426], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00124359130859375 loss: 0.002593994140625 loss: 0.002838134765625 loss: 0.0019683837890625 predicted value: tensor([[1.0938], [0.6836], [0.7969], [1.1094], [0.3379], [0.7031], [0.5195], [0.4746], [0.5508], [0.4297], [0.7656], [0.3750], [0.5234], [0.3203], [0.3086], [0.2793]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [0.8008], [1.0000], [0.2002], [0.6016], [0.7500], [0.3750], [0.4668], [0.3340], [0.6016], [0.2500], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028228759765625 loss: 0.00225830078125 loss: 0.0032806396484375 loss: 0.0029754638671875 predicted value: tensor([[0.5898], [1.0859], [0.5195], [0.5117], [0.8789], [1.0781], [0.5664], [1.0391], [0.7070], [0.8008], [1.1016], [0.5508], [0.4219], [0.5312], [0.1729], [0.2832]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.4668], [0.8008], [1.0000], [0.6016], [1.0000], [0.5000], [0.7500], [1.0000], [0.4004], [0.0400], [0.4004], [0.0400], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037689208984375 loss: 0.00171661376953125 loss: 0.002960205078125 loss: 0.004608154296875 83%|████████▎ | 406/492 [3:42:17<55:12, 38.52s/it] {'loss': 0.0107, 'learning_rate': 1e-05, 'epoch': 0.83} 83%|████████▎ | 406/492 [3:42:17<55:12, 38.52s/it]predicted value: tensor([[0.5000], [0.4121], [0.4199], [0.2832], [0.8203], [0.4434], [0.7578], [0.0850], [0.3047], [0.3203], [0.2930], [0.3867], [0.5977], [0.1777], [0.1758], [0.1973]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4004], [0.4668], [0.4668], [0.2500], [0.8008], [0.3340], [0.8008], [0.0278], [0.3340], [0.3340], [0.2500], [0.3340], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00153350830078125 loss: 0.000774383544921875 loss: 0.0009613037109375 loss: 0.00115966796875 predicted value: tensor([[0.8242], [0.7852], [0.9648], [0.4004], [0.3574], [0.4570], [0.9727], [0.2520], [0.7500], [0.5273], [0.5859], [0.4238], [0.6250], [0.1328], [0.1826], [0.1689]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6680], [1.0000], [0.3750], [0.3145], [0.4668], [1.0000], [0.2002], [0.8008], [0.7500], [0.6016], [0.4004], [0.6016], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0008697509765625 loss: 0.00335693359375loss: 0.0020294189453125 loss: 0.00146484375 predicted value: tensor([[0.9648], [0.8438], [0.9688], [0.3848], [0.2617], [0.9648], [0.5430], [0.6758], [0.9258], [0.5586], [0.3164], [0.5469], [0.3555], [0.3359], [0.1543], [0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [1.0000], [0.4668], [0.2500], [1.0000], [0.4277], [0.7500], [1.0000], [0.6016], [0.2500], [0.6016], [0.4004], 
[0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010528564453125 loss: 0.00089263916015625 loss: 0.000888824462890625 loss: 0.0027923583984375 predicted value: tensor([[0.5391], [0.8438], [0.9766], [0.6211], [0.4121], [0.4219], [0.1992], [0.5898], [0.6172], [0.2734], [0.2793], [0.9492], [0.4219], [0.4414], [0.1445], [0.1650]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [1.0000], [0.4648], [0.4668], [0.4668], [0.2002], [0.8008], [0.7500], [0.3340], [0.3340], [1.0000], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0005340576171875 loss: 0.00107574462890625 loss: 0.000553131103515625 loss: 0.00167083740234375 83%|████████▎ | 407/492 [3:42:48<51:47, 36.55s/it] {'loss': 0.0054, 'learning_rate': 1e-05, 'epoch': 0.83} 83%|████████▎ | 407/492 [3:42:48<51:47, 36.55s/it]predicted value: tensor([[0.4082], [0.3574], [0.9414], [0.6055], [0.9258], [0.9062], [0.6641], [0.9258], [0.2480], [0.2539], [0.5508], [0.3711], [0.2754], [0.1865], [0.1504], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.6016], [1.0000], [1.0000], [0.7500], [1.0000], [0.2500], [0.2500], [0.6016], [0.4004], [0.2500], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001739501953125 loss: 0.000896453857421875 loss: 0.00167083740234375 loss: 0.002716064453125 predicted value: tensor([[0.4102], [0.5898], [0.3457], [0.9375], [0.5781], [0.2559], [0.9258], [0.4375], [0.5391], [0.4219], [0.2236], [0.3184], [0.8984], [0.1416], [0.1318], [0.1768]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.4668], [1.0000], [0.7500], [0.2500], [1.0000], [0.3750], [0.6680], [0.4668], [0.2500], [0.4004], [1.0000], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000812530517578125 loss: 0.001068115234375 loss: 0.002777099609375 loss: 
0.00103759765625 predicted value: tensor([[0.9336], [0.3965], [0.2715], [0.4004], [0.9375], [0.3496], [0.5625], [0.3672], [0.3809], [0.3730], [0.3574], [0.5352], [0.1084], [0.3496], [0.3848], [0.1514]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3340], [0.4668], [1.0000], [0.3750], [0.5000], [0.4668], [0.4004], [0.4004], [0.3340], [0.6016], [0.0278], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020599365234375 loss: 0.00104522705078125 loss: 0.0036468505859375 loss: 0.001922607421875 predicted value: tensor([[0.2656], [0.3789], [0.9414], [0.3828], [0.3906], [0.9180], [0.6016], [0.1855], [0.3809], [0.4805], [0.4199], [0.9102], [0.3340], [0.1963], [0.1318], [0.1328]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.4668], [1.0000], [0.4668], [0.4668], [1.0000], [0.6016], [0.2500], [0.4004], [0.4668], [0.5000], [1.0000], [0.5000], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00140380859375 loss: 0.001373291015625 loss: 0.00115966796875 loss: 0.00057220458984375 83%|████████▎ | 408/492 [3:43:20<49:04, 35.05s/it] {'loss': 0.0065, 'learning_rate': 1e-05, 'epoch': 0.83} 83%|████████▎ | 408/492 [3:43:20<49:04, 35.05s/it]predicted value: tensor([[0.4336], [0.4102], [0.4395], [1.0000], [0.9492], [0.4219], [1.0000], [0.3066], [0.4062], [0.6914], [0.9961], [0.2930], [0.9453], [0.4863], [0.1943], [0.1660]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.4668], [1.0000], [1.0000], [0.4668], [1.0000], [0.2500], [0.4004], [0.7500], [1.0000], [0.2500], [1.0000], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00142669677734375 loss: 0.000904083251953125 loss: 0.0003223419189453125 loss: 0.0032806396484375 predicted value: tensor([[0.4570], [0.7773], [0.3984], [0.9805], [0.9883], [0.9883], [0.5625], [0.7422], [0.2852], [0.2051], [0.6016], [0.4199], [0.4199], 
[0.1592], [0.2207], [0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.4668], [1.0000], [1.0000], [1.0000], [0.5547], [0.6680], [0.2500], [0.2500], [0.5000], [0.4668], [0.5000], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013275146484375 loss: 0.001434326171875 loss: 0.000606536865234375 loss: 0.002777099609375 predicted value: tensor([[0.2773], [0.9727], [0.7930], [0.4102], [0.4883], [0.4590], [1.0000], [0.6289], [1.0000], [0.3203], [0.4609], [0.4531], [0.4629], [0.4570], [0.2285], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [1.0000], [0.8320], [0.3750], [0.8008], [0.4668], [1.0000], [0.5000], [1.0000], [0.3340], [0.4668], [0.5000], [0.5000], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019989013671875loss: 0.0004405975341796875 loss: 0.0011444091796875 loss: 0.005584716796875 predicted value: tensor([[0.5234], [0.9883], [0.7461], [0.9727], [0.9805], [0.9805], [0.5430], [0.6055], [0.3887], [0.9844], [0.5898], [0.5703], [0.4180], [0.4219], [0.2051], [0.1943]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.5547], [1.0000], [1.0000], [1.0000], [0.6016], [0.6016], [0.4668], [1.0000], [0.6016], [0.6016], [0.4004], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000698089599609375 loss: 0.00112152099609375 loss: 0.000827789306640625 loss: 0.000762939453125 83%|████████▎ | 409/492 [3:43:53<47:47, 34.55s/it] {'loss': 0.0062, 'learning_rate': 1e-05, 'epoch': 0.83} 83%|████████▎ | 409/492 [3:43:53<47:47, 34.55s/it]predicted value: tensor([[0.4980], [0.2598], [0.4746], [0.7422], [0.4727], [0.5391], [1.0391], [1.0156], [0.7266], [0.4688], [0.4277], [0.4863], [0.4492], [0.2168], [0.2393], [0.2080]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.2500], [0.4668], [0.8008], [0.4668], [0.4668], 
[1.0000], [1.0000], [0.6680], [0.4004], [0.4004], [0.5000], [0.4004], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000904083251953125 loss: 0.000637054443359375loss: 0.000896453857421875 loss: 0.00128173828125 predicted value: tensor([[0.4727], [0.7227], [0.6172], [0.7617], [1.0312], [1.0312], [1.0078], [0.6250], [0.4102], [0.5117], [0.1055], [0.5195], [0.4180], [0.2471], [0.2246], [0.1914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.6680], [0.8320], [0.6680], [1.0000], [1.0000], [1.0000], [0.7500], [0.3340], [0.5000], [0.0400], [0.5000], [0.3340], [0.2500], [0.2002], [0.1250]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00069427490234375 loss: 0.000446319580078125 loss: 0.00168609619140625 loss: 0.00109100341796875 predicted value: tensor([[0.5781], [1.0312], [0.4492], [0.4375], [0.4297], [0.3359], [0.2578], [0.8164], [0.6680], [0.6562], [0.4473], [0.4414], [0.3516], [0.3574], [0.2451], [0.1973]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.3750], [0.3750], [0.3340], [0.2500], [0.8320], [0.5000], [0.6016], [0.5000], [0.5000], [0.2500], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010223388671875 loss: 0.00164794921875 loss: 0.001251220703125 loss: 0.0008697509765625 predicted value: tensor([[0.5508], [0.3008], [0.5469], [0.4648], [0.4512], [0.3418], [1.0234], [0.6719], [0.4297], [0.7344], [0.2402], [0.8047], [0.2051], [0.4395], [0.2119], [0.2012]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.4668], [0.4668], [0.4668], [0.3340], [1.0000], [0.5000], [0.3750], [0.7500], [0.2002], [0.6680], [0.2002], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000957489013671875 loss: 0.002197265625 loss: 0.00104522705078125 loss: 0.0009307861328125 83%|████████▎ | 410/492 [3:44:27<46:38, 34.12s/it] {'loss': 0.0044, 'learning_rate': 
1e-05, 'epoch': 0.83} 83%|████████▎ | 410/492 [3:44:27<46:38, 34.12s/it]predicted value: tensor([[0.4277], [0.2871], [0.4395], [1.0156], [0.7070], [0.6211], [0.4531], [0.2412], [0.6875], [0.7227], [0.5820], [0.0282], [0.4336], [0.4238], [0.1602], [0.2012]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.4668], [1.0000], [0.6016], [0.6680], [0.4668], [0.2002], [0.6016], [0.7500], [0.6016], [0.0278], [0.5000], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00128173828125 loss: 0.000751495361328125loss: 0.0047607421875 loss: 0.000732421875 predicted value: tensor([[1.0156], [0.3555], [0.3555], [0.6055], [0.5508], [0.8438], [0.7031], [0.8203], [0.4492], [1.0156], [0.4473], [0.6367], [0.5742], [0.4082], [0.3438], [0.1699]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3145], [0.7500], [0.5547], [0.8320], [0.6680], [0.8008], [0.3750], [1.0000], [0.5000], [0.6016], [0.4277], [0.4004], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000904083251953125 loss: 0.0011138916015625 loss: 0.000518798828125 loss: 0.002410888671875 predicted value: tensor([[0.4355], [0.7344], [0.5469], [0.4160], [0.7539], [0.4043], [0.2207], [0.6719], [0.4180], [0.5977], [0.4395], [0.3867], [0.4023], [0.2207], [0.1797], [0.2158]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.5547], [0.4668], [0.7500], [0.4668], [0.2500], [0.7500], [0.3340], [0.6016], [0.4004], [0.4004], [0.4004], [0.2500], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000560760498046875loss: 0.001800537109375 loss: 0.00040435791015625 loss: 0.00061798095703125 predicted value: tensor([[0.4043], [0.3789], [0.4258], [0.9961], [0.6875], [0.6289], [0.6562], [1.0078], [0.2432], [0.4102], [0.5898], [1.0078], [0.3965], [0.4121], [0.1582], [0.1807]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], 
[0.3750], [0.4668], [1.0000], [0.8008], [0.8008], [0.7500], [1.0000], [0.2500], [0.5000], [0.6016], [1.0000], [0.4004], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0004062652587890625 loss: 0.0009307861328125 loss: 0.00140380859375 loss: 0.00156402587890625 84%|████████▎ | 411/492 [3:45:00<45:36, 33.78s/it] {'loss': 0.005, 'learning_rate': 1e-05, 'epoch': 0.84} 84%|████████▎ | 411/492 [3:45:00<45:36, 33.78s/it]predicted value: tensor([[0.5039], [0.5000], [0.4961], [0.6211], [1.0156], [0.8438], [0.9062], [1.0156], [1.0234], [1.0781], [0.6836], [0.0559], [0.4297], [0.4902], [0.2012], [0.1992]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.5547], [1.0000], [0.8008], [0.8320], [1.0000], [1.0000], [1.0000], [0.6016], [0.0400], [0.3340], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015716552734375 loss: 0.00118255615234375 loss: 0.000640869140625 loss: 0.0012664794921875 predicted value: tensor([[0.5820], [0.8984], [1.0000], [0.4473], [0.2441], [0.2812], [0.6523], [0.2480], [0.6523], [0.2715], [0.3105], [0.3945], [0.5000], [0.4980], [0.2100], [0.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8555], [1.0000], [0.4668], [0.2500], [0.3340], [0.7500], [0.2500], [0.6016], [0.2500], [0.2500], [0.4004], [0.4004], [0.5000], [0.1670], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00086212158203125 loss: 0.003173828125 loss: 0.00058746337890625 loss: 0.00128936767578125 predicted value: tensor([[0.7422], [0.2490], [0.5977], [0.4531], [0.4766], [0.4941], [0.6914], [0.8633], [0.2695], [0.7070], [0.4434], [0.2969], [0.3027], [0.4473], [0.4062], [0.2285]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.2002], [0.4668], [0.4668], [0.4668], [0.4668], [0.8008], [0.8008], [0.2002], [0.7500], [0.4004], [0.3340], [0.3340], [0.4004], [0.4004], [0.2002]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.0004711151123046875 loss: 0.0010986328125 loss: 0.0008392333984375 loss: 0.00714111328125 predicted value: tensor([[0.2773], [0.5781], [0.5352], [0.8398], [0.5977], [0.5977], [0.6992], [1.0469], [0.3066], [0.7656], [1.0469], [0.4238], [0.3906], [0.6562], [0.4180], [0.2236]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.5547], [0.2500], [0.8008], [0.4668], [0.6016], [0.6016], [1.0000], [0.2500], [0.8008], [1.0000], [0.4004], [0.4004], [0.8008], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000934600830078125 loss: 0.00042724609375 loss: 0.0020904541015625 loss: 0.002227783203125 84%|████████▎ | 412/492 [3:45:33<44:43, 33.55s/it] {'loss': 0.0065, 'learning_rate': 1e-05, 'epoch': 0.84} 84%|████████▎ | 412/492 [3:45:33<44:43, 33.55s/it]predicted value: tensor([[0.5430], [1.0000], [0.9922], [0.3926], [0.4883], [0.6758], [0.7148], [0.5352], [0.2354], [0.9688], [0.5586], [0.4082], [0.3926], [0.4121], [0.2021], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.3750], [0.4668], [0.7500], [0.7500], [0.6680], [0.2500], [1.0000], [0.7500], [0.6016], [0.3340], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016632080078125 loss: 0.00162506103515625 loss: 0.0024566650390625 loss: 0.0004425048828125 predicted value: tensor([[0.5391], [0.4590], [0.5859], [0.8477], [0.6797], [0.5586], [0.7422], [0.9922], [0.6406], [0.4453], [0.8086], [0.4023], [0.2656], [0.4277], [0.4082], [0.1582]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.5547], [0.8320], [0.7500], [0.4668], [0.8008], [1.0000], [0.6680], [0.4668], [0.8008], [0.4004], [0.3340], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012359619140625 loss: 0.000621795654296875 loss: 0.0018157958984375 loss: 0.000301361083984375 predicted value: tensor([[0.4980], [0.4512], 
[0.4160], [0.4609], [0.7344], [0.7969], [0.4238], [0.9922], [0.3828], [0.2578], [0.5430], [0.4414], [0.2227], [0.3613], [0.3730], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.4668], [0.8008], [0.8320], [0.4668], [1.0000], [0.3750], [0.2500], [0.6016], [0.5000], [0.2500], [0.4004], [0.2852], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00165557861328125 loss: 0.0004787445068359375loss: 0.002532958984375 loss: 0.001251220703125 predicted value: tensor([[0.4355], [0.4824], [1.0234], [0.1855], [0.7031], [1.0156], [0.4863], [0.6367], [0.1709], [0.6172], [0.5859], [0.2490], [0.4824], [0.3438], [0.3574], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.2002], [0.6680], [1.0000], [0.8008], [0.6680], [0.2002], [0.6016], [0.6016], [0.2002], [0.7500], [0.5000], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003204345703125 loss: 0.00081634521484375 loss: 0.000362396240234375 loss: 0.00112152099609375 84%|████████▍ | 413/492 [3:46:05<43:38, 33.15s/it] {'loss': 0.0054, 'learning_rate': 1e-05, 'epoch': 0.84} 84%|████████▍ | 413/492 [3:46:05<43:38, 33.15s/it]predicted value: tensor([[1.0312], [1.0234], [0.2578], [0.4707], [1.0000], [0.6875], [0.5859], [0.5195], [0.9961], [0.4180], [0.2383], [0.6250], [0.3359], [0.3926], [0.2021], [0.2432]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.2500], [0.4668], [1.0000], [0.6016], [0.5547], [0.4668], [1.0000], [0.4004], [0.2002], [0.7500], [0.3340], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009765625 loss: 0.00046539306640625 loss: 0.000518798828125 loss: 0.0002384185791015625 predicted value: tensor([[0.4844], [1.0234], [0.7930], [0.4785], [0.6523], [0.2871], [0.2559], [0.4824], [0.4492], [0.3066], [1.0312], [0.3672], [0.5000], [0.2168], [0.2441], [0.2451]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.8008], [0.3750], [0.6172], [0.3340], [0.2500], [0.4668], [0.4668], [0.3340], [1.0000], [0.3340], [0.5000], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000499725341796875 loss: 0.0003681182861328125loss: 0.0007171630859375 loss: 0.00128173828125 predicted value: tensor([[0.7578], [1.0000], [0.5039], [0.6211], [0.2148], [0.4414], [0.4434], [0.7773], [0.2461], [0.4258], [0.1934], [0.6133], [0.5938], [0.4531], [0.4414], [0.2285]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.4648], [0.2500], [0.3750], [0.3750], [0.8008], [0.2500], [0.3340], [0.2002], [0.6016], [0.6016], [0.4004], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000469207763671875 loss: 0.00179290771484375loss: 0.0014801025390625 loss: 0.00080108642578125 predicted value: tensor([[0.2490], [0.5742], [0.2090], [0.5078], [0.6914], [0.3555], [0.7891], [0.4980], [0.2227], [0.5273], [0.9805], [0.4219], [1.0312], [0.2432], [0.2002], [0.2236]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.5547], [0.2500], [0.4668], [0.6680], [0.2002], [0.8008], [0.5000], [0.1670], [0.5000], [1.0000], [0.5000], [1.0000], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020294189453125 loss: 0.00113677978515625 loss: 0.0011138916015625 loss: 0.000659942626953125 84%|████████▍ | 414/492 [3:46:37<42:55, 33.01s/it] {'loss': 0.0036, 'learning_rate': 1e-05, 'epoch': 0.84} 84%|████████▍ | 414/492 [3:46:37<42:55, 33.01s/it]predicted value: tensor([[0.5664], [0.1709], [0.3086], [0.7031], [0.2080], [0.1562], [0.6484], [0.5352], [0.5039], [0.9414], [0.6133], [0.4375], [0.3965], [0.3594], [0.2227], [0.2061]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.3340], [0.8008], [0.3340], [0.2500], [0.7500], [0.6016], [0.7500], [1.0000], [0.6016], 
[0.6016], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0008544921875 loss: 0.0026397705078125loss: 0.00115966796875 loss: 0.000774383544921875 predicted value: tensor([[0.6016], [0.9531], [0.4434], [0.4141], [0.4727], [0.8164], [0.7578], [0.7500], [0.1699], [0.2793], [0.4902], [0.9688], [0.9453], [0.3262], [0.2109], [0.2373]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3145], [0.6016], [0.4668], [0.8320], [0.8008], [0.6680], [0.2500], [0.3340], [0.6016], [1.0000], [1.0000], [0.7500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00146484375 loss: 0.004241943359375loss: 0.000522613525390625 loss: 0.001495361328125 predicted value: tensor([[0.5664], [0.7891], [0.3906], [0.4238], [0.2373], [0.8086], [0.3984], [0.2695], [0.5820], [0.4629], [0.9648], [0.3652], [0.3770], [0.4492], [0.4570], [0.2324]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.3750], [0.4668], [0.3340], [0.8320], [0.7500], [0.3340], [0.6016], [0.5000], [1.0000], [0.3340], [0.2852], [0.5000], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002349853515625 loss: 0.0016632080078125 loss: 0.0025177001953125 loss: 0.004058837890625 predicted value: tensor([[0.7852], [0.7969], [0.3984], [0.9648], [0.3887], [0.7109], [0.4766], [0.6445], [0.6406], [0.6680], [0.8086], [0.3711], [0.5234], [0.4824], [0.2178], [0.2461]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8008], [0.4668], [1.0000], [0.4668], [0.7500], [0.5000], [0.6016], [0.7500], [0.6016], [0.8008], [0.4004], [0.5000], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0028533935546875 loss: 0.00118255615234375 loss: 0.0010833740234375 loss: 0.000606536865234375 84%|████████▍ | 415/492 [3:47:10<42:11, 32.88s/it] {'loss': 0.0074, 'learning_rate': 1e-05, 'epoch': 0.84} 84%|████████▍ | 415/492 [3:47:10<42:11, 
32.88s/it]predicted value: tensor([[0.5352], [0.4922], [0.5859], [0.8281], [0.8164], [0.2812], [0.4805], [0.9844], [0.6250], [0.9844], [0.5156], [0.2969], [0.3066], [0.4355], [0.4180], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4648], [0.8320], [0.8008], [0.2500], [0.4668], [1.0000], [0.5000], [1.0000], [0.4668], [0.2500], [0.4004], [0.4004], [0.0400], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00054931640625 loss: 0.00299072265625loss: 0.000858306884765625 loss: 0.000751495361328125 predicted value: tensor([[1.0000], [0.7812], [0.8008], [0.9922], [0.4688], [0.2930], [0.5000], [0.2109], [0.6328], [0.9688], [0.4023], [0.4395], [0.2422], [0.5977], [0.2715], [0.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.7148], [1.0000], [0.3750], [0.3340], [0.3750], [0.2500], [0.6016], [1.0000], [0.4004], [0.4004], [0.2500], [0.6016], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000812530517578125loss: 0.00147247314453125 loss: 0.00107574462890625 loss: 0.00090789794921875 predicted value: tensor([[0.9805], [0.2256], [1.0000], [0.4160], [0.8359], [0.9531], [0.4941], [1.0000], [0.9766], [0.5117], [0.9688], [0.5781], [0.0850], [0.4219], [0.2695], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [1.0000], [0.4668], [0.8008], [1.0000], [0.4668], [1.0000], [1.0000], [0.4668], [1.0000], [0.6016], [0.0625], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000858306884765625loss: 0.0038299560546875 loss: 0.0003108978271484375 loss: 0.00066375732421875 predicted value: tensor([[0.3066], [0.9648], [0.5664], [0.6172], [0.5156], [0.9727], [0.3125], [0.5156], [0.2676], [0.6406], [0.5156], [0.5508], [0.5234], [0.1494], [0.2471], [0.2852]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [1.0000], [0.4648], [0.6016], [0.4648], 
[1.0000], [0.2002], [0.4277], [0.2002], [0.6016], [0.5000], [0.6016], [0.5000], [0.0625], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000545501708984375 loss: 0.00142669677734375 loss: 0.001129150390625 loss: 0.0010833740234375 85%|████████▍ | 416/492 [3:47:44<41:55, 33.09s/it] {'loss': 0.0048, 'learning_rate': 1e-05, 'epoch': 0.85} 85%|████████▍ | 416/492 [3:47:44<41:55, 33.09s/it]predicted value: tensor([[0.5312], [0.9492], [0.9766], [0.9531], [0.4590], [0.5547], [0.4395], [0.3613], [0.5430], [0.4199], [0.4922], [0.3965], [0.3613], [0.2314], [0.1963], [0.2236]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [1.0000], [1.0000], [1.0000], [0.4668], [0.5547], [0.6016], [0.4668], [0.6016], [0.5000], [0.5000], [0.4004], [0.4004], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00128936767578125 loss: 0.000885009765625 loss: 0.000972747802734375 loss: 0.000812530517578125 predicted value: tensor([[0.3984], [0.9766], [0.9883], [0.9492], [0.9609], [0.7305], [0.4297], [0.3750], [0.7891], [0.5195], [0.5547], [0.3398], [0.6250], [0.1875], [0.4043], [0.2119]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [1.0000], [1.0000], [0.6680], [0.4004], [0.3750], [0.8008], [0.6016], [0.5000], [0.2002], [0.6016], [0.2002], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000457763671875 loss: 0.00063323974609375 loss: 0.00077056884765625 loss: 0.0003948211669921875 predicted value: tensor([[0.1953], [0.3730], [0.9609], [0.9648], [0.9805], [0.9414], [0.6758], [0.8398], [0.9570], [0.7773], [0.2217], [0.3555], [0.3965], [0.4160], [0.1924], [0.2080]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.4668], [1.0000], [1.0000], [1.0000], [1.0000], [0.6680], [0.8320], [1.0000], [0.8008], [0.3340], [0.3340], [0.5000], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.000762939453125 loss: 0.000728607177734375loss: 0.000843048095703125 loss: 0.0047607421875 predicted value: tensor([[0.7305], [0.9727], [0.7109], [0.3926], [0.9766], [0.4434], [0.9609], [0.4531], [0.7070], [0.5742], [0.5469], [0.4883], [0.1895], [0.3730], [0.4043], [0.2422]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [0.8008], [0.4668], [1.0000], [0.4668], [1.0000], [0.4668], [0.6680], [0.6016], [0.5000], [0.7500], [0.2002], [0.4004], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00077056884765625 loss: 0.00167083740234375 loss: 0.00112152099609375 loss: 0.00164031982421875 85%|████████▍ | 417/492 [3:48:18<41:41, 33.36s/it] {'loss': 0.0046, 'learning_rate': 1e-05, 'epoch': 0.85} 85%|████████▍ | 417/492 [3:48:18<41:41, 33.36s/it]predicted value: tensor([[0.3789], [1.0000], [0.4395], [0.4570], [0.7734], [0.4297], [0.2676], [1.0234], [0.5195], [0.0640], [0.5898], [0.3145], [0.4336], [0.4219], [0.2236], [0.2539]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [1.0000], [0.3750], [0.4668], [0.8008], [0.3750], [0.3340], [1.0000], [0.6016], [0.0278], [0.5000], [0.2500], [0.5000], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0005645751953125 loss: 0.00064849853515625 loss: 0.00067138671875 loss: 0.000972747802734375 predicted value: tensor([[0.6875], [1.0156], [0.4277], [0.9922], [0.7148], [1.0234], [0.6406], [0.2832], [0.4531], [1.0156], [0.6602], [0.4434], [0.3418], [0.2344], [0.2461], [0.2324]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [1.0000], [0.4668], [1.0000], [0.7500], [1.0000], [0.6016], [0.2500], [0.3750], [1.0000], [0.7500], [0.2500], [0.3340], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000537872314453125 loss: 0.00099945068359375loss: 0.001373291015625 loss: 0.00040435791015625 predicted value: tensor([[0.7578], [1.0234], [1.0156], [0.7930], [0.2559], 
[1.0156], [0.2402], [0.7188], [0.7227], [0.1816], [0.4219], [1.0078], [0.2227], [0.2578], [0.2188], [0.0596]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [1.0000], [0.8008], [0.2500], [1.0000], [0.1426], [0.7500], [0.7500], [0.2500], [0.4004], [1.0000], [0.1670], [0.2500], [0.1670], [0.0400]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00060272216796875 loss: 0.00225830078125 loss: 0.000400543212890625 loss: 0.00098419189453125 predicted value: tensor([[0.4609], [0.6836], [0.6914], [0.2480], [1.0312], [0.4629], [0.5938], [0.4082], [0.9961], [0.3574], [0.2275], [0.4121], [0.2256], [0.2451], [0.2148], [0.2129]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8008], [0.2500], [1.0000], [0.3750], [0.5000], [0.3750], [1.0000], [0.3340], [0.2500], [0.4004], [0.2500], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000614166259765625 loss: 0.0009918212890625 loss: 0.00077056884765625 loss: 0.00070953369140625 85%|████████▍ | 418/492 [3:48:51<41:00, 33.25s/it] {'loss': 0.0034, 'learning_rate': 1e-05, 'epoch': 0.85} 85%|████████▍ | 418/492 [3:48:51<41:00, 33.25s/it]predicted value: tensor([[0.8281], [0.7539], [0.9805], [0.6758], [0.2246], [0.5898], [0.9805], [0.5391], [0.9805], [0.6328], [0.6445], [0.4434], [0.4316], [0.1099], [0.1387], [0.1758]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.8008], [1.0000], [0.6680], [0.2500], [0.6016], [1.0000], [0.5000], [1.0000], [0.6680], [0.7500], [0.5000], [0.5000], [0.4004], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00102996826171875 loss: 0.00177001953125loss: 0.001434326171875 loss: 0.00121307373046875 predicted value: tensor([[0.4258], [0.4004], [0.5039], [0.7500], [0.6211], [0.9844], [0.5898], [0.4082], [0.8242], [0.6055], [0.5547], [0.6562], [0.4004], [0.4668], [0.4082], [0.1504]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value 
label: tensor([[0.4668], [0.4668], [0.5547], [0.8008], [0.6016], [1.0000], [0.5547], [0.3750], [0.8320], [0.6016], [0.5000], [0.4668], [0.4004], [0.5000], [0.5000], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002166748046875 loss: 0.00098419189453125 loss: 0.000659942626953125 loss: 0.000675201416015625 predicted value: tensor([[0.4785], [0.5117], [0.6602], [0.3184], [0.4961], [0.2871], [0.9570], [0.9844], [0.9844], [0.2334], [0.2734], [0.6641], [0.2031], [0.3945], [0.1855], [0.1660]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.6172], [0.3145], [0.5547], [0.3340], [1.0000], [1.0000], [1.0000], [0.2500], [0.3340], [0.3340], [0.0278], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00067901611328125 loss: 0.00168609619140625 loss: 0.002716064453125 loss: 0.00125885009765625 predicted value: tensor([[0.5703], [0.3926], [0.9922], [0.6602], [0.4805], [0.1738], [0.7227], [0.5703], [0.9766], [0.2197], [0.2832], [0.6484], [0.3887], [0.4277], [0.1914], [0.1826]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.3750], [1.0000], [0.6680], [0.3750], [0.2500], [0.7500], [0.7500], [1.0000], [0.3340], [0.3340], [0.7500], [0.3340], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016937255859375 loss: 0.00144195556640625 loss: 0.0011138916015625 loss: 0.0012969970703125 85%|████████▌ | 419/492 [3:49:24<40:41, 33.45s/it] {'loss': 0.0055, 'learning_rate': 1e-05, 'epoch': 0.85} 85%|████████▌ | 419/492 [3:49:24<40:41, 33.45s/it]predicted value: tensor([[0.5547], [1.0156], [0.5977], [0.4004], [1.0312], [0.5742], [0.8008], [0.2988], [1.0234], [1.0312], [0.5820], [0.5039], [0.0530], [0.2256], [0.1914], [0.1914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.7500], [0.3750], [1.0000], [0.5547], [0.8008], [0.2002], [1.0000], [1.0000], [0.2500], [0.5000], [0.0400], [0.2002], [0.2002], 
[0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0008697509765625 loss: 0.000812530517578125 loss: 0.0023193359375 loss: 0.00112152099609375 predicted value: tensor([[0.4316], [0.7969], [0.4473], [0.3027], [0.2852], [1.0234], [0.4883], [0.5586], [0.5664], [1.0391], [0.5000], [0.3867], [0.3789], [0.4004], [0.2061], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.4668], [0.2500], [0.2500], [1.0000], [0.4668], [0.5000], [0.3340], [1.0000], [0.5000], [0.3340], [0.3340], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00311279296875 loss: 0.0016632080078125 loss: 0.0011749267578125 loss: 0.0048828125 predicted value: tensor([[0.4004], [0.4648], [0.8008], [0.6328], [0.8320], [0.1187], [0.4238], [0.2305], [0.9766], [0.5586], [0.7930], [0.4023], [0.4414], [0.2412], [0.3672], [0.2070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.8008], [0.4277], [0.8008], [0.0625], [0.3750], [0.2002], [1.0000], [0.5000], [0.8008], [0.2852], [0.5000], [0.2002], [0.4004], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000713348388671875 loss: 0.000507354736328125 loss: 0.00118255615234375 loss: 0.0023345947265625 predicted value: tensor([[0.4805], [0.5508], [0.3223], [1.0391], [0.7812], [0.7188], [1.0234], [0.2891], [0.4590], [1.0156], [0.3730], [1.0234], [0.7148], [0.2158], [0.1924], [0.2070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.2500], [1.0000], [0.6680], [0.8008], [1.0000], [0.3340], [0.4668], [1.0000], [0.2852], [1.0000], [0.6016], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000518798828125 loss: 0.00213623046875 loss: 0.000858306884765625 loss: 0.000518798828125 85%|████████▌ | 420/492 [3:49:57<39:40, 33.06s/it] {'loss': 0.0062, 'learning_rate': 1e-05, 'epoch': 0.85} 85%|████████▌ | 420/492 [3:49:57<39:40, 33.06s/it]predicted value: 
tensor([[0.7578], [0.4238], [0.6602], [1.0078], [0.3379], [0.7383], [0.9844], [1.0000], [0.7383], [0.6875], [0.3887], [0.3477], [0.3672], [0.2949], [0.1348], [0.1494]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.6680], [1.0000], [0.3145], [0.8008], [1.0000], [1.0000], [0.8008], [0.7500], [0.5000], [0.4004], [0.4004], [0.0400], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00109100341796875 loss: 0.0015716552734375loss: 0.000507354736328125 loss: 0.000743865966796875 predicted value: tensor([[0.4199], [0.7695], [0.5273], [0.4902], [0.7070], [0.4766], [0.9961], [0.0623], [0.7617], [0.6328], [0.4805], [0.4102], [0.3340], [0.3633], [0.1562], [0.1357]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.2500], [0.5547], [0.8008], [0.4668], [1.0000], [0.0278], [0.8320], [0.7500], [0.4668], [0.4004], [0.3340], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00140380859375 loss: 0.000942230224609375 loss: 0.00188446044921875 loss: 0.000759124755859375 predicted value: tensor([[0.9805], [0.7617], [0.7656], [0.4160], [0.4023], [0.6328], [0.9961], [0.4492], [0.6836], [0.5430], [0.2490], [0.6367], [0.3359], [0.1533], [0.1865], [0.1436]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.8008], [0.4668], [0.3750], [0.7500], [1.0000], [0.4668], [0.4668], [0.5000], [0.2500], [0.6680], [0.5000], [0.2002], [0.2500], [0.1250]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0007476806640625 loss: 0.00162506103515625loss: 0.000751495361328125 loss: 0.000957489013671875 predicted value: tensor([[0.7695], [0.8242], [0.4473], [0.3789], [0.5195], [0.4082], [0.3770], [0.2139], [0.2871], [0.5859], [0.3926], [0.0588], [0.3242], [0.1768], [0.1523], [0.3594]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8555], [0.4668], [0.3750], [0.8008], [0.3750], [0.3750], [0.2002], 
[0.2500], [0.5000], [0.5000], [0.0625], [0.3340], [0.2500], [0.1670], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000843048095703125 loss: 0.001708984375 loss: 0.000553131103515625 loss: 0.00177001953125 86%|████████▌ | 421/492 [3:50:28<38:39, 32.67s/it] {'loss': 0.0045, 'learning_rate': 1e-05, 'epoch': 0.86} 86%|████████▌ | 421/492 [3:50:28<38:39, 32.67s/it]predicted value: tensor([[0.6758], [0.7773], [0.6758], [0.7656], [0.4473], [0.4629], [0.2871], [0.3242], [0.4434], [0.5273], [0.4922], [0.5742], [0.4199], [0.3906], [0.2207], [0.2139]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7344], [0.8320], [0.6680], [0.6680], [0.3750], [0.4668], [0.2002], [0.3145], [0.4668], [0.6016], [0.6016], [0.6016], [0.5000], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000751495361328125 loss: 0.00164031982421875 loss: 0.000934600830078125 loss: 0.001434326171875 predicted value: tensor([[0.5547], [0.5742], [0.3027], [0.4590], [1.0469], [0.4805], [1.0391], [0.6836], [0.7070], [0.6875], [0.5938], [1.0234], [0.4160], [0.3652], [0.2012], [0.2246]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.6172], [0.2500], [0.4668], [1.0000], [0.4668], [1.0000], [0.7500], [0.6680], [0.6016], [0.6016], [1.0000], [0.4004], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021514892578125 loss: 0.000514984130859375 loss: 0.0004825592041015625 loss: 0.000858306884765625 predicted value: tensor([[0.5000], [0.5195], [0.4434], [0.8047], [0.7852], [0.7891], [0.5508], [0.5781], [0.3613], [0.7461], [0.6172], [0.4355], [0.4043], [0.2139], [0.4082], [0.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.7148], [0.4668], [0.8320], [0.8008], [0.8320], [0.4004], [0.6016], [0.3340], [0.8008], [0.6680], [0.4004], [0.3340], [0.2002], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00151824951171875 loss: 
0.0005950927734375 loss: 0.001434326171875 loss: 0.00140380859375 predicted value: tensor([[1.0078], [0.3418], [0.4414], [1.0078], [1.0078], [0.4629], [0.6680], [0.7578], [0.4434], [0.5391], [0.4238], [0.4570], [0.4141], [0.3965], [0.1611], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [0.4668], [1.0000], [1.0000], [0.4668], [0.5703], [0.8320], [0.4004], [0.5000], [0.4004], [0.5000], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00057220458984375 loss: 0.00081634521484375 loss: 0.0012664794921875 loss: 0.00038909912109375 86%|████████▌ | 422/492 [3:51:02<38:19, 32.85s/it] {'loss': 0.0042, 'learning_rate': 1e-05, 'epoch': 0.86} 86%|████████▌ | 422/492 [3:51:02<38:19, 32.85s/it]predicted value: tensor([[0.5000], [0.9805], [0.2520], [0.6719], [0.4531], [0.9766], [0.2871], [0.3828], [0.2490], [0.2520], [0.4688], [0.9570], [0.3691], [0.4062], [0.1650], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.2500], [0.7500], [0.3145], [1.0000], [0.2500], [0.3750], [0.3340], [0.3340], [0.6016], [1.0000], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00151824951171875 loss: 0.003265380859375 loss: 0.00118255615234375 loss: 0.00154876708984375 predicted value: tensor([[0.9609], [0.7734], [0.7422], [0.2871], [0.9922], [0.6484], [0.5156], [0.7422], [0.7266], [0.6133], [0.4258], [0.4355], [0.3867], [0.4355], [0.1943], [0.1631]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.8008], [0.3340], [1.0000], [0.7500], [0.6016], [0.7500], [0.8008], [0.6680], [0.5000], [0.5000], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0006561279296875 loss: 0.000926971435546875 loss: 0.00119781494140625 loss: 0.00146484375 predicted value: tensor([[0.6680], [0.4531], [0.3672], [0.5312], [0.9492], [0.5430], [0.9805], [0.9648], 
[0.3867], [0.6172], [0.4219], [0.3828], [0.3984], [0.1719], [0.1982], [0.2168]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.4668], [0.5547], [1.0000], [0.6016], [1.0000], [1.0000], [0.4004], [0.6016], [0.5000], [0.3340], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00093841552734375 loss: 0.000576019287109375 loss: 0.001129150390625loss: 0.000904083251953125 predicted value: tensor([[0.2334], [0.9766], [0.2734], [0.7539], [0.9844], [0.5859], [0.9922], [0.2715], [0.2598], [0.5156], [0.4434], [0.2070], [0.4121], [0.1562], [0.1904], [0.1797]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [1.0000], [0.3340], [0.8320], [1.0000], [0.7500], [1.0000], [0.2500], [0.3340], [0.6016], [0.4004], [0.2500], [0.4004], [0.1670], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000518798828125 loss: 0.0008697509765625 loss: 0.00030517578125 loss: 0.0009307861328125 86%|████████▌ | 423/492 [3:51:35<37:57, 33.01s/it] {'loss': 0.0045, 'learning_rate': 1e-05, 'epoch': 0.86} 86%|████████▌ | 423/492 [3:51:35<37:57, 33.01s/it]predicted value: tensor([[0.6367], [0.3223], [0.9961], [0.7617], [0.4922], [0.2500], [0.6836], [0.3105], [0.3770], [0.4336], [0.9844], [0.4199], [0.4141], [0.4355], [0.2246], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3340], [1.0000], [0.8008], [0.4668], [0.2002], [0.8008], [0.2500], [0.4668], [0.7500], [1.0000], [0.3750], [0.2500], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000579833984375 loss: 0.003143310546875 loss: 0.00170135498046875 loss: 0.000896453857421875 predicted value: tensor([[0.3789], [0.9844], [1.0000], [0.4609], [0.9883], [0.2637], [0.5664], [0.7812], [0.5430], [1.0078], [0.7578], [0.5391], [0.4980], [0.5195], [0.2344], [0.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3145], [1.0000], 
[1.0000], [0.4668], [1.0000], [0.2500], [0.5000], [0.8008], [0.6016], [1.0000], [0.8555], [0.6016], [0.5000], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001495361328125 loss: 0.00043487548828125loss: 0.00069427490234375 loss: 0.0004711151123046875 predicted value: tensor([[0.4531], [0.4160], [0.5195], [0.5547], [0.7461], [0.6523], [0.9805], [0.4590], [0.6094], [0.7930], [0.2480], [0.6211], [0.4570], [0.9961], [0.3672], [0.2129]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.5547], [0.6016], [0.6680], [0.7500], [1.0000], [0.4668], [0.6016], [0.8008], [0.2002], [0.6680], [0.4004], [1.0000], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0008392333984375 loss: 0.00048828125 loss: 0.00179290771484375 loss: 0.0016326904296875 predicted value: tensor([[0.4238], [0.8320], [0.4551], [0.4277], [0.4551], [1.0078], [0.3047], [0.7500], [0.4590], [1.0156], [0.5234], [0.2676], [0.4043], [0.0322], [0.4824], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.4668], [0.4668], [0.4668], [1.0000], [0.3340], [0.7500], [0.6016], [1.0000], [0.5000], [0.0400], [0.5000], [0.0400], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000766754150390625 loss: 0.00075531005859375 loss: 0.0022125244140625loss: 0.00087738037109375 86%|████████▌ | 424/492 [3:52:09<37:37, 33.19s/it] {'loss': 0.0047, 'learning_rate': 1e-05, 'epoch': 0.86} 86%|████████▌ | 424/492 [3:52:09<37:37, 33.19s/it]predicted value: tensor([[0.5703], [0.2871], [0.4531], [0.6133], [0.4355], [1.0547], [0.5977], [0.4609], [1.0156], [0.6172], [0.3750], [0.4473], [0.3770], [0.5078], [0.2676], [0.2207]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.3750], [0.8008], [0.3340], [1.0000], [0.6016], [0.4668], [1.0000], [0.6016], [0.3340], [0.4004], [0.3340], [0.5000], [0.2002], [0.2500]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.000637054443359375 loss: 0.0011138916015625 loss: 0.00106048583984375loss: 0.001220703125 predicted value: tensor([[0.7500], [0.5664], [0.7734], [0.4590], [0.4805], [0.3359], [0.7500], [0.4766], [0.6602], [0.4121], [0.4902], [0.3867], [0.5234], [0.2480], [0.2578], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.5547], [0.6680], [0.3750], [0.4668], [0.2500], [0.7500], [0.4668], [0.6016], [0.4004], [0.4004], [0.3340], [0.5000], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018157958984375 loss: 0.001190185546875 loss: 0.000762939453125 loss: 0.000640869140625 predicted value: tensor([[0.4375], [0.7422], [0.5000], [0.7578], [1.0312], [0.3242], [0.7500], [0.3008], [0.6211], [0.6875], [0.5586], [0.4180], [0.5000], [0.2383], [0.2383], [0.2285]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.4668], [0.8008], [1.0000], [0.2500], [0.6680], [0.2500], [0.6016], [0.6016], [0.8320], [0.4004], [0.4004], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000606536865234375 loss: 0.0019073486328125 loss: 0.0010986328125 loss: 0.000701904296875 predicted value: tensor([[0.4727], [0.4512], [0.5195], [0.5352], [0.4844], [0.4980], [1.0312], [0.3574], [0.3945], [0.2812], [0.6406], [0.3457], [0.4082], [0.2285], [0.2197], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3750], [0.3750], [0.4668], [0.4668], [0.4668], [1.0000], [0.2500], [0.3750], [0.2500], [0.5000], [0.3340], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00139617919921875loss: 0.0009002685546875 loss: 0.00133514404296875 loss: 0.000762939453125 86%|████████▋ | 425/492 [3:52:42<37:02, 33.18s/it] {'loss': 0.0043, 'learning_rate': 1e-05, 'epoch': 0.86} 86%|████████▋ | 425/492 [3:52:42<37:02, 33.18s/it]predicted value: tensor([[0.5508], [0.8555], [0.4219], 
[0.4883], [0.8086], [0.6172], [0.4102], [0.2031], [0.5898], [0.3398], [0.6523], [0.5117], [0.4746], [0.6484], [0.3496], [0.1924]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.3750], [0.4668], [0.8008], [0.5547], [0.4668], [0.4004], [0.5000], [0.7500], [0.7500], [0.6016], [0.4004], [0.7500], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00144195556640625 loss: 0.00445556640625loss: 0.000568389892578125 loss: 0.0018768310546875 predicted value: tensor([[0.9961], [1.0078], [0.4434], [0.4785], [0.3789], [0.5312], [0.2637], [0.3867], [0.7266], [0.7188], [0.4375], [0.5938], [0.4141], [0.4141], [0.4395], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.3750], [0.4668], [0.4668], [0.8008], [0.2500], [0.3750], [0.7500], [0.6680], [0.5000], [0.6016], [0.4004], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000823974609375 loss: 0.000698089599609375 loss: 0.0021820068359375 loss: 0.0021209716796875 predicted value: tensor([[0.4453], [0.4277], [1.0156], [0.8359], [0.5312], [0.6953], [0.4082], [0.6328], [0.3359], [0.4023], [0.5781], [0.3711], [0.2637], [0.2051], [0.1982], [0.1709]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [1.0000], [0.8320], [0.6680], [0.8008], [0.5000], [0.6016], [0.3340], [0.6016], [0.5000], [0.4004], [0.2500], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025787353515625 loss: 0.00145721435546875 loss: 0.000408172607421875 loss: 0.0026397705078125 predicted value: tensor([[0.8320], [0.4473], [1.0156], [0.4590], [0.9961], [0.4121], [0.7500], [0.6133], [0.2637], [0.5664], [0.4023], [0.3926], [0.3789], [0.2266], [0.1855], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [1.0000], [0.4668], [1.0000], [0.3145], [0.6680], [0.6016], [0.2500], [0.6016], [0.2500], [0.4004], 
[0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000835418701171875 loss: 0.0009002685546875 loss: 0.00067138671875 loss: 0.000598907470703125 87%|████████▋ | 426/492 [3:53:15<36:33, 33.23s/it] {'loss': 0.0061, 'learning_rate': 1e-05, 'epoch': 0.87} 87%|████████▋ | 426/492 [3:53:15<36:33, 33.23s/it]predicted value: tensor([[0.6094], [1.0469], [0.4941], [1.0469], [1.0391], [0.2891], [0.3535], [0.8164], [0.8398], [1.0391], [0.2832], [0.4492], [0.0452], [0.2061], [0.2168], [0.1118]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [1.0000], [1.0000], [0.2500], [0.2500], [0.6680], [0.8008], [1.0000], [0.3340], [0.5000], [0.0278], [0.2002], [0.2500], [0.0625]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00087738037109375 loss: 0.000873565673828125loss: 0.00286865234375 loss: 0.00072479248046875 predicted value: tensor([[0.5352], [0.5039], [1.0391], [0.5000], [0.3457], [1.0391], [0.4590], [0.7852], [0.3223], [0.7188], [0.2207], [0.4863], [0.4785], [0.1436], [0.2090], [0.2070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [1.0000], [0.4668], [0.2500], [1.0000], [0.3750], [0.6680], [0.2500], [0.7500], [0.2002], [0.4004], [0.4004], [0.0204], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012054443359375loss: 0.00145721435546875 loss: 0.00144195556640625 loss: 0.0020599365234375 predicted value: tensor([[0.5273], [1.0312], [0.4727], [0.3359], [0.8438], [0.5938], [0.8125], [0.4980], [0.4766], [0.5508], [0.2793], [0.6797], [0.4961], [0.4043], [0.3359], [0.2393]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.3750], [0.3340], [0.8008], [0.5547], [0.8008], [0.4668], [0.4668], [0.6016], [0.2500], [0.6016], [0.5000], [0.4004], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000507354736328125 loss: 0.0015411376953125 loss: 0.00106048583984375 loss: 
0.000926971435546875 predicted value: tensor([[0.8750], [0.5078], [0.4746], [0.8477], [0.4629], [0.8320], [0.4531], [0.5430], [0.3066], [0.6172], [0.6758], [0.4473], [0.4316], [0.1934], [0.1992], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.4668], [0.8008], [0.4668], [0.8008], [0.3750], [0.4648], [0.2500], [0.6016], [0.8008], [0.5000], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000823974609375 loss: 0.00115966796875 loss: 0.00087738037109375 loss: 0.000732421875 87%|████████▋ | 427/492 [3:53:48<36:01, 33.26s/it] {'loss': 0.0048, 'learning_rate': 1e-05, 'epoch': 0.87} 87%|████████▋ | 427/492 [3:53:48<36:01, 33.26s/it]predicted value: tensor([[0.5547], [0.5586], [0.3164], [0.4434], [0.2812], [0.3105], [0.4355], [0.7461], [0.5586], [0.2793], [0.3867], [0.3320], [0.3281], [0.1621], [0.1943], [0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.2500], [0.4668], [0.3340], [0.3340], [0.4668], [0.7148], [0.5000], [0.2002], [0.3340], [0.4004], [0.2500], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00103759765625 loss: 0.004486083984375 loss: 0.00052642822265625 loss: 0.00133514404296875 predicted value: tensor([[1.0156], [0.5859], [0.6602], [0.3691], [0.6484], [1.0234], [0.2754], [0.4414], [0.5352], [0.2559], [0.3652], [0.5508], [0.3789], [0.3945], [0.1865], [0.1631]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.7500], [0.6016], [0.6680], [1.0000], [0.2500], [0.4668], [0.6016], [0.3340], [0.4004], [0.5000], [0.4004], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0005035400390625 loss: 0.00154876708984375loss: 0.0015716552734375 loss: 0.00070953369140625 predicted value: tensor([[1.0078], [0.4297], [0.5078], [1.0234], [0.5078], [0.6992], [1.0000], [0.4336], [0.4570], [0.6094], [0.4102], [0.3848], [0.3887], 
[0.1885], [0.1943], [0.1572]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.4668], [1.0000], [0.4668], [0.6680], [1.0000], [0.4668], [0.4668], [0.7500], [0.5000], [0.4004], [0.4004], [0.1670], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0003871917724609375 loss: 0.000640869140625loss: 0.000835418701171875 loss: 0.000904083251953125 predicted value: tensor([[0.5039], [0.8047], [0.9062], [0.2773], [1.0156], [0.4199], [1.0078], [0.5312], [0.3047], [0.9492], [0.9922], [0.6250], [0.4180], [0.4551], [0.2021], [0.1855]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.8320], [0.2002], [1.0000], [0.4668], [1.0000], [0.4668], [0.3340], [1.0000], [1.0000], [0.6016], [0.5000], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00164031982421875 loss: 0.00124359130859375 loss: 0.0005950927734375 loss: 0.0005645751953125 87%|████████▋ | 428/492 [3:54:21<35:20, 33.13s/it] {'loss': 0.0046, 'learning_rate': 1e-05, 'epoch': 0.87} 87%|████████▋ | 428/492 [3:54:21<35:20, 33.13s/it]predicted value: tensor([[0.3613], [0.5352], [1.0234], [0.8203], [1.0156], [0.2812], [0.3457], [0.4199], [0.6992], [0.9961], [0.6797], [0.4551], [0.3184], [0.2354], [0.2188], [0.2314]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.4648], [1.0000], [0.8008], [1.0000], [0.3340], [0.3340], [0.3340], [0.6680], [1.0000], [0.3750], [0.4668], [0.3340], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002288818359375 loss: 0.000843048095703125 loss: 0.001800537109375loss: 0.0014801025390625 predicted value: tensor([[1.0234], [0.7344], [0.5273], [0.8555], [0.8164], [0.5742], [0.4766], [0.5039], [0.7461], [1.0000], [0.6875], [0.3242], [0.2715], [0.2471], [0.4141], [0.1826]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [0.4648], [0.8008], [0.8008], [0.5547], 
[0.4668], [0.4668], [0.8008], [1.0000], [0.6016], [0.3340], [0.2500], [0.2002], [0.4004], [0.0400]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021209716796875 loss: 0.0013427734375 loss: 0.000743865966796875 loss: 0.00104522705078125 predicted value: tensor([[0.4707], [0.4219], [0.9961], [1.0156], [1.0391], [0.9961], [0.7812], [0.2637], [0.6797], [0.4434], [0.5625], [0.6875], [0.3320], [0.3672], [0.2070], [0.2207]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [1.0000], [1.0000], [1.0000], [1.0000], [0.7500], [0.2002], [0.8008], [0.4004], [0.6016], [0.7500], [0.3340], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013275146484375 loss: 0.0005340576171875 loss: 0.0012054443359375 loss: 0.0011444091796875 predicted value: tensor([[0.6211], [0.7383], [0.8672], [0.5352], [0.3496], [0.6523], [0.4590], [0.5195], [0.4727], [0.4844], [0.6758], [0.4590], [0.5273], [0.4004], [0.2197], [0.2354]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.6680], [0.8320], [0.8008], [0.3340], [0.5000], [0.3750], [0.5000], [0.6016], [0.4668], [0.6016], [0.5000], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002471923828125 loss: 0.0004024505615234375 loss: 0.0031890869140625 loss: 0.000732421875 87%|████████▋ | 429/492 [3:54:53<34:23, 32.75s/it] {'loss': 0.0057, 'learning_rate': 1e-05, 'epoch': 0.87} 87%|████████▋ | 429/492 [3:54:53<34:23, 32.75s/it]predicted value: tensor([[0.5273], [0.5938], [0.2422], [0.3398], [0.9766], [0.4609], [0.9609], [0.9727], [0.4941], [0.5195], [0.2285], [0.4277], [0.5781], [0.7305], [0.2188], [0.2168]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.2500], [0.2500], [1.0000], [0.2002], [1.0000], [1.0000], [0.5000], [0.6680], [0.2500], [0.4004], [0.5000], [0.8008], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00188446044921875 loss: 
0.00185394287109375 loss: 0.00250244140625 loss: 0.000637054443359375 predicted value: tensor([[0.9922], [0.7969], [0.5117], [0.4414], [0.4199], [0.9492], [0.9805], [0.6953], [0.7188], [0.6484], [0.5586], [0.3555], [0.4336], [0.4902], [0.2021], [0.2246]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.5547], [0.4668], [0.3750], [1.0000], [1.0000], [0.6680], [0.7500], [0.6680], [0.6016], [0.2852], [0.3340], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0017242431640625 loss: 0.000659942626953125 loss: 0.0004425048828125 loss: 0.000545501708984375 predicted value: tensor([[0.8828], [0.9609], [0.4492], [0.4043], [0.4062], [0.6250], [0.2949], [0.2490], [0.4238], [0.5508], [0.6367], [0.4844], [0.4180], [0.6094], [0.2158], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.4668], [0.3750], [0.3750], [0.5000], [0.2500], [0.3340], [0.4668], [0.6016], [0.6016], [0.5000], [0.4004], [0.6016], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00116729736328125 loss: 0.00112152099609375 loss: 0.000652313232421875 loss: 0.00115203857421875 predicted value: tensor([[0.8164], [0.4277], [0.2637], [0.4062], [0.3984], [0.5508], [0.5391], [0.2617], [0.6328], [0.2598], [0.4941], [0.6250], [0.4199], [0.6875], [0.2256], [0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.2500], [0.4668], [0.4668], [0.6016], [0.5547], [0.2002], [0.6016], [0.2500], [0.2500], [0.6016], [0.4004], [0.7500], [0.2002], [0.0625]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000545501708984375 loss: 0.001861572265625 loss: 0.000598907470703125 loss: 0.00151824951171875 87%|████████▋ | 430/492 [3:55:25<33:26, 32.36s/it] {'loss': 0.0047, 'learning_rate': 1e-05, 'epoch': 0.87} 87%|████████▋ | 430/492 [3:55:25<33:26, 32.36s/it]predicted value: tensor([[0.9727], [0.2070], [0.6875], [0.9102], [0.3145], [0.4844], 
[0.9570], [0.7617], [0.4648], [0.6914], [0.5547], [0.1660], [0.1128], [0.1338], [0.1572], [0.1660]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [0.8008], [1.0000], [0.3750], [0.6016], [1.0000], [0.8320], [0.5000], [0.8008], [0.6016], [0.2002], [0.2002], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0035247802734375 loss: 0.00107574462890625 loss: 0.001800537109375 loss: 0.002899169921875 predicted value: tensor([[0.9258], [0.7148], [0.9141], [0.2852], [0.9258], [0.9141], [0.6484], [0.3379], [0.9258], [0.4590], [0.5508], [0.0143], [0.3691], [0.1289], [0.1523], [0.1338]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [1.0000], [0.4668], [1.0000], [1.0000], [0.6680], [0.4668], [1.0000], [0.6016], [0.7500], [0.0625], [0.4004], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00286865234375 loss: 0.00274658203125loss: 0.0020904541015625 loss: 0.00250244140625 predicted value: tensor([[0.9297], [0.6797], [0.2832], [0.5781], [0.8008], [0.9141], [0.1060], [0.7109], [0.6758], [0.5195], [0.3887], [0.9023], [0.3008], [0.3359], [0.1226], [0.2061]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [0.3750], [0.8008], [0.8320], [1.0000], [0.2500], [0.8008], [0.8008], [0.6016], [0.5000], [1.0000], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0017547607421875 loss: 0.0029296875 loss: 0.0030975341796875 loss: 0.00543212890625 predicted value: tensor([[0.9023], [0.9492], [0.7891], [0.3145], [0.9258], [0.4785], [0.2695], [0.4785], [0.4902], [0.9180], [0.4238], [0.3516], [0.3359], [0.3555], [0.1387], [0.1030]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.8320], [0.4668], [1.0000], [0.5547], [0.3750], [0.3340], [0.6016], [1.0000], [0.5000], [0.5000], [0.4004], [0.5000], [0.2500], [0.1670]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.001983642578125 loss: 0.0050048828125 loss: 0.0026397705078125 loss: 0.0022735595703125 88%|████████▊ | 431/492 [3:55:58<33:04, 32.54s/it] {'loss': 0.0112, 'learning_rate': 1e-05, 'epoch': 0.88} 88%|████████▊ | 431/492 [3:55:58<33:04, 32.54s/it]predicted value: tensor([[0.5234], [0.7461], [0.1738], [0.4141], [0.5625], [0.1738], [0.6602], [0.5938], [0.3438], [0.1816], [0.2930], [0.2910], [0.3574], [0.3203], [0.6523], [0.1206]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.8320], [0.2500], [0.4648], [0.6680], [0.3340], [0.8008], [0.6016], [0.5000], [0.3340], [0.4004], [0.2500], [0.4004], [0.3340], [0.7500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00433349609375 loss: 0.00250244140625 loss: 0.003173828125 loss: 0.00433349609375 predicted value: tensor([[0.9453], [0.1406], [0.3574], [0.1973], [0.1816], [0.9570], [0.3008], [0.9492], [0.5430], [0.6484], [0.9219], [0.6016], [0.2930], [0.3223], [0.1279], [0.1602]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [0.4668], [0.2500], [0.3340], [1.0000], [0.4668], [1.0000], [0.6016], [0.7500], [1.0000], [0.6016], [0.5000], [0.4004], [0.2002], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00225830078125 loss: 0.00311279296875 loss: 0.004150390625 loss: 0.00225830078125 predicted value: tensor([[0.9414], [0.0962], [0.3613], [0.6328], [0.4375], [0.6953], [0.2910], [0.7266], [0.2754], [0.9180], [0.4570], [0.5703], [0.7148], [0.2100], [0.1406], [0.1309]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.4668], [0.8008], [0.4668], [0.8008], [0.4668], [0.8008], [0.3750], [1.0000], [0.6016], [0.7500], [0.8008], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002899169921875 loss: 0.00274658203125 loss: 0.0033721923828125 loss: 0.002288818359375 predicted value: tensor([[0.4473], [0.3730], [0.9570], 
[0.3438], [0.7617], [0.9297], [0.7852], [0.9570], [0.6328], [0.4512], [0.9414], [0.3457], [0.3359], [0.3242], [0.1270], [0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [1.0000], [0.4668], [0.8320], [1.0000], [0.8320], [1.0000], [0.7500], [0.6016], [1.0000], [0.5000], [0.4004], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021514892578125 loss: 0.003570556640625 loss: 0.00238037109375 loss: 0.0029144287109375 88%|████████▊ | 432/492 [3:56:30<32:32, 32.54s/it] {'loss': 0.0121, 'learning_rate': 1e-05, 'epoch': 0.88} 88%|████████▊ | 432/492 [3:56:30<32:32, 32.54s/it]predicted value: tensor([[0.1855], [1.0391], [0.5039], [1.0234], [0.9844], [1.0469], [1.0234], [0.7188], [0.1660], [1.0312], [0.5156], [0.4043], [0.2891], [0.3906], [0.0564], [0.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [1.0000], [0.4668], [1.0000], [1.0000], [1.0000], [1.0000], [0.4668], [0.2500], [1.0000], [0.5000], [0.3340], [0.2500], [0.5000], [0.0400], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00128173828125 loss: 0.00152587890625loss: 0.00144195556640625 loss: 0.0008087158203125 predicted value: tensor([[0.8281], [1.0625], [1.0391], [0.5156], [0.7148], [0.5156], [0.6328], [0.4062], [0.4180], [0.6562], [0.3457], [0.6094], [0.4766], [0.2363], [0.2002], [0.2148]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [1.0000], [0.6016], [0.6680], [0.5547], [0.6016], [0.4668], [0.4004], [0.6680], [0.4004], [0.2500], [0.6016], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00274658203125 loss: 0.000949859619140625loss: 0.002685546875 loss: 0.00390625 predicted value: tensor([[0.4180], [0.4746], [0.7148], [0.3281], [0.3887], [0.6602], [0.3945], [0.3262], [0.2871], [0.2441], [0.3398], [0.3496], [0.3730], [0.1875], [0.1865], [0.1621]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.4668], [0.5547], [0.4668], [0.3750], [0.4668], [0.6680], [0.4668], [0.3750], [0.6016], [0.3340], [0.5000], [0.2500], [0.4004], [0.1670], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00174713134765625 loss: 0.00360107421875 loss: 0.0042724609375 loss: 0.0030670166015625 predicted value: tensor([[0.5586], [0.4043], [1.0234], [1.0312], [0.3906], [0.5820], [0.5234], [0.3965], [0.9961], [0.5195], [0.6914], [0.4277], [0.3730], [0.1885], [0.1885], [0.1729]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.3750], [1.0000], [1.0000], [0.4668], [0.6016], [0.6016], [0.3750], [1.0000], [0.6016], [0.8008], [0.5000], [0.4004], [0.1670], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001983642578125 loss: 0.00060272216796875 loss: 0.00107574462890625 loss: 0.00083160400390625 88%|████████▊ | 433/492 [3:57:03<32:03, 32.60s/it] {'loss': 0.0081, 'learning_rate': 1e-05, 'epoch': 0.88} 88%|████████▊ | 433/492 [3:57:03<32:03, 32.60s/it]predicted value: tensor([[0.4688], [0.4590], [0.7773], [0.3301], [1.1172], [0.7422], [1.1562], [0.5352], [0.3574], [0.5156], [0.8086], [1.1172], [0.8750], [0.5664], [0.3262], [0.3125]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.6680], [0.2500], [1.0000], [0.6680], [1.0000], [0.3750], [0.2500], [0.5000], [0.7500], [1.0000], [0.8008], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00213623046875 loss: 0.0048828125 loss: 0.0025482177734375 loss: 0.003997802734375 predicted value: tensor([[0.4316], [0.4883], [0.8672], [0.6562], [0.6719], [0.5000], [0.7695], [0.8477], [0.6953], [0.7773], [0.6211], [0.6016], [0.5156], [0.3066], [0.3164], [0.3223]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [0.4668], [0.5547], [0.3145], [0.7500], [0.8008], [0.6016], [0.7500], [0.5000], [0.2500], [0.4004], [0.1670], [0.2500], [0.2002]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.0024871826171875 loss: 0.00469970703125 loss: 0.0047607421875 loss: 0.0035247802734375 predicted value: tensor([[0.9727], [1.1328], [1.1094], [0.8672], [0.5977], [0.4941], [0.3301], [0.6953], [0.5547], [0.3613], [0.8047], [0.7344], [0.7305], [0.4258], [0.5469], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [1.0000], [1.0000], [0.8008], [0.4668], [0.2500], [0.3340], [0.6016], [0.5000], [0.2500], [0.7500], [0.6016], [0.7500], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00555419921875 loss: 0.002838134765625 loss: 0.0042724609375 loss: 0.0027618408203125 predicted value: tensor([[0.9102], [0.6172], [0.9219], [0.5273], [0.5234], [0.7266], [0.8789], [0.6602], [1.1484], [0.5117], [1.1250], [0.9180], [0.6406], [0.5117], [0.3457], [0.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.5547], [0.8320], [0.4668], [0.4668], [0.6680], [0.8008], [0.5000], [1.0000], [0.5000], [1.0000], [0.8008], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027008056640625 loss: 0.0029296875 loss: 0.00457763671875 loss: 0.003265380859375 88%|████████▊ | 434/492 [3:57:35<31:17, 32.36s/it] {'loss': 0.0145, 'learning_rate': 1e-05, 'epoch': 0.88} 88%|████████▊ | 434/492 [3:57:35<31:17, 32.36s/it]predicted value: tensor([[0.9844], [1.1797], [0.5820], [0.5234], [0.5781], [0.3945], [0.5273], [0.6836], [0.7227], [1.1641], [0.6953], [0.7070], [0.5508], [0.5039], [0.3887], [0.3555]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [1.0000], [0.4668], [0.3750], [0.3340], [0.3340], [0.4668], [0.5547], [0.6016], [1.0000], [0.2500], [0.6016], [0.3340], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01116943359375 loss: 0.006622314453125 loss: 0.00799560546875 loss: 0.0079345703125 predicted value: tensor([[0.9258], [0.9688], [0.5273], [0.3711], [0.7344], 
[1.1562], [1.1953], [0.8594], [0.6523], [0.6875], [0.7969], [0.5273], [0.5391], [0.5625], [0.3594], [0.3574]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.8320], [0.3340], [0.2500], [0.6016], [1.0000], [1.0000], [0.6680], [0.4668], [0.6016], [0.6016], [0.4004], [0.4004], [0.4004], [0.2500], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007568359375 loss: 0.0101318359375 loss: 0.006195068359375 loss: 0.01116943359375 predicted value: tensor([[0.5586], [1.1562], [0.5781], [1.1797], [0.3984], [1.1875], [0.4277], [1.1797], [1.1172], [0.7930], [0.4492], [1.1641], [0.5859], [0.5078], [0.6680], [0.3457]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [1.0000], [0.2002], [1.0000], [0.2500], [1.0000], [1.0000], [0.7500], [0.2500], [1.0000], [0.4004], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00677490234375 loss: 0.005706787109375 loss: 0.006134033203125 loss: 0.0064697265625 predicted value: tensor([[0.5469], [1.1641], [0.6211], [1.1719], [0.6484], [0.9609], [0.5312], [0.6797], [1.1562], [0.4023], [0.5312], [1.1875], [0.5508], [0.6484], [0.5273], [0.3711]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.4668], [1.0000], [0.5547], [0.8320], [0.4668], [0.5000], [1.0000], [0.3340], [0.3340], [1.0000], [0.4004], [0.5000], [0.2852], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00787353515625 loss: 0.0089111328125 loss: 0.0047607421875 loss: 0.006195068359375 88%|████████▊ | 435/492 [3:58:07<30:36, 32.22s/it] {'loss': 0.0304, 'learning_rate': 1e-05, 'epoch': 0.88} 88%|████████▊ | 435/492 [3:58:07<30:36, 32.22s/it]predicted value: tensor([[0.7891], [0.5352], [0.6797], [1.1172], [0.6406], [0.9258], [0.7500], [0.5977], [0.7422], [0.7891], [0.5625], [0.4375], [0.3594], [0.3535], [0.5547], [0.3340]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [0.4668], 
[0.4668], [1.0000], [0.5547], [0.8320], [0.7500], [0.3750], [0.6016], [0.7500], [0.5000], [0.3340], [0.2002], [0.1670], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004241943359375 loss: 0.003875732421875 loss: 0.003936767578125 loss: 0.00299072265625 predicted value: tensor([[0.6406], [0.6484], [1.1484], [0.4824], [0.5586], [0.4082], [0.8359], [1.1250], [0.8281], [0.6680], [0.8711], [0.4922], [1.1094], [0.3535], [0.3555], [0.5273]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [1.0000], [0.4668], [0.4668], [0.2002], [0.8320], [1.0000], [0.7500], [0.5000], [0.8008], [0.4004], [1.0000], [0.2002], [0.1670], [0.0278]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00421142578125 loss: 0.006805419921875 loss: 0.007415771484375 loss: 0.0037384033203125 predicted value: tensor([[0.3730], [1.1641], [0.8906], [0.9180], [0.4336], [1.1562], [0.6758], [0.4297], [0.6641], [0.5625], [0.7852], [0.5547], [0.4570], [0.3438], [0.3281], [0.3516]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [1.0000], [0.8008], [0.8320], [0.3340], [1.0000], [0.6016], [0.3340], [0.5000], [0.5000], [0.6016], [0.4004], [0.4004], [0.1670], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033111572265625 loss: 0.00433349609375loss: 0.00347900390625 loss: 0.004974365234375 predicted value: tensor([[0.5234], [0.4824], [0.6211], [0.6836], [0.5547], [0.8203], [0.8359], [0.8047], [0.6992], [0.5742], [0.3887], [0.5117], [0.5820], [0.5352], [0.3516], [0.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8008], [0.4668], [0.4668], [0.7500], [0.7500], [0.7500], [0.4277], [0.5000], [0.2002], [0.2852], [0.4004], [0.4004], [0.1113], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0068359375 loss: 0.00628662109375 loss: 0.00604248046875 loss: 0.0047607421875 89%|████████▊ | 436/492 [3:58:39<30:11, 32.35s/it] {'loss': 0.0193, 
'learning_rate': 1e-05, 'epoch': 0.89} 89%|████████▊ | 436/492 [3:58:39<30:11, 32.35s/it]predicted value: tensor([[1.0781], [0.6641], [0.7617], [0.2500], [0.4863], [0.7422], [1.0391], [0.3887], [0.5156], [0.5977], [0.5508], [0.3164], [0.3926], [0.2734], [0.2695], [0.2314]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [0.8008], [0.2500], [0.3750], [0.8320], [1.0000], [0.2500], [0.4648], [0.2500], [0.5000], [0.2500], [0.3340], [0.1670], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015716552734375 loss: 0.0032196044921875 loss: 0.0021209716796875 loss: 0.0011749267578125 predicted value: tensor([[0.7656], [0.7266], [0.4453], [0.4961], [0.5312], [0.4609], [0.4844], [0.3320], [0.5820], [0.4023], [0.4023], [0.3105], [0.4746], [0.2559], [0.2500], [0.2637]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.6680], [0.4668], [0.4668], [0.4668], [0.3750], [0.3750], [0.2500], [0.5000], [0.3340], [0.4004], [0.4004], [0.4004], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00131988525390625 loss: 0.00115203857421875 loss: 0.00133514404296875 loss: 0.000946044921875 predicted value: tensor([[1.0234], [0.5703], [0.5625], [0.4082], [1.0547], [0.5586], [0.7148], [0.7578], [0.5312], [0.4922], [1.0469], [1.0547], [0.4570], [0.0801], [0.4648], [0.2773]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.5547], [0.4668], [1.0000], [0.6016], [0.6016], [0.8008], [0.6016], [0.3750], [1.0000], [1.0000], [0.4004], [0.0400], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011444091796875 loss: 0.00106048583984375 loss: 0.0009765625 loss: 0.0005950927734375 predicted value: tensor([[0.8359], [0.4336], [0.4453], [1.0703], [0.8320], [0.7148], [0.5391], [0.4082], [0.2559], [0.3164], [0.4590], [0.3984], [0.4375], [1.0469], [0.0654], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value 
label: tensor([[0.8320], [0.4668], [0.3750], [1.0000], [0.8320], [0.8008], [0.6016], [0.3750], [0.3340], [0.2500], [0.4668], [0.1670], [0.4004], [1.0000], [0.0400], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026397705078125 loss: 0.0014801025390625 loss: 0.0016937255859375 loss: 0.00156402587890625 89%|████████▉ | 437/492 [3:59:11<29:36, 32.29s/it] {'loss': 0.006, 'learning_rate': 1e-05, 'epoch': 0.89} 89%|████████▉ | 437/492 [3:59:11<29:36, 32.29s/it]predicted value: tensor([[0.9414], [0.6562], [0.3027], [0.9023], [0.6914], [0.9023], [0.3828], [0.1963], [0.8789], [0.3242], [0.5742], [0.4824], [0.4980], [0.1152], [0.1416], [0.1504]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [0.4668], [1.0000], [0.8320], [1.0000], [0.4668], [0.2002], [1.0000], [0.5000], [0.8008], [0.7500], [0.7500], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0050048828125 loss: 0.00341796875 loss: 0.005462646484375 loss: 0.0036773681640625 predicted value: tensor([[0.4512], [0.3418], [0.3691], [0.4629], [0.9062], [0.6367], [0.5977], [0.8750], [0.3848], [0.4805], [0.4766], [0.1846], [0.3027], [0.1357], [0.1289], [0.1689]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.7500], [1.0000], [0.8008], [0.6680], [1.0000], [0.6016], [0.7500], [0.5000], [0.2500], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006591796875 loss: 0.0036163330078125 loss: 0.005401611328125 loss: 0.0081787109375 predicted value: tensor([[0.4805], [0.2930], [0.3320], [0.6367], [0.3496], [0.1582], [0.1904], [0.6641], [0.1953], [0.1055], [0.1689], [0.4199], [0.3066], [0.3750], [0.3184], [0.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5312], [0.4668], [0.4668], [0.8008], [0.4668], [0.3340], [0.2500], [0.8008], [0.3340], [0.2500], [0.1670], [0.5000], [0.4004], [0.5000], [0.4004], [0.1670]], 
device='cuda:0', dtype=torch.bfloat16) loss: 0.00457763671875 loss: 0.00347900390625loss: 0.003021240234375 loss: 0.005523681640625 predicted value: tensor([[ 0.4355], [ 0.9023], [ 0.2793], [ 0.3105], [ 0.3262], [-0.0225], [ 0.1924], [ 0.5234], [ 0.3516], [ 0.5391], [ 0.4746], [ 0.3516], [ 0.4277], [ 0.4609], [ 0.1270], [ 0.1270]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.3750], [0.3750], [0.4668], [0.0278], [0.3340], [0.6680], [0.4668], [0.5703], [0.7500], [0.4004], [0.6016], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003662109375 loss: 0.004486083984375 loss: 0.004974365234375 loss: 0.0036163330078125 89%|████████▉ | 438/492 [3:59:43<28:57, 32.18s/it] {'loss': 0.0187, 'learning_rate': 1e-05, 'epoch': 0.89} 89%|████████▉ | 438/492 [3:59:43<28:57, 32.18s/it]predicted value: tensor([[0.4609], [0.5820], [0.5391], [0.2539], [0.3262], [0.2910], [0.2090], [0.4492], [0.8164], [0.5508], [0.3613], [0.8359], [0.1064], [0.0874], [0.0679], [0.0840]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6250], [0.6680], [0.8008], [0.4668], [0.5547], [0.3750], [0.3340], [0.7500], [1.0000], [0.8008], [0.6016], [1.0000], [0.2002], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00775146484375 loss: 0.0087890625 loss: 0.00897216796875 loss: 0.01080322265625 predicted value: tensor([[0.3594], [0.0713], [0.8398], [0.5156], [0.8477], [0.8281], [0.5742], [0.5664], [0.4688], [0.3008], [0.3105], [0.2158], [0.2471], [0.0444], [0.0635], [0.0684]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [1.0000], [0.7148], [1.0000], [1.0000], [0.8008], [0.7148], [0.7500], [0.5000], [0.6016], [0.3340], [0.5000], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00994873046875 loss: 0.0123291015625 loss: 0.00970458984375 loss: 0.01031494140625 predicted value: tensor([[0.3379], [0.8320], 
[0.2656], [0.3047], [0.8555], [0.8320], [0.4629], [0.2236], [0.3965], [0.8320], [0.1455], [0.1895], [0.2891], [0.2832], [0.0596], [0.0859]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.4668], [1.0000], [1.0000], [0.5703], [0.2715], [0.6680], [1.0000], [0.3340], [0.3340], [0.4004], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.010986328125 loss: 0.0096435546875 loss: 0.007110595703125 loss: 0.00787353515625 predicted value: tensor([[0.3008], [0.2734], [0.2891], [0.8203], [0.8398], [0.1074], [0.3867], [0.5352], [0.1138], [0.4648], [0.2051], [0.4141], [0.2949], [0.0598], [0.0564], [0.0801]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [1.0000], [1.0000], [0.2500], [0.6016], [0.8008], [0.2500], [0.6680], [0.4004], [0.7500], [0.6016], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00830078125 loss: 0.0072021484375 loss: 0.01068115234375loss: 0.00927734375 89%|████████▉ | 439/492 [4:00:16<28:34, 32.34s/it] {'loss': 0.0374, 'learning_rate': 1e-05, 'epoch': 0.89} 89%|████████▉ | 439/492 [4:00:16<28:34, 32.34s/it]predicted value: tensor([[0.8555], [0.2871], [0.2637], [0.1660], [0.3320], [0.8047], [0.1768], [0.4160], [0.3438], [0.2637], [0.2168], [0.2832], [0.4238], [0.0747], [0.0884], [0.1045]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3750], [0.3340], [0.6016], [1.0000], [0.2002], [0.6016], [0.4668], [0.6016], [0.4004], [0.5000], [0.6016], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0079345703125 loss: 0.00811767578125loss: 0.0091552734375 loss: 0.007781982421875 predicted value: tensor([[0.3125], [0.8555], [0.2676], [0.3203], [0.6328], [0.1553], [0.8516], [0.1089], [0.1299], [0.2441], [0.2891], [0.4414], [0.2275], [0.0747], [0.0708], [0.0977]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.4668], [1.0000], [0.4668], [0.4668], [0.8320], [0.3340], [1.0000], [0.2500], [0.2500], [0.3750], [0.5000], [0.5000], [0.3340], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0084228515625 loss: 0.00872802734375 loss: 0.005584716796875 loss: 0.006927490234375 predicted value: tensor([[0.6562], [0.2656], [0.4004], [0.2480], [0.5195], [0.3809], [0.2715], [0.2656], [0.8320], [0.8242], [0.4316], [0.2100], [0.4062], [0.2324], [0.0967], [0.0903]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.4668], [0.4668], [0.8008], [0.6016], [0.4668], [0.4668], [1.0000], [1.0000], [0.6016], [0.4004], [0.6016], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.013671875 loss: 0.00543212890625loss: 0.0084228515625 loss: 0.007659912109375 predicted value: tensor([[0.2910], [0.4102], [0.2168], [0.0967], [0.8359], [0.4277], [0.8320], [0.3281], [0.8164], [0.8281], [0.3262], [0.1885], [0.2422], [0.0981], [0.0850], [0.0378]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.3750], [0.2002], [1.0000], [0.5703], [1.0000], [0.5000], [1.0000], [1.0000], [0.5000], [0.3340], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01068115234375 loss: 0.00775146484375 loss: 0.006195068359375loss: 0.0091552734375 89%|████████▉ | 440/492 [4:00:49<28:04, 32.40s/it] {'loss': 0.0329, 'learning_rate': 1e-05, 'epoch': 0.89} 89%|████████▉ | 440/492 [4:00:49<28:04, 32.40s/it]predicted value: tensor([[0.2139], [0.8789], [0.8906], [0.3340], [0.8828], [0.3535], [0.3848], [0.1680], [0.9023], [0.6719], [0.4473], [0.4121], [0.2773], [0.1108], [0.1260], [0.1416]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [1.0000], [1.0000], [0.3750], [1.0000], [0.3750], [0.4668], [0.2500], [1.0000], [0.8008], [0.6016], [0.5000], [0.7500], [0.1670], [0.2002], [0.1670]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.003173828125 loss: 0.00579833984375 loss: 0.00543212890625 loss: 0.005157470703125 predicted value: tensor([[0.5078], [0.4355], [0.4512], [0.3379], [0.3613], [0.8906], [0.5703], [0.8906], [0.8984], [0.3340], [0.0471], [0.5742], [0.3633], [0.2734], [0.1533], [0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.5547], [0.5547], [0.4668], [0.4668], [1.0000], [0.6680], [1.0000], [1.0000], [0.3750], [0.0625], [0.7500], [0.5000], [0.0625], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00194549560546875 loss: 0.0028533935546875 loss: 0.0032806396484375 loss: 0.00396728515625 predicted value: tensor([[0.4004], [0.2227], [0.3984], [0.1934], [0.6289], [0.4258], [0.9023], [0.8828], [0.3535], [0.4707], [0.5391], [0.5938], [0.4180], [0.1235], [0.1289], [0.1162]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.4668], [0.2500], [0.8320], [0.4668], [1.0000], [1.0000], [0.5000], [0.5547], [0.7500], [0.7500], [0.5000], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006103515625 loss: 0.00341796875 loss: 0.0029754638671875 loss: 0.0038604736328125 predicted value: tensor([[0.3477], [0.6758], [0.3984], [0.2119], [0.1240], [0.9219], [0.5117], [0.8867], [0.2559], [0.3242], [0.3926], [0.3633], [0.3008], [0.4062], [0.1167], [0.3887]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.4668], [0.2500], [0.1426], [1.0000], [0.7500], [1.0000], [0.3340], [0.4004], [0.5000], [0.5000], [0.4004], [0.5000], [0.2002], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003662109375 loss: 0.00537109375 loss: 0.004486083984375 loss: 0.0029449462890625 90%|████████▉ | 441/492 [4:01:22<27:52, 32.80s/it] {'loss': 0.0161, 'learning_rate': 1e-05, 'epoch': 0.9} 90%|████████▉ | 441/492 [4:01:22<27:52, 32.80s/it]predicted value: tensor([[0.4668], [0.8086], [0.3496], [0.6797], [0.4980], 
[0.3008], [0.2988], [0.3945], [0.6992], [0.6055], [0.3008], [0.6953], [0.3691], [0.2715], [0.2637], [0.3262]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.2500], [0.4668], [0.4668], [0.2500], [0.2500], [0.3340], [0.7500], [0.6016], [0.3340], [0.6680], [0.2500], [0.2002], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023345947265625 loss: 0.0016326904296875loss: 0.0023193359375 loss: 0.00089263916015625 predicted value: tensor([[1.0156], [0.3438], [0.6250], [0.7305], [0.3145], [1.0078], [1.0234], [0.7734], [0.6328], [0.6523], [0.3203], [0.7578], [0.4434], [0.5898], [0.4355], [0.2910]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [0.6680], [0.8320], [0.2500], [1.0000], [1.0000], [0.8008], [0.6680], [0.7500], [0.2500], [0.5703], [0.5000], [0.6016], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002166748046875 loss: 0.00164031982421875 loss: 0.00125885009765625 loss: 0.00151824951171875 predicted value: tensor([[0.5586], [0.6602], [0.7578], [1.0078], [1.0156], [0.5156], [0.4180], [0.5781], [0.6133], [0.6055], [0.5312], [0.2119], [0.4199], [0.4902], [0.3066], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.4648], [0.8008], [1.0000], [1.0000], [0.4668], [0.5000], [0.6016], [0.6016], [0.2500], [0.4668], [0.0625], [0.5000], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011444091796875 loss: 0.0034027099609375 loss: 0.003936767578125 loss: 0.00167083740234375 predicted value: tensor([[0.5898], [1.0156], [0.7773], [0.5117], [1.0000], [0.6133], [0.5430], [0.6836], [1.0234], [0.7539], [0.3691], [0.5820], [0.2334], [0.4414], [0.2637], [0.2402]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8008], [0.4668], [1.0000], [0.5547], [0.4668], [0.6016], [1.0000], [0.8008], [0.2500], [0.2500], [0.4004], [0.4004], [0.2500], 
[0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00140380859375 loss: 0.00148773193359375 loss: 0.00145721435546875 loss: 0.0027618408203125 90%|████████▉ | 442/492 [4:01:55<27:24, 32.89s/it] {'loss': 0.0078, 'learning_rate': 1e-05, 'epoch': 0.9} 90%|████████▉ | 442/492 [4:01:55<27:24, 32.89s/it]predicted value: tensor([[0.6250], [0.6484], [0.5586], [0.5742], [0.7539], [0.8516], [0.3594], [0.4082], [0.2793], [0.4980], [0.5039], [0.7070], [0.2871], [0.4863], [0.4844], [0.2949]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [0.4668], [0.6016], [0.8008], [0.3340], [0.2500], [0.4004], [0.5000], [0.4004], [0.6016], [0.2002], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027618408203125 loss: 0.00238037109375 loss: 0.002655029296875 loss: 0.00238037109375 predicted value: tensor([[0.5703], [0.6680], [0.3887], [1.0469], [1.0547], [0.8281], [0.8320], [0.7539], [0.5938], [0.6133], [0.4922], [0.5898], [0.3418], [0.2988], [0.3203], [0.3086]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.2500], [1.0000], [1.0000], [0.8008], [0.8008], [0.7500], [0.4668], [0.6016], [0.3340], [0.5000], [0.2002], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033416748046875 loss: 0.00244140625 loss: 0.00390625 loss: 0.003173828125 predicted value: tensor([[0.6328], [1.0469], [0.8203], [1.0938], [0.7109], [0.6406], [0.8242], [0.5781], [0.4238], [0.7305], [0.5742], [0.5234], [0.3906], [0.5039], [0.3301], [0.3301]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.8008], [1.0000], [0.6016], [0.3750], [0.8008], [0.4668], [0.2500], [0.7500], [0.6016], [0.4004], [0.4004], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004547119140625 loss: 0.003021240234375 loss: 0.002166748046875 loss: 0.0030364990234375 predicted value: tensor([[0.7305], [0.5586], 
[0.4395], [0.5742], [0.5547], [0.8633], [1.0703], [0.6250], [0.5039], [0.4141], [0.5703], [0.5039], [0.5391], [0.4668], [0.3320], [0.3496]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.3750], [0.2500], [0.4668], [0.4668], [0.8008], [1.0000], [0.4668], [0.4668], [0.2500], [0.5000], [0.3340], [0.4004], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00372314453125 loss: 0.003387451171875 loss: 0.0037384033203125 loss: 0.0023956298828125 90%|█████████ | 443/492 [4:02:29<27:03, 33.12s/it] {'loss': 0.0123, 'learning_rate': 1e-05, 'epoch': 0.9} 90%|█████████ | 443/492 [4:02:29<27:03, 33.12s/it]predicted value: tensor([[0.5586], [0.5039], [0.5508], [1.0469], [0.6289], [0.6992], [0.6719], [1.0547], [1.0547], [0.3594], [0.6094], [0.5586], [0.6016], [0.5312], [0.5508], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.4668], [1.0000], [0.5000], [0.6016], [0.6016], [1.0000], [1.0000], [0.2500], [0.4668], [0.5000], [0.6016], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023956298828125 loss: 0.0012054443359375 loss: 0.0023956298828125 loss: 0.002838134765625 predicted value: tensor([[0.4902], [0.4727], [0.6875], [0.5078], [0.5391], [1.0312], [0.3984], [0.7969], [0.5781], [1.0625], [1.0391], [0.2480], [0.5195], [0.2500], [0.2910], [0.5039]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3750], [0.7148], [0.4668], [0.4668], [1.0000], [0.2500], [0.8008], [0.7500], [1.0000], [1.0000], [0.5000], [0.4004], [0.2002], [0.2002], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0033721923828125 loss: 0.002685546875 loss: 0.0027618408203125 loss: 0.002716064453125 predicted value: tensor([[0.8242], [0.3750], [0.3691], [0.5156], [0.7617], [0.3477], [1.0469], [0.7188], [0.7539], [0.4902], [0.6719], [0.6016], [0.6641], [0.2656], [0.2773], [0.2891]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) value label: tensor([[0.8008], [0.2500], [0.2500], [0.4668], [0.8008], [0.2002], [1.0000], [0.6016], [0.8008], [0.4004], [0.6016], [0.6016], [0.7500], [0.1670], [0.1670], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0012664794921875 loss: 0.0019683837890625 loss: 0.002044677734375 loss: 0.002227783203125 predicted value: tensor([[0.5898], [1.0156], [1.0625], [0.7617], [0.5820], [0.6211], [1.0391], [1.0625], [0.4141], [0.3672], [0.7070], [0.7695], [0.7461], [0.5039], [0.2910], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.6680], [0.3750], [0.5547], [1.0000], [1.0000], [0.3340], [0.2500], [0.6016], [0.6680], [0.7500], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023040771484375 loss: 0.0031280517578125 loss: 0.00244140625 loss: 0.00180816650390625 90%|█████████ | 444/492 [4:03:02<26:33, 33.19s/it] {'loss': 0.0094, 'learning_rate': 1e-05, 'epoch': 0.9} 90%|█████████ | 444/492 [4:03:02<26:33, 33.19s/it]predicted value: tensor([[0.4707], [0.4648], [0.2383], [0.6562], [0.4570], [0.9805], [0.6719], [0.3164], [0.2412], [0.6328], [0.4473], [0.4082], [0.6367], [0.4043], [0.2090], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.2500], [0.4668], [0.4668], [1.0000], [0.6680], [0.3340], [0.2500], [0.7500], [0.5000], [0.4004], [0.7500], [0.3340], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001678466796875 loss: 0.0011444091796875loss: 0.00179290771484375 loss: 0.00067138671875 predicted value: tensor([[0.9766], [0.4707], [0.4785], [0.7461], [0.4922], [0.4473], [0.6133], [0.9609], [0.6250], [0.6602], [0.6875], [0.4160], [0.4141], [0.4473], [0.2217], [0.1982]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.8008], [0.4668], [0.4668], [0.6016], [1.0000], [0.6016], [0.6016], [0.7500], [0.4004], [0.4004], [0.5000], [0.2500], 
[0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000286102294921875loss: 0.00179290771484375 loss: 0.00069427490234375 loss: 0.000873565673828125 predicted value: tensor([[0.5625], [0.9688], [0.4219], [0.5430], [0.6641], [0.7031], [0.3984], [0.9922], [0.5234], [0.2852], [0.2969], [0.4121], [0.6758], [0.1982], [0.1904], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.5547], [0.8008], [0.6680], [0.3750], [1.0000], [0.5703], [0.2002], [0.3340], [0.4004], [0.7500], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023651123046875 loss: 0.001129150390625 loss: 0.000629425048828125 loss: 0.0027008056640625 predicted value: tensor([[0.4883], [0.4375], [0.4668], [0.6211], [0.9844], [0.9414], [0.7227], [0.5234], [1.0000], [0.4258], [0.9844], [0.4023], [0.3926], [0.3965], [0.2021], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.4668], [0.6172], [1.0000], [1.0000], [0.8008], [0.6016], [1.0000], [0.4004], [1.0000], [0.4004], [0.3340], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010986328125 loss: 0.00140380859375 loss: 0.0007476806640625 loss: 0.000347137451171875 90%|█████████ | 445/492 [4:03:36<26:07, 33.35s/it] {'loss': 0.0048, 'learning_rate': 1e-05, 'epoch': 0.9} 90%|█████████ | 445/492 [4:03:36<26:07, 33.35s/it]predicted value: tensor([[0.6094], [1.0156], [0.4336], [0.4336], [0.2773], [0.2637], [0.2383], [0.6445], [0.5820], [0.4707], [0.4141], [0.3477], [0.1807], [0.1914], [0.1777], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [1.0000], [0.3750], [0.3750], [0.2500], [0.3340], [0.3340], [0.7500], [0.6016], [0.4668], [0.3340], [0.3340], [0.2002], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000644683837890625 loss: 0.00182342529296875 loss: 0.00057220458984375 loss: 0.000576019287109375 predicted value: 
tensor([[0.7852], [0.4688], [0.4336], [0.2637], [0.6133], [0.6094], [0.7305], [0.1572], [0.5039], [0.4121], [0.5117], [0.4238], [0.3594], [0.4199], [0.1748], [0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.3750], [0.3340], [0.8008], [0.6016], [0.8008], [0.2500], [0.5000], [0.3340], [0.4668], [0.4004], [0.3340], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000949859619140625 loss: 0.00164031982421875 loss: 0.0012054443359375 loss: 0.003204345703125 predicted value: tensor([[0.4004], [0.2188], [0.6992], [0.4336], [0.5938], [0.5352], [0.4160], [1.0000], [0.7305], [0.2539], [0.9883], [0.0593], [0.3750], [0.1406], [0.5898], [0.2041]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.7148], [0.4668], [0.8320], [0.5547], [0.4668], [1.0000], [0.7500], [0.2500], [1.0000], [0.0400], [0.3340], [0.2002], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00125885009765625 loss: 0.00093841552734375loss: 0.0021514892578125 loss: 0.0022125244140625 predicted value: tensor([[0.4355], [0.5547], [0.4082], [0.5195], [0.6875], [0.7617], [0.7227], [0.9961], [0.6406], [0.5000], [0.4434], [0.3652], [0.3691], [0.1836], [0.2275], [0.1963]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.4668], [0.5547], [0.6680], [0.8008], [0.6680], [1.0000], [0.6016], [0.5000], [0.4004], [0.3340], [0.3340], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010223388671875 loss: 0.000843048095703125 loss: 0.002532958984375 loss: 0.0002651214599609375 91%|█████████ | 446/492 [4:04:10<25:41, 33.51s/it] {'loss': 0.0055, 'learning_rate': 1e-05, 'epoch': 0.91} 91%|█████████ | 446/492 [4:04:10<25:41, 33.51s/it]predicted value: tensor([[0.6055], [1.0312], [1.0859], [0.2314], [0.4570], [1.0625], [0.8242], [0.4062], [0.6250], [0.2188], [0.3418], [0.5234], [0.4023], [0.2070], [0.1963], 
[0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5430], [1.0000], [1.0000], [0.2500], [0.4668], [1.0000], [0.8008], [0.7500], [0.5000], [0.2002], [0.3340], [0.4004], [0.5000], [0.2002], [0.0400], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00144195556640625 loss: 0.00066375732421875 loss: 0.0031280517578125 loss: 0.0023345947265625 predicted value: tensor([[0.5547], [0.7617], [0.7227], [0.4766], [0.6250], [0.2891], [0.7656], [0.7227], [0.2617], [0.7461], [0.6719], [1.0703], [0.5586], [0.2197], [0.2246], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.8008], [0.3750], [0.7500], [0.3340], [0.5703], [0.6016], [0.2500], [0.7500], [0.6016], [1.0000], [0.3750], [0.2500], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00128173828125 loss: 0.0034027099609375 loss: 0.002197265625 loss: 0.000659942626953125 predicted value: tensor([[0.4395], [1.0547], [0.3516], [0.2227], [0.4902], [0.7656], [0.2695], [0.6523], [0.5078], [0.3652], [0.6875], [0.6758], [0.5078], [0.4141], [0.4961], [0.1865]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.3340], [0.2500], [0.4668], [0.8008], [0.2500], [0.5703], [0.6016], [0.7500], [0.6016], [0.7500], [0.5000], [0.6016], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000652313232421875 loss: 0.003570556640625loss: 0.00153350830078125 loss: 0.000881195068359375 predicted value: tensor([[0.2812], [0.3242], [1.0703], [0.4570], [0.3965], [0.4531], [0.4746], [0.8242], [0.7148], [1.0469], [1.0703], [0.4395], [0.4941], [0.2852], [0.2451], [0.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.3340], [1.0000], [0.4668], [0.3145], [0.4668], [0.4668], [0.8008], [0.7500], [1.0000], [1.0000], [0.4004], [0.4004], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0007476806640625 loss: 
0.00087738037109375 loss: 0.00061798095703125 loss: 0.00089263916015625 91%|█████████ | 447/492 [4:04:43<25:01, 33.36s/it] {'loss': 0.0062, 'learning_rate': 1e-05, 'epoch': 0.91} 91%|█████████ | 447/492 [4:04:43<25:01, 33.36s/it]predicted value: tensor([[0.4141], [0.4395], [0.7891], [0.5586], [0.5664], [0.4395], [0.7773], [0.7109], [0.4199], [0.2480], [0.3145], [0.4707], [0.3809], [0.1768], [0.1982], [0.2168]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.8008], [0.5547], [0.5547], [0.4668], [0.7500], [0.6680], [0.3750], [0.2500], [0.2500], [0.4004], [0.5000], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000926971435546875 loss: 0.000499725341796875loss: 0.00095367431640625 loss: 0.000873565673828125 predicted value: tensor([[0.5547], [1.0547], [0.4727], [0.5391], [0.8008], [0.7617], [0.6602], [0.5664], [0.3906], [0.5938], [0.7070], [0.6719], [0.5156], [0.7461], [0.4453], [0.2148]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.4668], [0.8008], [0.6016], [0.6016], [0.3750], [0.3340], [0.7500], [0.7500], [0.5000], [0.5000], [0.7500], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0022735595703125loss: 0.000934600830078125 loss: 0.000759124755859375 loss: 0.0009918212890625 predicted value: tensor([[0.8633], [0.4570], [0.4199], [0.4102], [0.2988], [0.4355], [0.8086], [1.0547], [0.2930], [0.5977], [0.5430], [0.3242], [0.5625], [0.3984], [0.1992], [0.2021]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.4668], [0.3750], [0.6016], [0.4668], [0.8008], [1.0000], [0.2500], [0.5000], [0.3340], [0.3340], [0.5000], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00091552734375 loss: 0.0025482177734375loss: 0.000823974609375 loss: 0.0037689208984375 predicted value: tensor([[0.2207], [0.5547], [0.4785], [0.8477], [0.7852], [0.4180], [0.2773], 
[0.7578], [0.7734], [0.5547], [0.2637], [0.2207], [0.0713], [0.3281], [0.1787], [0.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.5547], [0.4668], [0.8320], [0.8008], [0.4668], [0.2500], [0.7500], [0.8008], [0.5000], [0.0625], [0.2500], [0.0400], [0.3340], [0.1250], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0003223419189453125 loss: 0.000850677490234375 loss: 0.000888824462890625 loss: 0.000652313232421875 91%|█████████ | 448/492 [4:05:16<24:28, 33.37s/it] {'loss': 0.0047, 'learning_rate': 1e-05, 'epoch': 0.91} 91%|█████████ | 448/492 [4:05:16<24:28, 33.37s/it]predicted value: tensor([[0.3477], [0.7695], [0.4688], [0.7109], [0.9492], [0.6016], [0.1846], [0.9492], [0.6562], [0.9648], [0.2168], [0.1475], [0.2441], [0.1367], [0.1299], [0.1055]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.7148], [0.6016], [0.8008], [1.0000], [0.4668], [0.2500], [1.0000], [0.5000], [1.0000], [0.2500], [0.1670], [0.3340], [0.2500], [0.2500], [0.1250]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00164031982421875 loss: 0.0015106201171875 loss: 0.0020751953125 loss: 0.001434326171875 predicted value: tensor([[0.1357], [0.3750], [0.1494], [0.3594], [0.7461], [0.9805], [0.4121], [0.7734], [0.5000], [0.1768], [0.6250], [0.4531], [0.4141], [0.4160], [0.1465], [0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.4668], [0.2002], [0.4668], [0.8008], [1.0000], [0.4668], [0.8008], [0.5547], [0.2500], [0.6016], [0.5000], [0.4004], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00146484375 loss: 0.0007476806640625 loss: 0.00104522705078125 loss: 0.0024871826171875 predicted value: tensor([[0.4883], [0.9570], [0.9609], [0.0036], [0.3105], [0.1592], [0.4746], [0.5352], [0.7031], [0.7188], [0.6562], [0.5156], [0.3242], [0.3770], [0.1289], [0.1289]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], 
[1.0000], [1.0000], [0.0400], [0.3145], [0.2500], [0.4668], [0.4668], [0.8008], [0.8008], [0.8008], [0.3750], [0.3340], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0026092529296875 loss: 0.000972747802734375 loss: 0.001434326171875 loss: 0.00135040283203125 predicted value: tensor([[0.7734], [0.3965], [0.7578], [0.3203], [0.9766], [0.6992], [0.6133], [0.9688], [0.5977], [0.0500], [0.0325], [0.3359], [0.9531], [0.3418], [0.1650], [0.1494]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.4668], [0.6680], [0.3750], [1.0000], [0.8008], [0.8008], [1.0000], [0.7500], [0.0400], [0.0204], [0.5000], [1.0000], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00156402587890625 loss: 0.00150299072265625 loss: 0.00144195556640625 loss: 0.0021820068359375 91%|█████████▏| 449/492 [4:05:49<23:49, 33.24s/it] {'loss': 0.0064, 'learning_rate': 1e-05, 'epoch': 0.91} 91%|█████████▏| 449/492 [4:05:49<23:49, 33.24s/it]predicted value: tensor([[0.7734], [0.6367], [0.3613], [0.1709], [0.7305], [0.7812], [0.9570], [0.6172], [0.9570], [0.6797], [0.3711], [0.4219], [0.3906], [0.4629], [0.3359], [0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.6680], [0.4668], [0.2500], [0.8008], [0.8320], [1.0000], [0.8008], [1.0000], [0.8008], [0.4004], [0.4004], [0.4004], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00144195556640625 loss: 0.0018463134765625 loss: 0.00141143798828125 loss: 0.0019989013671875 predicted value: tensor([[0.9258], [0.3633], [0.3770], [0.4043], [0.1533], [0.6211], [0.9492], [0.6289], [0.5742], [0.6211], [0.3887], [0.3438], [0.3535], [0.1406], [0.1475], [0.1260]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.4668], [0.2002], [0.6016], [1.0000], [0.6680], [0.6016], [0.7500], [0.5000], [0.4004], [0.5000], [0.2500], [0.2500], [0.2002]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.00147247314453125 loss: 0.0018310546875loss: 0.0032501220703125 loss: 0.00135040283203125 predicted value: tensor([[0.9531], [0.9414], [0.3555], [0.2754], [0.2461], [0.4258], [0.4062], [0.1934], [0.6641], [0.5703], [0.9688], [0.5820], [0.1455], [0.3594], [0.1436], [0.1167]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.3750], [0.3340], [0.2002], [0.4668], [0.3750], [0.3340], [0.7500], [0.6016], [1.0000], [0.6016], [0.2002], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001953125 loss: 0.00091552734375loss: 0.000946044921875 loss: 0.0013275146484375 predicted value: tensor([[0.9453], [0.4297], [0.7344], [0.7109], [0.5625], [0.1475], [0.6914], [0.2168], [0.6562], [0.5898], [0.4512], [0.4609], [0.1465], [0.3906], [0.1138], [0.3711]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8320], [0.8008], [0.3750], [0.2500], [0.8320], [0.2500], [0.7500], [0.7500], [0.4668], [0.5000], [0.2002], [0.4004], [0.2500], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00156402587890625 loss: 0.00128936767578125 loss: 0.00157928466796875 loss: 0.0023040771484375 91%|█████████▏| 450/492 [4:06:22<23:13, 33.17s/it] {'loss': 0.0066, 'learning_rate': 1e-05, 'epoch': 0.91} 91%|█████████▏| 450/492 [4:06:22<23:13, 33.17s/it]predicted value: tensor([[0.4121], [0.5391], [0.9844], [0.6602], [0.4512], [0.9883], [0.6641], [0.2773], [0.6641], [0.8047], [0.2451], [0.6250], [0.6211], [0.2012], [0.2070], [0.2090]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [1.0000], [0.6016], [0.4668], [1.0000], [0.7500], [0.2500], [0.6016], [0.8008], [0.2500], [0.6016], [0.6016], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00106048583984375 loss: 0.00179290771484375 loss: 0.00034332275390625 loss: 0.00051116943359375 predicted value: tensor([[0.7852], [0.9961], [0.7773], 
[1.0000], [0.6406], [0.7539], [0.4609], [0.9883], [0.7852], [0.2520], [0.3750], [0.4102], [0.4785], [0.3770], [0.2021], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [0.8008], [1.0000], [0.4668], [0.8008], [0.4668], [1.0000], [0.8008], [0.2500], [0.3340], [0.4004], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00067901611328125loss: 0.001617431640625 loss: 0.000606536865234375 loss: 0.00174713134765625 predicted value: tensor([[0.3535], [0.9961], [0.2930], [0.6836], [0.4414], [0.2363], [0.7852], [0.2773], [0.4590], [0.4980], [0.4199], [0.9688], [0.2793], [0.4043], [0.1846], [0.1650]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [1.0000], [0.3340], [0.7500], [0.4668], [0.3340], [0.8008], [0.3340], [0.4668], [0.5000], [0.4004], [1.0000], [0.0400], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016326904296875 loss: 0.001251220703125loss: 0.00131988525390625 loss: 0.000507354736328125 predicted value: tensor([[0.9727], [0.2393], [0.7148], [0.2734], [0.4414], [0.9766], [0.3613], [0.7227], [0.1982], [0.3945], [0.5391], [0.6094], [0.0542], [0.4355], [0.2051], [0.2480]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2500], [0.6680], [0.3340], [0.4668], [1.0000], [0.3145], [0.6680], [0.2002], [0.3750], [0.4668], [0.6016], [0.0625], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000339508056640625 loss: 0.0027008056640625 loss: 0.0003452301025390625 loss: 0.0003757476806640625 92%|█████████▏| 451/492 [4:06:55<22:35, 33.07s/it] {'loss': 0.0042, 'learning_rate': 1e-05, 'epoch': 0.92} 92%|█████████▏| 451/492 [4:06:55<22:35, 33.07s/it]predicted value: tensor([[0.5703], [0.8008], [0.3965], [0.3066], [1.0312], [0.3359], [0.6055], [0.4336], [0.7617], [0.8047], [0.4121], [0.3828], [0.4609], [0.2275], [0.2402], [0.2412]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.8008], [0.3750], [0.2500], [1.0000], [0.3340], [0.5547], [0.4004], [0.8008], [0.8008], [0.5000], [0.2500], [0.4004], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00157928466796875 loss: 0.000888824462890625 loss: 0.0011444091796875 loss: 0.00157928466796875 predicted value: tensor([[0.4766], [0.4414], [0.2734], [0.4805], [1.0312], [0.3711], [0.3457], [1.0312], [1.0234], [1.0078], [1.0469], [0.4570], [0.4824], [1.0312], [0.2578], [0.2461]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.3750], [0.2500], [0.4668], [1.0000], [0.2002], [0.2500], [1.0000], [1.0000], [1.0000], [1.0000], [0.4004], [0.5000], [1.0000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002197265625 loss: 0.0010833740234375loss: 0.001708984375 loss: 0.00049591064453125 predicted value: tensor([[0.6445], [0.3164], [0.8164], [0.9961], [0.7656], [0.5781], [1.0234], [0.6328], [0.6602], [0.7539], [0.3809], [0.3633], [0.4570], [0.4707], [0.2256], [0.2393]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.8008], [1.0000], [0.5703], [0.5547], [1.0000], [0.6016], [0.6016], [0.6680], [0.4004], [0.4004], [0.3340], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000797271728515625 loss: 0.0017547607421875 loss: 0.00162506103515625 loss: 0.00128173828125 predicted value: tensor([[0.5977], [0.5664], [0.4688], [0.4844], [1.0312], [1.0312], [1.0469], [0.2969], [0.6914], [0.8359], [0.4531], [0.3965], [0.7305], [0.4102], [0.2715], [0.3848]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.3750], [0.4668], [1.0000], [1.0000], [1.0000], [0.2500], [0.5000], [0.8320], [0.3340], [0.3340], [0.7500], [0.3340], [0.2500], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00244140625 loss: 0.00144195556640625 loss: 0.00164794921875 loss: 
0.003143310546875 92%|█████████▏| 452/492 [4:07:28<22:05, 33.14s/it] {'loss': 0.0062, 'learning_rate': 1e-05, 'epoch': 0.92} 92%|█████████▏| 452/492 [4:07:28<22:05, 33.14s/it]predicted value: tensor([[0.6172], [0.2734], [1.0078], [1.0234], [1.0156], [0.2871], [0.5352], [0.8086], [0.6875], [0.5898], [0.5977], [0.3574], [0.4043], [0.4023], [0.2344], [0.2285]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.2500], [1.0000], [1.0000], [1.0000], [0.2500], [0.5000], [0.8008], [0.6680], [0.5000], [0.7500], [0.3340], [0.3340], [0.2852], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00102996826171875 loss: 0.0007781982421875 loss: 0.00092315673828125 loss: 0.001983642578125 predicted value: tensor([[1.0078], [0.3105], [0.2520], [0.5078], [0.6445], [0.2930], [0.4688], [0.2773], [0.6797], [1.0234], [0.4082], [0.4004], [0.3809], [0.4277], [0.2471], [0.2432]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [0.2500], [0.3750], [0.7500], [0.2500], [0.4668], [0.2500], [0.7500], [1.0000], [0.4004], [0.3340], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000812530517578125 loss: 0.00113677978515625loss: 0.00140380859375 loss: 0.000667572021484375 predicted value: tensor([[0.5664], [0.7891], [0.2949], [1.0156], [0.5273], [0.9961], [0.8594], [0.3340], [0.6211], [0.5820], [0.6445], [0.4258], [0.4434], [0.3984], [0.2266], [0.2266]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.2500], [1.0000], [0.4277], [1.0000], [0.8320], [0.3340], [0.6016], [0.7500], [0.6016], [0.5000], [0.4004], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000881195068359375 loss: 0.000331878662109375 loss: 0.00177764892578125 loss: 0.00083160400390625 predicted value: tensor([[0.4746], [0.4512], [0.4531], [0.7852], [0.5547], [0.3281], [0.4512], [0.7422], [0.2930], [0.4805], [0.4316], [0.2402], 
[0.2832], [0.3750], [0.2217], [0.2334]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [0.8008], [0.4668], [0.2500], [0.4668], [0.6680], [0.2500], [0.6016], [0.5000], [0.2500], [0.2500], [0.3340], [0.1670], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009613037109375 loss: 0.00201416015625 loss: 0.00128173828125 loss: 0.0004215240478515625 92%|█████████▏| 453/492 [4:08:02<21:41, 33.38s/it] {'loss': 0.0043, 'learning_rate': 1e-05, 'epoch': 0.92} 92%|█████████▏| 453/492 [4:08:02<21:41, 33.38s/it]predicted value: tensor([[0.4785], [0.7344], [0.6055], [0.3145], [0.3613], [0.5156], [0.6523], [0.2002], [0.9375], [0.9570], [0.4707], [0.1660], [0.5742], [0.1855], [0.1641], [0.3418]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.8008], [0.8008], [0.3750], [0.4668], [0.5000], [0.8320], [0.2500], [1.0000], [1.0000], [0.4668], [0.2500], [0.6016], [0.2500], [0.2002], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015716552734375 loss: 0.00506591796875 loss: 0.00213623046875 loss: 0.0057373046875 predicted value: tensor([[0.5273], [0.4648], [0.3984], [0.9336], [0.3867], [0.3906], [0.6523], [0.7031], [0.6094], [0.3555], [0.6523], [0.5469], [0.1201], [0.1592], [0.1533], [0.1543]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.5547], [0.4668], [1.0000], [0.3750], [0.5000], [0.8008], [0.8008], [0.7500], [0.3750], [0.8008], [0.7500], [0.2500], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020751953125 loss: 0.0029449462890625loss: 0.0023345947265625 loss: 0.00115203857421875 predicted value: tensor([[0.9531], [0.6719], [0.3438], [0.9375], [0.9297], [0.3691], [0.5625], [0.3828], [0.6211], [0.5938], [0.3398], [0.3496], [0.3203], [0.3496], [0.2637], [0.1572]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.4668], [1.0000], [1.0000], [0.4668], [0.3750], 
[0.4668], [0.8008], [0.7500], [0.4004], [0.5000], [0.4004], [0.5000], [0.3340], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018463134765625 loss: 0.002532958984375 loss: 0.0033111572265625 loss: 0.002197265625 predicted value: tensor([[0.3438], [0.3750], [0.6016], [0.3867], [0.6758], [0.9492], [0.6562], [0.9336], [0.4141], [0.2715], [0.1973], [0.3633], [0.2217], [0.3438], [0.1650], [0.1348]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.6680], [0.4668], [0.7500], [1.0000], [0.8008], [1.0000], [0.4668], [0.3340], [0.2500], [0.5000], [0.2500], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002227783203125 loss: 0.0028839111328125 loss: 0.0036163330078125 loss: 0.00156402587890625 92%|█████████▏| 454/492 [4:08:36<21:09, 33.42s/it] {'loss': 0.0108, 'learning_rate': 1e-05, 'epoch': 0.92} 92%|█████████▏| 454/492 [4:08:36<21:09, 33.42s/it]predicted value: tensor([[0.5703], [0.3574], [0.1875], [0.5898], [0.9336], [0.6445], [0.1992], [0.5664], [0.5742], [0.5664], [0.2832], [0.2910], [0.2754], [0.2598], [0.1631], [0.1406]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6680], [0.3750], [0.2500], [0.6680], [1.0000], [0.6680], [0.2500], [0.8008], [0.6016], [0.7500], [0.3340], [0.4004], [0.3340], [0.3340], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0009918212890625 loss: 0.00191497802734375loss: 0.00250244140625 loss: 0.0025177001953125 predicted value: tensor([[0.3633], [0.3691], [0.7539], [0.4648], [0.2178], [0.9375], [0.9570], [0.6367], [0.4102], [0.3379], [0.2715], [0.5508], [0.9414], [0.5234], [0.1543], [0.1387]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.8320], [0.5547], [0.2500], [1.0000], [1.0000], [0.6680], [0.5000], [0.4004], [0.2500], [0.7500], [1.0000], [0.6016], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001373291015625 loss: 0.00177764892578125 
loss: 0.0031585693359375 loss: 0.006011962890625 predicted value: tensor([[0.4375], [0.6797], [0.5312], [0.3848], [0.5703], [0.3555], [0.4883], [0.5312], [0.4316], [0.1738], [0.5469], [0.3301], [0.2402], [0.2695], [0.1245], [0.1494]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.8008], [0.6016], [0.4668], [0.6680], [0.3750], [0.4668], [0.8008], [0.6016], [0.2002], [0.6016], [0.4004], [0.3340], [0.4004], [0.1426], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00167083740234375 loss: 0.0027618408203125 loss: 0.0047607421875 loss: 0.001861572265625 predicted value: tensor([[0.6250], [0.3457], [0.3691], [0.2793], [0.6914], [0.6797], [0.4512], [0.4062], [0.9453], [0.2002], [0.9336], [0.2051], [0.3066], [0.1445], [0.3008], [0.1475]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.4668], [0.4668], [0.6016], [0.8320], [0.8008], [0.6016], [0.5000], [1.0000], [0.2500], [1.0000], [0.2002], [0.3340], [0.2500], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027618408203125 loss: 0.003753662109375 loss: 0.0022125244140625 loss: 0.004180908203125 92%|█████████▏| 455/492 [4:09:09<20:32, 33.30s/it] {'loss': 0.0111, 'learning_rate': 1e-05, 'epoch': 0.92} 92%|█████████▏| 455/492 [4:09:09<20:32, 33.30s/it]predicted value: tensor([[0.4414], [0.5273], [0.5117], [0.5859], [0.9883], [0.6797], [0.7383], [0.4551], [0.9727], [0.3438], [0.5273], [0.3867], [0.3301], [0.2256], [0.2041], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.6016], [0.6016], [1.0000], [0.6680], [0.8008], [0.5000], [1.0000], [0.5000], [0.6016], [0.5000], [0.5000], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00106048583984375 loss: 0.001434326171875 loss: 0.0012969970703125 loss: 0.0014495849609375 predicted value: tensor([[0.3984], [0.2412], [0.4238], [0.9805], [0.2754], [0.9922], [0.6055], [0.5781], [0.3477], [0.3594], 
[0.1895], [0.3652], [0.1807], [0.1875], [0.2041], [0.2012]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2002], [0.4668], [1.0000], [0.3340], [1.0000], [0.7500], [0.8008], [0.3340], [0.4004], [0.2500], [0.4004], [0.2002], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001861572265625 loss: 0.0020599365234375 loss: 0.00144195556640625 loss: 0.00147247314453125 predicted value: tensor([[0.4062], [0.2715], [0.4355], [0.5703], [0.5625], [0.2246], [0.4746], [0.5078], [0.9922], [0.4473], [0.2773], [0.3613], [0.3457], [0.3613], [0.1865], [0.1807]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2500], [0.4668], [0.5000], [0.8008], [0.2500], [0.5000], [0.8008], [1.0000], [0.5000], [0.3340], [0.3340], [0.3340], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001617431640625 loss: 0.0025177001953125 loss: 0.002532958984375 loss: 0.001373291015625 predicted value: tensor([[0.7500], [0.5117], [0.6797], [0.4492], [0.9570], [0.7383], [0.9883], [0.3652], [0.5000], [0.5000], [0.4570], [0.0361], [0.3770], [0.1885], [0.1973], [0.2197]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.5547], [0.6680], [0.4668], [1.0000], [0.8008], [1.0000], [0.3750], [0.7500], [0.6016], [0.4277], [0.0278], [0.3340], [0.1670], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00136566162109375 loss: 0.00144195556640625 loss: 0.0006103515625 loss: 0.0003795623779296875 93%|█████████▎| 456/492 [4:09:42<19:52, 33.13s/it] {'loss': 0.006, 'learning_rate': 1e-05, 'epoch': 0.93} 93%|█████████▎| 456/492 [4:09:42<19:52, 33.13s/it]predicted value: tensor([[1.1016], [0.8281], [1.0859], [0.8398], [1.0859], [0.3906], [0.5430], [0.7148], [0.4863], [1.0938], [0.3672], [0.4961], [0.4238], [0.3008], [0.4609], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [1.0000], [0.8008], 
[1.0000], [0.2500], [0.5000], [0.6016], [0.4004], [1.0000], [0.3340], [0.5000], [0.2500], [0.1670], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001708984375 loss: 0.0025482177734375loss: 0.0028076171875 loss: 0.0022125244140625 predicted value: tensor([[0.5156], [0.7266], [0.9180], [0.5078], [1.0938], [1.0703], [0.6953], [0.3457], [0.2500], [0.3613], [0.7500], [0.5430], [0.4375], [0.4512], [0.2988], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.6680], [0.8555], [0.3750], [1.0000], [1.0000], [0.7500], [0.2002], [0.0400], [0.2002], [0.7500], [0.6016], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0021209716796875 loss: 0.00286865234375 loss: 0.00148773193359375 loss: 0.00157928466796875 predicted value: tensor([[0.6445], [0.8672], [1.0703], [0.8242], [0.7031], [0.5195], [1.0938], [0.6758], [1.0547], [0.4180], [0.4453], [0.6641], [0.4004], [0.4375], [0.2617], [0.3281]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.8320], [1.0000], [0.8008], [0.5000], [0.4668], [1.0000], [0.5000], [1.0000], [0.6016], [0.3340], [0.7500], [0.5000], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0015106201171875 loss: 0.00274658203125 loss: 0.002716064453125 loss: 0.00130462646484375 predicted value: tensor([[0.5391], [0.6250], [0.9023], [0.8164], [1.0781], [0.5195], [0.5078], [0.6055], [0.7773], [0.4004], [0.3652], [0.3594], [0.2871], [0.6719], [0.3086], [0.2949]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.8320], [0.8008], [1.0000], [0.4668], [0.3750], [0.5000], [0.6680], [0.5000], [0.2500], [0.3340], [0.2002], [0.6016], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00147247314453125 loss: 0.0029754638671875 loss: 0.00250244140625 loss: 0.001800537109375 93%|█████████▎| 457/492 [4:10:15<19:24, 33.27s/it] {'loss': 0.0086, 
'learning_rate': 1e-05, 'epoch': 0.93} 93%|█████████▎| 457/492 [4:10:15<19:24, 33.27s/it]predicted value: tensor([[0.5938], [0.9180], [0.9570], [0.7969], [0.5586], [0.7891], [0.3945], [1.1328], [0.8633], [0.7852], [0.4492], [0.4961], [0.4590], [0.3145], [0.3066], [0.3203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8320], [0.8320], [0.6680], [0.4668], [0.7500], [0.2500], [1.0000], [0.6680], [0.7500], [0.6016], [0.4004], [0.4004], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0036163330078125 loss: 0.0038909912109375loss: 0.00457763671875 loss: 0.002838134765625 predicted value: tensor([[0.5820], [0.8359], [0.5508], [0.6914], [0.8555], [0.8320], [0.7617], [0.7461], [0.7461], [0.3398], [0.7539], [0.5469], [0.4219], [0.3281], [0.2988], [0.3398]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.4668], [0.7500], [0.6680], [0.6680], [0.7500], [0.8008], [0.6016], [0.2002], [0.6016], [0.5000], [0.3340], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004058837890625 loss: 0.00262451171875 loss: 0.0030364990234375 loss: 0.0030670166015625 predicted value: tensor([[0.6367], [0.5586], [0.8750], [0.5508], [0.5625], [0.8828], [0.5820], [0.8750], [0.2422], [1.0859], [1.0938], [0.5078], [0.5391], [0.3379], [0.3340], [0.3203]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.8320], [0.3750], [0.4668], [0.8008], [0.4668], [0.8008], [0.0278], [1.0000], [1.0000], [0.5000], [0.5000], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00482177734375 loss: 0.0025787353515625 loss: 0.0033111572265625 loss: 0.002410888671875 predicted value: tensor([[0.5586], [0.5742], [0.6484], [0.6602], [0.6367], [0.6523], [1.1250], [0.5938], [0.8320], [1.1328], [0.3379], [0.5859], [0.4707], [0.3281], [0.3242], [0.3320]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.4668], [0.4668], [0.5547], [0.5547], [0.4648], [0.8008], [1.0000], [0.8008], [0.6680], [1.0000], [0.4004], [0.4668], [0.3340], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005615234375 loss: 0.00421142578125 loss: 0.0038299560546875 loss: 0.004364013671875 93%|█████████▎| 458/492 [4:10:48<18:48, 33.18s/it] {'loss': 0.0147, 'learning_rate': 1e-05, 'epoch': 0.93} 93%|█████████▎| 458/492 [4:10:48<18:48, 33.18s/it]predicted value: tensor([[0.6406], [0.5195], [0.5352], [0.6836], [0.6133], [0.5039], [0.5156], [0.6367], [0.7109], [0.6914], [0.4395], [0.4824], [0.4180], [0.3027], [0.2930], [0.3008]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.4668], [0.5703], [0.5547], [0.3750], [0.4668], [0.5000], [0.6016], [0.6016], [0.4004], [0.5000], [0.3340], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023956298828125 loss: 0.0026092529296875 loss: 0.00213623046875 loss: 0.003173828125 predicted value: tensor([[0.5625], [1.0781], [0.7930], [0.8398], [1.0859], [1.0938], [0.4492], [0.3223], [0.3359], [0.4863], [0.4766], [1.1016], [0.4980], [0.4570], [0.2832], [0.2949]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.6680], [0.8008], [1.0000], [1.0000], [0.3340], [0.2002], [0.2500], [0.5000], [0.5000], [1.0000], [0.4004], [0.5000], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00164794921875 loss: 0.0019073486328125loss: 0.0020599365234375 loss: 0.00146484375 predicted value: tensor([[0.8242], [0.5703], [1.0781], [0.2773], [0.3223], [0.5625], [0.5000], [1.0938], [1.0859], [0.3320], [1.0859], [0.4746], [0.4395], [0.2988], [0.3184], [0.2949]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.5547], [1.0000], [0.1113], [0.2500], [0.5000], [0.6016], [1.0000], [1.0000], [0.2500], [1.0000], [0.3340], [0.3340], [0.2002], [0.2002], [0.1670]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.00188446044921875 loss: 0.0024261474609375loss: 0.0021514892578125 loss: 0.002532958984375 predicted value: tensor([[0.3438], [0.7188], [1.0938], [0.8750], [0.5469], [0.3789], [0.7305], [0.8789], [0.3809], [0.6094], [0.4922], [0.3965], [0.3047], [0.4648], [0.2988], [0.2832]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.8320], [1.0000], [0.8008], [0.4668], [0.2500], [0.7500], [0.8008], [0.5000], [0.6016], [0.2500], [0.5000], [0.2002], [0.5000], [0.2002], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037994384765625 loss: 0.0031585693359375 loss: 0.002197265625 loss: 0.00360107421875 93%|█████████▎| 459/492 [4:11:22<18:16, 33.23s/it] {'loss': 0.0098, 'learning_rate': 1e-05, 'epoch': 0.93} 93%|█████████▎| 459/492 [4:11:22<18:16, 33.23s/it]predicted value: tensor([[0.4258], [0.4336], [0.6914], [0.8086], [1.0078], [0.3105], [0.2715], [1.0000], [0.2559], [0.3906], [0.3926], [0.2158], [0.2676], [0.4121], [0.3867], [0.2207]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.8008], [0.8320], [1.0000], [0.3340], [0.3340], [1.0000], [0.2002], [0.4004], [0.5000], [0.1670], [0.2500], [0.5000], [0.5000], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000629425048828125 loss: 0.0020294189453125 loss: 0.0011138916015625 loss: 0.0010833740234375 predicted value: tensor([[0.5391], [0.4375], [0.9805], [0.9961], [0.8086], [0.4395], [0.4688], [1.0156], [0.5664], [0.4277], [0.4336], [0.3789], [0.4102], [0.3457], [0.2314], [0.2070]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [1.0000], [1.0000], [0.8320], [0.4668], [0.4668], [1.0000], [0.6016], [0.6016], [0.3750], [0.4004], [0.4004], [0.2852], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000850677490234375 loss: 0.00115966796875 loss: 0.000713348388671875 loss: 0.0008087158203125 predicted value: tensor([[0.2852], [0.8320], [0.9922], 
[1.0078], [0.4375], [0.7734], [0.2422], [0.4199], [0.7227], [0.7344], [0.9961], [0.6016], [0.0544], [0.3848], [0.2139], [0.2119]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.8320], [1.0000], [1.0000], [0.4668], [0.8008], [0.2500], [0.4668], [0.5703], [0.4668], [1.0000], [0.6016], [0.0278], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0003604888916015625 loss: 0.0015869140625loss: 0.002593994140625 loss: 0.000751495361328125 predicted value: tensor([[0.7734], [0.4473], [0.9805], [0.4297], [0.3848], [0.2363], [0.6016], [0.6562], [0.4043], [0.3047], [0.5234], [0.2373], [0.3613], [0.2266], [0.1963], [0.1953]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [1.0000], [0.4668], [0.3750], [0.2500], [0.6016], [0.7500], [0.3750], [0.3340], [0.7500], [0.2002], [0.3340], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00116729736328125 loss: 0.00183868408203125loss: 0.0013427734375 loss: 0.000957489013671875 93%|█████████▎| 460/492 [4:11:55<17:45, 33.30s/it] {'loss': 0.0047, 'learning_rate': 1e-05, 'epoch': 0.93} 93%|█████████▎| 460/492 [4:11:55<17:45, 33.30s/it]predicted value: tensor([[0.5547], [0.9844], [0.6992], [0.4082], [0.6484], [0.3984], [0.2197], [0.4609], [0.2734], [0.3887], [0.9805], [0.3379], [0.3496], [0.3652], [0.4590], [0.1982]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.7500], [0.3750], [0.6680], [0.5000], [0.2002], [0.4668], [0.2500], [0.4004], [1.0000], [0.4004], [0.3340], [0.5000], [0.7500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000583648681640625 loss: 0.000873565673828125 loss: 0.0019683837890625 loss: 0.00188446044921875 predicted value: tensor([[0.4336], [0.7734], [0.4219], [0.9727], [0.9883], [0.5469], [0.2197], [0.6953], [0.2139], [0.4551], [0.3789], [0.4863], [0.3887], [0.3438], [0.2061], [0.1992]], device='cuda:0', dtype=torch.bfloat16, 
grad_fn=) value label: tensor([[0.4668], [0.8008], [0.4668], [1.0000], [1.0000], [0.6016], [0.2500], [0.5703], [0.2002], [0.6016], [0.5000], [0.6016], [0.4004], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00133514404296875 loss: 0.0009002685546875 loss: 0.00127410888671875 loss: 0.001922607421875 predicted value: tensor([[0.4238], [0.9844], [0.5039], [0.6914], [0.4473], [0.8242], [0.7070], [0.6250], [0.5273], [0.2578], [0.3770], [0.7266], [0.4043], [0.1865], [0.1904], [0.2090]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.5547], [0.6680], [0.4668], [0.8320], [0.8008], [0.6016], [0.7500], [0.2500], [0.6016], [0.7500], [0.4004], [0.1670], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00128936767578125 loss: 0.0018463134765625loss: 0.0016021728515625 loss: 0.0021820068359375 predicted value: tensor([[0.5391], [0.7109], [0.8242], [0.4316], [0.9883], [0.2334], [0.3984], [0.3281], [0.2539], [0.4004], [0.2734], [0.9727], [0.3105], [0.3379], [0.1836], [0.1787]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.8320], [0.4668], [1.0000], [0.1670], [0.3750], [0.2500], [0.2500], [0.4004], [0.2002], [1.0000], [0.4004], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00164794921875 loss: 0.000865936279296875 loss: 0.0008544921875 loss: 0.00061798095703125 94%|█████████▎| 461/492 [4:12:29<17:19, 33.52s/it] {'loss': 0.0054, 'learning_rate': 1e-05, 'epoch': 0.94} 94%|█████████▎| 461/492 [4:12:29<17:19, 33.52s/it]predicted value: tensor([[0.5664], [0.4219], [0.2930], [0.2490], [0.8125], [0.9883], [0.3828], [0.4844], [0.6367], [0.5664], [0.2539], [0.4160], [0.3320], [0.2910], [0.2051], [0.2314]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.3340], [0.2500], [0.8008], [1.0000], [0.2500], [0.4668], [0.6016], [0.6016], [0.0400], [0.5000], [0.3340], [0.0400], 
[0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0013275146484375 loss: 0.0017242431640625 loss: 0.002197265625 loss: 0.00153350830078125 predicted value: tensor([[1.0000], [0.5547], [0.4551], [0.8008], [0.8867], [0.3516], [0.5938], [0.7656], [0.6797], [0.5859], [1.0078], [0.6133], [0.4512], [0.2334], [0.2490], [0.2432]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.5547], [0.3750], [0.8008], [0.8320], [0.2500], [0.4668], [0.7500], [0.5547], [0.1670], [1.0000], [0.6016], [0.5000], [0.2002], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00055694580078125 loss: 0.003692626953125loss: 0.0023651123046875 loss: 0.0021514892578125 predicted value: tensor([[1.0391], [0.4551], [0.9961], [1.0156], [0.4824], [0.4141], [0.3223], [0.7070], [0.4082], [0.4863], [0.6055], [0.7266], [0.4297], [0.2559], [0.2246], [0.2412]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [1.0000], [1.0000], [0.4668], [0.6016], [0.2500], [0.6016], [0.3340], [0.4668], [0.5703], [0.7500], [0.4004], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000453948974609375 loss: 0.00372314453125 loss: 0.001129150390625 loss: 0.000804901123046875 predicted value: tensor([[0.7734], [1.0000], [0.4414], [0.5000], [0.2930], [1.0234], [0.5938], [0.5000], [0.1328], [0.4785], [0.0292], [1.0078], [0.2090], [0.3730], [0.2168], [0.4160]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [1.0000], [0.4668], [0.4668], [0.2500], [1.0000], [0.6016], [0.3750], [0.0278], [0.2500], [0.0625], [1.0000], [0.0204], [0.2852], [0.2500], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000782012939453125 loss: 0.000782012939453125 loss: 0.002197265625 loss: 0.00213623046875 94%|█████████▍| 462/492 [4:13:02<16:42, 33.42s/it] {'loss': 0.0069, 'learning_rate': 1e-05, 'epoch': 0.94} 94%|█████████▍| 462/492 [4:13:02<16:42, 33.42s/it]predicted value: 
tensor([[1.0156], [0.8125], [0.4199], [0.4453], [0.8125], [0.4121], [0.7461], [0.9922], [0.9844], [0.6680], [1.0078], [0.4062], [0.4707], [0.2090], [0.2236], [0.2188]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.3750], [0.4668], [0.8008], [0.4668], [0.7500], [1.0000], [1.0000], [0.5000], [1.0000], [0.3340], [0.5000], [0.2002], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00061798095703125 loss: 0.000843048095703125loss: 0.00063323974609375 loss: 0.000911712646484375 predicted value: tensor([[0.4863], [0.7148], [0.5352], [0.9844], [0.4434], [0.7148], [0.7070], [0.4336], [0.2100], [0.7734], [0.2871], [0.3027], [0.3652], [0.4258], [0.2119], [0.2002]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.6680], [0.5547], [1.0000], [0.3750], [0.7500], [0.6016], [0.4668], [0.2500], [0.8008], [0.2500], [0.2002], [0.4004], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0004425048828125 loss: 0.00077056884765625 loss: 0.000682830810546875 loss: 0.00072479248046875 predicted value: tensor([[0.4121], [0.4453], [1.0000], [0.9805], [0.7266], [1.0078], [0.2676], [0.7578], [0.2949], [0.6367], [0.4766], [0.6094], [0.4375], [0.4004], [0.1855], [0.2061]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [1.0000], [1.0000], [0.6250], [1.0000], [0.2500], [0.8008], [0.3340], [0.5000], [0.4668], [0.6016], [0.6680], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000823974609375 loss: 0.0014801025390625 loss: 0.0002536773681640625 loss: 0.0020599365234375 predicted value: tensor([[0.4219], [1.0000], [0.8281], [0.7031], [0.8047], [1.0156], [0.5625], [0.2852], [0.4297], [0.6680], [0.4180], [0.5469], [0.6797], [0.4023], [0.4492], [0.1846]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3145], [1.0000], [0.8320], [0.7500], [0.8008], [1.0000], [0.4668], [0.3340], 
[0.3750], [0.6016], [0.4004], [0.6680], [0.7500], [0.5000], [0.4004], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000629425048828125 loss: 0.000804901123046875 loss: 0.00107574462890625 loss: 0.0009918212890625 94%|█████████▍| 463/492 [4:13:36<16:09, 33.44s/it] {'loss': 0.0034, 'learning_rate': 1e-05, 'epoch': 0.94} 94%|█████████▍| 463/492 [4:13:36<16:09, 33.44s/it]predicted value: tensor([[0.4629], [0.3398], [0.9141], [0.3418], [0.7109], [0.5430], [0.9219], [0.1855], [0.6094], [0.5859], [0.9258], [0.3535], [0.3574], [0.4141], [0.3145], [0.1348]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [1.0000], [0.4668], [0.8008], [0.5000], [1.0000], [0.2500], [0.6680], [0.6016], [1.0000], [0.3340], [0.5000], [0.6016], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002685546875 loss: 0.00167083740234375 loss: 0.00191497802734375 loss: 0.001800537109375 predicted value: tensor([[0.7148], [0.1445], [0.7227], [0.3477], [0.3555], [0.5625], [0.6602], [0.3535], [0.9375], [0.1367], [0.6875], [0.6133], [0.4355], [0.3418], [0.3848], [0.1157]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.2500], [0.8008], [0.4668], [0.4668], [0.6680], [0.7500], [0.4668], [1.0000], [0.2500], [0.8008], [0.6680], [0.5000], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0022430419921875 loss: 0.0022430419921875 loss: 0.002685546875 loss: 0.00360107421875 predicted value: tensor([[0.4336], [0.4727], [0.2354], [0.9297], [0.6211], [0.6797], [0.8984], [0.2441], [0.5234], [0.9180], [0.5352], [0.5586], [0.1680], [0.3828], [0.1338], [0.1167]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.3340], [1.0000], [0.8008], [0.8008], [1.0000], [0.2500], [0.6016], [1.0000], [0.6016], [0.5000], [0.2500], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002227783203125 loss: 0.00162506103515625 loss: 
0.00201416015625 loss: 0.00286865234375 predicted value: tensor([[0.9062], [0.3457], [0.2080], [0.6328], [0.3809], [0.9219], [0.6602], [0.7031], [0.9297], [0.1758], [0.4453], [0.9336], [0.3027], [0.3164], [0.1484], [0.1357]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.3340], [0.8008], [0.3145], [1.0000], [0.5547], [0.8555], [1.0000], [0.2500], [0.5000], [1.0000], [0.3340], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025177001953125 loss: 0.0019683837890625 loss: 0.0021514892578125 loss: 0.0017852783203125 94%|█████████▍| 464/492 [4:14:09<15:32, 33.31s/it] {'loss': 0.009, 'learning_rate': 1e-05, 'epoch': 0.94} 94%|█████████▍| 464/492 [4:14:09<15:32, 33.31s/it]predicted value: tensor([[0.4609], [0.3203], [0.5469], [0.1973], [0.4141], [0.3672], [0.9258], [0.5625], [0.3223], [0.4746], [0.6133], [0.1226], [0.3691], [0.1396], [0.1377], [0.1118]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.4668], [0.2500], [0.5547], [0.4668], [1.0000], [0.6016], [0.4004], [0.6016], [0.7500], [0.1670], [0.4004], [0.2500], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030517578125 loss: 0.0038299560546875 loss: 0.002197265625 loss: 0.00372314453125 predicted value: tensor([[0.7617], [0.5820], [0.9219], [0.6992], [0.6523], [0.9258], [0.9219], [0.9102], [0.6602], [0.9375], [0.5469], [0.1021], [0.9297], [0.3652], [0.1572], [0.1245]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.6172], [1.0000], [0.8008], [0.6680], [1.0000], [1.0000], [1.0000], [0.7500], [1.0000], [0.6016], [0.2500], [1.0000], [0.5000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020751953125 loss: 0.002685546875 loss: 0.002044677734375 loss: 0.0029754638671875 predicted value: tensor([[0.4023], [0.4277], [0.6641], [0.3340], [0.3887], [0.2207], [0.3691], [0.2891], [0.1514], [0.6094], [0.4004], [0.2676], 
[0.2949], [0.1006], [0.1299], [0.1768]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.8008], [0.4668], [0.4668], [0.2002], [0.4648], [0.3750], [0.2500], [0.6016], [0.5000], [0.2852], [0.3340], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00177001953125 loss: 0.00238037109375 loss: 0.002197265625 loss: 0.002227783203125 predicted value: tensor([[ 0.6719], [ 0.2930], [ 0.3516], [ 0.7148], [ 0.6680], [ 0.1445], [ 0.9297], [ 0.9453], [ 0.9258], [ 0.1641], [ 0.3906], [-0.0273], [ 0.3672], [ 0.1299], [ 0.1465], [ 0.1084]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [0.4668], [0.4668], [0.8008], [0.5000], [0.2500], [1.0000], [1.0000], [1.0000], [0.2500], [0.4004], [0.0625], [0.4004], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0052490234375 loss: 0.0027313232421875 loss: 0.0030059814453125 loss: 0.0021514892578125 95%|█████████▍| 465/492 [4:14:42<15:02, 33.41s/it] {'loss': 0.0111, 'learning_rate': 1e-05, 'epoch': 0.95} 95%|█████████▍| 465/492 [4:14:42<15:02, 33.41s/it]predicted value: tensor([[0.8477], [0.7891], [0.1670], [0.3340], [0.3770], [0.2217], [0.3750], [0.2461], [0.3555], [0.3750], [0.2334], [0.3867], [0.4121], [0.3730], [0.2051], [0.1660]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [0.8320], [0.3340], [0.4668], [0.4668], [0.3340], [0.4668], [0.2002], [0.4668], [0.4668], [0.2500], [0.4004], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011749267578125 loss: 0.0019073486328125loss: 0.0017242431640625 loss: 0.002593994140625 predicted value: tensor([[0.9727], [0.3809], [0.1562], [0.3887], [0.6094], [0.3965], [0.3457], [0.9844], [0.6172], [0.5898], [0.4004], [0.0049], [0.4180], [0.4941], [0.1855], [0.1631]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.2500], [0.4668], [0.5547], [0.4668], 
[0.4668], [1.0000], [0.5000], [0.6016], [0.4004], [0.0278], [0.3340], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00104522705078125 loss: 0.0010528564453125loss: 0.00055694580078125 loss: 0.001739501953125 predicted value: tensor([[0.6523], [0.9844], [0.3418], [0.9609], [0.3574], [0.9727], [0.3477], [0.9805], [0.2070], [0.6680], [0.6328], [0.4102], [0.7422], [0.4453], [0.2002], [0.1885]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [1.0000], [0.4668], [1.0000], [0.4668], [1.0000], [0.4668], [1.0000], [0.3340], [0.6016], [0.5000], [0.3340], [0.4668], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000698089599609375 loss: 0.00274658203125 loss: 0.000972747802734375 loss: 0.000392913818359375 predicted value: tensor([[0.9688], [0.5664], [0.4766], [0.3086], [0.7852], [0.9844], [0.3711], [0.7891], [0.3633], [0.5273], [0.3477], [0.4180], [0.3867], [0.3574], [0.1963], [0.1670]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], [0.5547], [0.3750], [0.8320], [1.0000], [0.4668], [0.8320], [0.3750], [0.5000], [0.3750], [0.3340], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0011444091796875 loss: 0.003692626953125 loss: 0.000965118408203125 loss: 0.0009307861328125 95%|█████████▍| 466/492 [4:15:16<14:31, 33.53s/it] {'loss': 0.0058, 'learning_rate': 1e-05, 'epoch': 0.95} 95%|█████████▍| 466/492 [4:15:16<14:31, 33.53s/it]predicted value: tensor([[1.0781], [0.4727], [0.4531], [0.8125], [1.0781], [1.0547], [0.4453], [0.7695], [0.3008], [0.4668], [1.0703], [0.7305], [0.3359], [0.4883], [0.2852], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3750], [0.4668], [0.7500], [1.0000], [1.0000], [0.4668], [0.7500], [0.3340], [0.3340], [1.0000], [0.5000], [0.3340], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0036468505859375 
loss: 0.0022125244140625loss: 0.00174713134765625 loss: 0.0027923583984375 predicted value: tensor([[0.4590], [0.4961], [0.2988], [0.4863], [0.2793], [1.0781], [0.4863], [0.7930], [0.6250], [0.4844], [0.7656], [0.7969], [0.6289], [0.4492], [0.2871], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3340], [0.6680], [0.2500], [1.0000], [0.2500], [0.6680], [0.5000], [0.4668], [0.6016], [0.7500], [0.5000], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003570556640625 loss: 0.002838134765625 loss: 0.0023956298828125 loss: 0.005126953125 predicted value: tensor([[0.6055], [0.5781], [0.4590], [0.5000], [0.7891], [0.8633], [1.0938], [0.6953], [0.6836], [0.8750], [0.5000], [0.4590], [0.4766], [0.2676], [0.2793], [0.2891]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [0.4668], [0.6680], [0.8008], [1.0000], [0.4668], [0.6016], [0.8008], [0.4004], [0.3340], [0.3340], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002410888671875loss: 0.00543212890625 loss: 0.0035858154296875 loss: 0.0009918212890625 predicted value: tensor([[0.3262], [1.0625], [1.0859], [0.4238], [1.0859], [0.8672], [0.3555], [0.7891], [0.3301], [0.5039], [1.0859], [0.5586], [0.5547], [0.2988], [0.2715], [0.2871]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [1.0000], [1.0000], [0.3750], [1.0000], [0.8320], [0.3340], [0.7500], [0.2500], [0.3340], [1.0000], [0.5000], [0.5000], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001495361328125 loss: 0.002166748046875 loss: 0.00311279296875 loss: 0.00157928466796875 95%|█████████▍| 467/492 [4:15:50<13:57, 33.51s/it] {'loss': 0.0113, 'learning_rate': 1e-05, 'epoch': 0.95} 95%|█████████▍| 467/492 [4:15:50<13:57, 33.51s/it]predicted value: tensor([[0.9453], [1.0859], [0.8203], [0.3711], [1.1172], [0.7461], [0.8164], [0.3789], 
[1.1172], [1.1016], [0.6055], [0.5273], [1.1250], [0.3477], [0.3438], [0.5039]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.8008], [0.3340], [1.0000], [0.6016], [0.8008], [0.3340], [1.0000], [1.0000], [0.5000], [0.4004], [1.0000], [0.2002], [0.2500], [0.3340]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0076904296875 loss: 0.0024871826171875 loss: 0.002899169921875 loss: 0.003692626953125 predicted value: tensor([[0.9648], [1.0781], [0.6523], [0.3047], [1.1172], [0.4707], [1.1172], [0.7695], [0.5039], [0.7188], [1.1172], [0.8867], [0.6602], [0.5742], [0.3105], [0.3047]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8750], [1.0000], [0.8320], [0.2500], [1.0000], [0.4668], [1.0000], [0.6016], [0.4668], [0.6016], [1.0000], [0.8008], [0.4277], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030975341796875 loss: 0.003936767578125 loss: 0.003753662109375 loss: 0.00372314453125 predicted value: tensor([[0.4805], [1.1094], [1.1250], [0.5312], [0.4785], [0.7734], [0.7578], [0.7031], [0.3340], [0.3359], [0.5664], [0.5352], [0.7266], [0.3066], [0.5117], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [0.4668], [0.3750], [0.4668], [0.6016], [0.5000], [0.2002], [0.3340], [0.4004], [0.2852], [0.6016], [0.2500], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0054931640625loss: 0.003448486328125 loss: 0.0057373046875 loss: 0.0035552978515625 predicted value: tensor([[1.1250], [0.8672], [0.6133], [0.5039], [0.5078], [0.1494], [0.3145], [0.3262], [0.8789], [1.1250], [0.7070], [0.5898], [0.1592], [0.6367], [0.3027], [0.3398]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.3750], [0.3750], [0.4668], [0.0400], [0.2002], [0.2500], [0.8008], [1.0000], [0.6016], [0.5000], [0.0400], [0.5000], [0.1670], [0.1670]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.006439208984375 loss: 0.00341796875 loss: 0.0064697265625 loss: 0.003875732421875 95%|█████████▌| 468/492 [4:16:23<13:21, 33.40s/it] {'loss': 0.0174, 'learning_rate': 1e-05, 'epoch': 0.95} 95%|█████████▌| 468/492 [4:16:23<13:21, 33.40s/it]predicted value: tensor([[0.6211], [0.6523], [0.7812], [1.0859], [0.5430], [0.3438], [0.2793], [0.1641], [0.7070], [0.7383], [0.7070], [0.5195], [0.5820], [0.2676], [0.2773], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.8320], [1.0000], [0.4668], [0.2500], [0.2002], [0.0625], [0.6016], [0.8008], [0.6016], [0.4668], [0.4004], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00238037109375 loss: 0.0021514892578125loss: 0.0045166015625 loss: 0.0025634765625 predicted value: tensor([[1.0781], [0.7852], [0.4941], [1.0781], [0.8711], [0.0952], [0.4883], [0.3203], [0.5312], [0.2754], [0.5781], [0.5625], [0.5352], [0.5508], [0.2773], [0.2451]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.7500], [0.3750], [1.0000], [0.8008], [0.0278], [0.4668], [0.2500], [0.3750], [0.3340], [0.4004], [0.4004], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00421142578125 loss: 0.00482177734375 loss: 0.003143310546875 loss: 0.0009765625 predicted value: tensor([[0.6055], [0.4629], [0.7656], [1.0938], [1.0781], [0.3867], [1.0859], [0.7773], [0.7031], [1.0703], [0.6055], [0.4883], [0.3594], [0.7578], [0.2695], [0.2969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3750], [0.3145], [1.0000], [1.0000], [0.3340], [1.0000], [0.7500], [0.6016], [1.0000], [0.2500], [0.3340], [0.2500], [0.3340], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00188446044921875 loss: 0.0096435546875loss: 0.001556396484375 loss: 0.0023040771484375 predicted value: tensor([[0.3301], [0.8750], [0.4902], [0.8320], [0.4668], [0.7070], 
[0.4668], [1.0859], [0.8867], [0.1270], [0.5781], [0.3340], [0.4844], [0.2969], [0.2773], [0.2988]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.8320], [0.4668], [0.8008], [0.4668], [0.6016], [0.4668], [1.0000], [0.8320], [0.0400], [0.4668], [0.2002], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002532958984375 loss: 0.00133514404296875 loss: 0.0018768310546875 loss: 0.0016632080078125 95%|█████████▌| 469/492 [4:16:56<12:48, 33.42s/it] {'loss': 0.0119, 'learning_rate': 1e-05, 'epoch': 0.95} 95%|█████████▌| 469/492 [4:16:56<12:48, 33.42s/it]predicted value: tensor([[0.1904], [0.1709], [0.4023], [0.9961], [1.0078], [1.0078], [0.6602], [0.2031], [0.7656], [0.6016], [0.6016], [0.4355], [0.3965], [0.4043], [0.1797], [0.1914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2002], [0.2002], [0.4668], [1.0000], [1.0000], [1.0000], [0.7500], [0.2500], [0.8008], [0.6016], [0.2002], [0.4004], [0.4004], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010986328125 loss: 0.0028228759765625loss: 0.000762939453125 loss: 0.0003147125244140625 predicted value: tensor([[0.6484], [0.8398], [0.2373], [0.2480], [0.7695], [0.9961], [0.2461], [0.2354], [0.5039], [0.6719], [0.1953], [0.4551], [0.4355], [0.4004], [0.3691], [0.2031]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [0.8320], [0.3340], [0.2500], [0.8008], [1.0000], [0.2500], [0.3340], [0.5547], [0.7500], [0.2002], [0.4004], [0.4004], [0.3340], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000362396240234375 loss: 0.0027618408203125 loss: 0.000652313232421875 loss: 0.000858306884765625 predicted value: tensor([[0.3594], [0.4434], [0.5352], [0.2461], [0.5039], [0.4102], [0.5078], [0.4473], [0.2969], [0.9844], [0.4609], [0.3184], [0.5820], [0.6133], [0.6367], [0.4766]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: 
tensor([[0.4668], [0.7148], [0.5547], [0.2500], [0.4648], [0.4668], [0.4648], [0.4668], [0.3340], [1.0000], [0.5000], [0.3340], [0.6016], [0.5000], [0.7500], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029296875 loss: 0.00189208984375 loss: 0.0019989013671875 loss: 0.00063323974609375 predicted value: tensor([[1.0000], [0.9961], [0.2373], [0.2227], [0.7539], [0.4023], [0.6758], [0.5117], [1.0156], [0.2891], [1.0000], [0.3320], [0.4297], [0.4082], [0.4941], [0.2080]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.3340], [0.2500], [0.8008], [0.4668], [0.6680], [0.4668], [1.0000], [0.3340], [1.0000], [0.2500], [0.4004], [0.4668], [0.5000], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000823974609375 loss: 0.0016021728515625 loss: 0.000946044921875 loss: 0.00052642822265625 96%|█████████▌| 470/492 [4:17:31<12:22, 33.77s/it] {'loss': 0.0052, 'learning_rate': 1e-05, 'epoch': 0.96} 96%|█████████▌| 470/492 [4:17:31<12:22, 33.77s/it]predicted value: tensor([[0.4492], [0.7656], [0.4297], [0.9766], [0.6680], [0.6602], [0.7578], [0.7227], [0.5820], [0.9922], [0.5508], [0.4004], [0.4258], [0.2363], [0.1875], [0.1895]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.4668], [1.0000], [0.8008], [0.6680], [0.5547], [0.7500], [0.6016], [1.0000], [0.6016], [0.4004], [0.4004], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025634765625 loss: 0.000896453857421875 loss: 0.00107574462890625 loss: 0.001220703125 predicted value: tensor([[0.5430], [0.9609], [0.3926], [0.7969], [0.4062], [0.6289], [0.3887], [0.9688], [0.2695], [0.5391], [0.4512], [0.6211], [0.4043], [0.4746], [0.4980], [0.1816]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.8320], [0.3750], [0.6680], [0.8008], [1.0000], [0.2500], [0.5000], [0.5000], [0.5000], [0.3340], [0.5000], [0.3340], [0.2500]], device='cuda:0', 
dtype=torch.bfloat16) loss: 0.0004253387451171875 loss: 0.00372314453125 loss: 0.00118255615234375 loss: 0.000896453857421875 predicted value: tensor([[0.4004], [0.4023], [0.2910], [0.7031], [0.1846], [0.2695], [0.6875], [0.7461], [0.8008], [0.5586], [0.9805], [0.4902], [0.4121], [0.3301], [0.1377], [0.1514]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.3340], [0.8008], [0.3340], [0.2500], [0.7500], [0.8008], [0.8320], [0.5000], [1.0000], [0.2500], [0.4004], [0.3340], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001220703125 loss: 0.0037841796875 loss: 0.0028076171875 loss: 0.00080108642578125 predicted value: tensor([[0.5391], [0.2119], [0.7852], [0.3984], [0.4238], [0.9688], [0.9727], [0.9688], [0.5859], [0.5391], [0.2236], [0.6836], [0.5430], [0.9492], [0.1973], [0.1738]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2002], [0.8320], [0.4668], [0.4668], [1.0000], [1.0000], [1.0000], [0.6016], [0.5000], [0.2500], [0.6016], [0.5000], [1.0000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000415802001953125 loss: 0.0008392333984375 loss: 0.00107574462890625 loss: 0.0004215240478515625 96%|█████████▌| 471/492 [4:18:04<11:47, 33.70s/it] {'loss': 0.0058, 'learning_rate': 1e-05, 'epoch': 0.96} 96%|█████████▌| 471/492 [4:18:04<11:47, 33.70s/it]predicted value: tensor([[0.5938], [0.8008], [1.0078], [0.4609], [0.7812], [0.6992], [0.4531], [0.7773], [0.6406], [0.9844], [0.7266], [1.0078], [0.4082], [0.2207], [0.2344], [0.2471]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [1.0000], [0.3750], [0.8008], [0.7500], [0.4668], [0.6680], [0.5000], [1.0000], [0.7500], [1.0000], [0.5000], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00087738037109375 loss: 0.00084686279296875 loss: 0.001373291015625 loss: 0.00188446044921875 predicted value: tensor([[0.4551], [0.4551], 
[0.4688], [0.5000], [1.0391], [0.6641], [1.0000], [0.6133], [0.6641], [0.5859], [0.5508], [0.5547], [0.4453], [0.1963], [0.4434], [0.2246]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.4668], [0.3750], [0.4668], [1.0000], [0.7500], [1.0000], [0.5000], [0.7500], [0.6016], [0.6016], [0.4668], [0.4004], [0.2002], [0.3340], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00145721435546875 loss: 0.00110626220703125 loss: 0.0003261566162109375 loss: 0.00072479248046875 predicted value: tensor([[0.6133], [0.5781], [1.0156], [0.5195], [0.4727], [0.6719], [1.0000], [0.6367], [0.2930], [0.4980], [1.0156], [0.4766], [0.3848], [0.4902], [0.2178], [0.2100]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [1.0000], [0.4668], [0.2500], [0.6016], [1.0000], [0.6016], [0.2500], [0.4668], [1.0000], [0.4668], [0.3340], [0.5000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0003604888916015625 loss: 0.0011138916015625 loss: 0.00102996826171875 loss: 0.002105712890625 predicted value: tensor([[1.0312], [0.4980], [0.4609], [0.1108], [1.0078], [0.6641], [0.3398], [0.6836], [0.7305], [0.6758], [0.4980], [0.4863], [0.4922], [0.4141], [0.1992], [0.2275]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.4668], [0.0278], [1.0000], [0.6016], [0.2500], [0.5547], [0.6680], [0.7500], [0.4668], [0.3750], [0.2852], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000583648681640625 loss: 0.001953125 loss: 0.000667572021484375 loss: 0.00162506103515625 96%|█████████▌| 472/492 [4:18:38<11:14, 33.71s/it] {'loss': 0.0045, 'learning_rate': 1e-05, 'epoch': 0.96} 96%|█████████▌| 472/492 [4:18:38<11:14, 33.71s/it]predicted value: tensor([[1.0078], [0.7031], [0.4023], [0.4902], [0.4453], [0.7266], [0.4473], [0.9648], [0.4043], [0.6562], [0.3965], [0.4609], [0.4102], [0.3867], [0.2080], [0.1914]], device='cuda:0', 
dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.3750], [0.3750], [0.4004], [0.8008], [0.4668], [1.0000], [0.4004], [0.7500], [0.4004], [0.4004], [0.5000], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00148773193359375 loss: 0.000896453857421875 loss: 0.00086212158203125 loss: 0.00168609619140625 predicted value: tensor([[0.5938], [0.3145], [0.9883], [0.2559], [0.7227], [0.7266], [0.6367], [0.5156], [0.5156], [0.7500], [0.6133], [0.9727], [0.4609], [0.4062], [0.3379], [0.2021]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4648], [0.3340], [1.0000], [0.2500], [0.8008], [0.8008], [0.5547], [0.5000], [0.6016], [0.8008], [0.7500], [1.0000], [0.5000], [0.5000], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0023193359375 loss: 0.00121307373046875loss: 0.0021209716796875 loss: 0.002349853515625 predicted value: tensor([[0.9844], [0.8242], [0.9922], [0.7461], [0.9844], [0.2734], [0.2617], [0.6328], [0.5898], [0.2490], [0.3867], [0.2715], [0.3984], [0.6133], [0.2031], [0.2012]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8320], [1.0000], [0.8008], [1.0000], [0.2500], [0.2500], [0.6680], [0.5000], [0.2002], [0.7500], [0.2500], [0.3340], [0.6016], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00141143798828125 loss: 0.0024261474609375loss: 0.00139617919921875 loss: 0.00115966796875 predicted value: tensor([[0.7969], [0.4688], [0.4883], [0.9961], [0.3398], [0.3770], [0.9922], [0.2852], [0.6875], [0.4102], [0.3105], [0.5977], [0.3438], [0.2100], [0.1924], [0.1934]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.3750], [1.0000], [0.3340], [0.3340], [1.0000], [0.2500], [0.6680], [0.3750], [0.3340], [0.5000], [0.3340], [0.2500], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000919342041015625 loss: 0.0009918212890625 loss: 0.0023193359375 
loss: 0.000530242919921875 96%|█████████▌| 473/492 [4:19:12<10:39, 33.66s/it] {'loss': 0.006, 'learning_rate': 1e-05, 'epoch': 0.96} 96%|█████████▌| 473/492 [4:19:12<10:39, 33.66s/it]predicted value: tensor([[0.5117], [0.3750], [0.7461], [0.4004], [0.9531], [0.6719], [0.3613], [0.6562], [0.6836], [0.2734], [0.3027], [0.3711], [0.1631], [0.1611], [0.1465], [0.1514]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8008], [0.3750], [1.0000], [0.8320], [0.3750], [0.7500], [0.8008], [0.3340], [0.4004], [0.4004], [0.2500], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002349853515625 loss: 0.0014495849609375 loss: 0.00177001953125 loss: 0.00250244140625 predicted value: tensor([[0.4102], [0.3555], [0.4141], [0.7578], [0.7617], [0.9492], [0.6914], [0.2754], [0.3340], [0.2324], [0.4805], [0.5234], [0.0232], [0.3438], [0.1572], [0.1719]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.2715], [0.4668], [0.8008], [0.8320], [1.0000], [0.8008], [0.2002], [0.2500], [0.2500], [0.6016], [0.5547], [0.0400], [0.3340], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001434326171875 loss: 0.0010833740234375loss: 0.000629425048828125 loss: 0.00213623046875 predicted value: tensor([[0.9688], [0.9688], [0.4023], [0.4297], [0.4414], [0.2314], [0.3770], [0.5781], [0.9688], [0.9492], [0.5742], [0.3066], [0.3867], [0.3242], [0.2119], [0.1641]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.4668], [0.4668], [0.2500], [0.3750], [0.6016], [1.0000], [1.0000], [0.8008], [0.3340], [0.5000], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00165557861328125 loss: 0.0013427734375 loss: 0.00058746337890625 loss: 0.0013885498046875 predicted value: tensor([[0.4238], [0.5273], [0.9648], [0.4023], [0.3965], [0.1816], [0.6133], [0.5938], [0.9766], [0.3711], [0.3672], [0.1621], [0.3262], 
[0.6484], [0.1309], [0.1592]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [1.0000], [0.4668], [0.3750], [0.1670], [0.7500], [0.7500], [1.0000], [0.3340], [0.4004], [0.2002], [0.2852], [0.7500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010986328125 loss: 0.000759124755859375 loss: 0.00121307373046875loss: 0.0023193359375 96%|█████████▋| 474/492 [4:19:45<10:01, 33.44s/it] {'loss': 0.0059, 'learning_rate': 1e-05, 'epoch': 0.96} 96%|█████████▋| 474/492 [4:19:45<10:01, 33.44s/it]predicted value: tensor([[0.5273], [0.5195], [0.6523], [0.7500], [0.7070], [0.7344], [0.6484], [0.2383], [0.4746], [0.3730], [0.5547], [0.3477], [0.1797], [0.1904], [0.1680], [0.1768]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.6680], [0.8008], [0.8008], [0.8008], [0.8008], [0.2500], [0.6016], [0.4004], [0.6016], [0.4004], [0.2002], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00133514404296875 loss: 0.0011138916015625loss: 0.0004596710205078125 loss: 0.000804901123046875 predicted value: tensor([[0.7227], [0.4160], [0.6172], [0.2471], [0.9922], [0.6953], [0.9648], [0.4941], [0.5508], [0.4551], [0.7266], [0.3926], [0.2373], [0.3086], [0.1768], [0.1562]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.6680], [0.2500], [1.0000], [0.7148], [1.0000], [0.5000], [0.6016], [0.5000], [0.8008], [0.4004], [0.0400], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00201416015625 loss: 0.00121307373046875loss: 0.004547119140625 loss: 0.000537872314453125 predicted value: tensor([[0.2930], [0.5273], [0.1128], [0.4121], [0.4102], [0.4102], [0.2520], [0.5352], [1.0078], [0.6055], [0.3691], [0.6211], [0.3672], [0.3926], [0.1465], [0.1914]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.5547], [0.2002], [0.3750], [0.4668], [0.3750], [0.1670], 
[0.4277], [1.0000], [0.6016], [0.4004], [0.7500], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004180908203125 loss: 0.000743865966796875 loss: 0.0008544921875 loss: 0.000934600830078125 predicted value: tensor([[0.5625], [0.7148], [1.0000], [0.6406], [0.4648], [0.5469], [0.6172], [0.2969], [0.5703], [1.0156], [0.5352], [0.9688], [0.3477], [0.1553], [0.1748], [0.1680]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [1.0000], [0.6680], [0.4668], [0.5000], [0.8008], [0.2500], [0.6016], [1.0000], [0.6016], [1.0000], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00103759765625 loss: 0.000579833984375 loss: 0.000885009765625loss: 0.000762939453125 97%|█████████▋| 475/492 [4:20:16<09:19, 32.94s/it] {'loss': 0.0055, 'learning_rate': 1e-05, 'epoch': 0.97} 97%|█████████▋| 475/492 [4:20:16<09:19, 32.94s/it]predicted value: tensor([[0.7812], [1.0938], [0.6602], [1.1094], [0.7461], [0.7188], [0.6250], [0.2871], [0.4785], [0.5938], [0.3535], [0.2715], [1.0625], [0.4727], [0.2109], [0.2676]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6016], [1.0000], [0.5547], [1.0000], [0.7500], [0.7500], [0.7500], [0.2002], [0.3750], [0.6016], [0.3340], [0.2500], [1.0000], [0.5000], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.001617431640625 loss: 0.0011444091796875 loss: 0.001708984375 loss: 0.0011444091796875 predicted value: tensor([[1.1016], [1.0781], [0.3535], [0.5625], [1.0859], [0.5742], [1.0781], [0.3008], [0.7578], [0.4824], [0.6797], [0.4551], [0.4727], [0.3945], [0.2539], [0.2412]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.3340], [0.5000], [1.0000], [0.6016], [1.0000], [0.2500], [0.4668], [0.4668], [0.7500], [0.5000], [0.5000], [0.3340], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0018157958984375 loss: 0.0022125244140625 loss: 
0.00091552734375 loss: 0.00096893310546875 predicted value: tensor([[0.6055], [1.0859], [0.3867], [0.5234], [1.0781], [0.4297], [0.3340], [1.0938], [0.2949], [0.3594], [0.4902], [0.6680], [0.7383], [0.4434], [0.4062], [0.3418]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.2500], [0.4668], [1.0000], [0.7500], [0.3340], [1.0000], [0.2002], [0.2002], [0.4004], [0.7500], [0.7500], [0.4004], [0.4004], [0.0400]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00160980224609375 loss: 0.004547119140625 loss: 0.00079345703125 loss: 0.000972747802734375 predicted value: tensor([[0.8320], [1.1016], [0.5039], [1.0781], [0.3555], [0.5273], [0.7852], [1.0938], [0.6641], [0.7891], [0.7188], [0.5195], [0.4102], [0.2451], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.3750], [1.0000], [0.2500], [0.3750], [0.8008], [1.0000], [0.7500], [0.6680], [0.6016], [0.5000], [0.3340], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0016632080078125 loss: 0.0017852783203125 loss: 0.003448486328125 loss: 0.001953125 97%|█████████▋| 476/492 [4:20:50<08:48, 33.06s/it] {'loss': 0.0071, 'learning_rate': 1e-05, 'epoch': 0.97} 97%|█████████▋| 476/492 [4:20:50<08:48, 33.06s/it]predicted value: tensor([[0.5195], [1.0859], [0.2490], [0.3418], [0.8672], [0.7891], [0.3691], [0.8164], [0.5078], [0.7227], [0.5000], [0.4863], [0.4648], [0.6719], [0.3867], [0.2598]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.2002], [0.2500], [0.8008], [0.6680], [0.2500], [0.8008], [0.4004], [0.7500], [0.5000], [0.5000], [0.5000], [0.6016], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002349853515625 loss: 0.00148773193359375loss: 0.000965118408203125 loss: 0.0030975341796875 predicted value: tensor([[0.4609], [1.0938], [0.8594], [1.0938], [0.5156], [1.0938], [0.3164], [0.3496], [1.1016], [0.4648], [0.6289], 
[0.4648], [0.4336], [0.6953], [0.5977], [0.2734]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3145], [1.0000], [0.8008], [1.0000], [0.4668], [1.0000], [0.2500], [0.2500], [1.0000], [0.5000], [0.3750], [0.4004], [0.4004], [0.7500], [0.7500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00136566162109375 loss: 0.0028228759765625 loss: 0.0032501220703125 loss: 0.001617431640625 predicted value: tensor([[0.6055], [0.8125], [0.3359], [0.5391], [0.5586], [0.5273], [0.6992], [0.6875], [0.4434], [0.3340], [0.6523], [0.4199], [0.7891], [0.4805], [0.2637], [0.2695]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6680], [0.2500], [0.3750], [0.8008], [0.4668], [0.6016], [0.4668], [0.4004], [0.2500], [0.7500], [0.4004], [0.8008], [0.5000], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0030975341796875 loss: 0.0016632080078125 loss: 0.001983642578125loss: 0.0033111572265625 predicted value: tensor([[0.6133], [1.1094], [0.5938], [0.5586], [0.1050], [0.7227], [1.0859], [0.6406], [0.4297], [0.7070], [0.6406], [0.5039], [0.5703], [0.2471], [0.2910], [0.2578]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.5547], [0.4668], [0.0204], [0.6016], [1.0000], [0.6016], [0.2852], [0.7500], [0.6016], [0.4004], [0.4668], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00177001953125 loss: 0.00159454345703125 loss: 0.002197265625 loss: 0.001922607421875 97%|█████████▋| 477/492 [4:21:25<08:23, 33.59s/it] {'loss': 0.0086, 'learning_rate': 1e-05, 'epoch': 0.97} 97%|█████████▋| 477/492 [4:21:25<08:23, 33.59s/it]predicted value: tensor([[0.5430], [0.4531], [0.8125], [0.7539], [0.7070], [1.0469], [0.4492], [0.3945], [0.3750], [0.5117], [0.6758], [0.3887], [0.2109], [0.1719], [0.2344], [0.2490]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8008], [0.4668], [0.6680], [1.0000], 
[0.4668], [0.3750], [0.2715], [0.5000], [0.7500], [0.4004], [0.1670], [0.1670], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00048065185546875 loss: 0.0016937255859375loss: 0.0005035400390625 loss: 0.0015106201171875 predicted value: tensor([[0.8047], [0.4980], [0.7891], [1.0312], [0.4766], [0.4609], [0.2363], [0.5312], [0.2715], [0.2090], [0.5586], [0.7578], [1.0078], [0.2559], [0.2119], [0.2139]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.3750], [0.8008], [1.0000], [0.4668], [0.4668], [0.2002], [0.6016], [0.3340], [0.3340], [0.2002], [0.4668], [1.0000], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0010986328125 loss: 0.0040283203125loss: 0.000797271728515625 loss: 0.0008697509765625 predicted value: tensor([[0.4453], [0.5508], [0.4668], [1.0312], [0.8047], [0.5859], [0.5000], [0.6680], [0.2559], [0.4043], [0.4375], [0.2422], [0.2471], [0.2197], [0.2090], [0.2256]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.4668], [1.0000], [0.8008], [0.5000], [0.5547], [0.6680], [0.2002], [0.3340], [0.5000], [0.2852], [0.0625], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0007476806640625 loss: 0.0009918212890625loss: 0.0028839111328125 loss: 0.00078582763671875 predicted value: tensor([[0.4727], [0.8125], [0.7109], [1.0234], [0.7383], [0.7461], [0.4355], [0.5742], [0.4434], [0.4238], [0.6133], [0.0679], [0.4805], [0.2061], [0.2061], [0.2168]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8008], [0.8320], [1.0000], [0.8008], [0.6680], [0.4668], [0.6016], [0.4668], [0.4004], [0.6016], [0.0278], [0.5000], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000682830810546875 loss: 0.00121307373046875 loss: 0.000316619873046875 loss: 0.000492095947265625 97%|█████████▋| 478/492 [4:21:59<07:51, 33.71s/it] {'loss': 0.0048, 'learning_rate': 
1e-05, 'epoch': 0.97} 97%|█████████▋| 478/492 [4:21:59<07:51, 33.71s/it]predicted value: tensor([[0.2041], [0.7227], [0.3750], [0.4180], [0.9258], [0.9180], [0.4258], [0.3301], [0.3164], [0.5352], [0.3457], [0.2871], [0.3066], [0.1240], [0.1045], [0.1069]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.8555], [0.4668], [0.5547], [1.0000], [1.0000], [0.6016], [0.4668], [0.4668], [0.7500], [0.4004], [0.3340], [0.3340], [0.2002], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004058837890625 loss: 0.003387451171875 loss: 0.0026397705078125 loss: 0.0032196044921875 predicted value: tensor([[0.4180], [0.3262], [0.4434], [0.3125], [0.7148], [0.6133], [0.3301], [0.3281], [0.5312], [0.5469], [0.2041], [0.3203], [0.1738], [0.8984], [0.1416], [0.1279]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8320], [0.4668], [0.8008], [0.6680], [0.4668], [0.3750], [0.6016], [0.6680], [0.3340], [0.4004], [0.3340], [1.0000], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002777099609375 loss: 0.005462646484375loss: 0.005706787109375 loss: 0.0029144287109375 predicted value: tensor([[0.9336], [0.6680], [0.6523], [0.9219], [0.9453], [0.3438], [0.3320], [0.4844], [0.7109], [0.2422], [0.4414], [0.1309], [0.3281], [0.3223], [0.1055], [0.1104]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.8008], [1.0000], [1.0000], [0.4668], [0.4668], [0.5000], [0.8008], [0.2500], [0.5000], [0.2002], [0.3340], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025177001953125 loss: 0.0019683837890625 loss: 0.00213623046875 loss: 0.004180908203125 predicted value: tensor([[0.3613], [0.3359], [0.4785], [0.3574], [0.9414], [0.3164], [0.3379], [0.2383], [0.3496], [0.5352], [0.2676], [0.2910], [0.3027], [0.3398], [0.1602], [0.1128]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], 
[0.4668], [0.8320], [0.4668], [1.0000], [0.4668], [0.4668], [0.2500], [0.3750], [0.7500], [0.4004], [0.3340], [0.5000], [0.5000], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0025482177734375 loss: 0.0036163330078125 loss: 0.005828857421875 loss: 0.005462646484375 97%|█████████▋| 479/492 [4:22:32<07:18, 33.76s/it] {'loss': 0.0146, 'learning_rate': 1e-05, 'epoch': 0.97} 97%|█████████▋| 479/492 [4:22:32<07:18, 33.76s/it]predicted value: tensor([[0.4141], [0.8789], [0.8789], [0.8828], [0.6680], [0.5156], [0.6250], [0.1089], [0.3145], [0.3184], [0.5664], [0.2930], [0.2637], [0.2598], [0.0503], [0.1099]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6172], [1.0000], [1.0000], [1.0000], [0.6680], [0.6016], [0.8008], [0.2002], [0.4004], [0.5000], [0.7500], [0.5000], [0.4004], [0.5000], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004852294921875 loss: 0.00872802734375 loss: 0.0057373046875 loss: 0.010986328125 predicted value: tensor([[0.2930], [0.7070], [0.5586], [0.0732], [0.1797], [0.8633], [0.3828], [0.3594], [0.1533], [0.4004], [0.2344], [0.2930], [0.2949], [0.0374], [0.0879], [0.1079]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.6680], [0.2002], [0.3340], [1.0000], [0.8008], [0.4668], [0.2500], [0.6016], [0.3340], [0.5000], [0.5000], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005859375 loss: 0.0093994140625 loss: 0.005279541015625 loss: 0.00677490234375 predicted value: tensor([[0.3203], [0.4004], [0.2891], [0.8672], [0.4922], [0.8867], [0.5938], [0.1631], [0.6914], [0.4141], [0.2100], [0.1904], [0.2354], [0.0889], [0.0889], [0.0737]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.4668], [0.3750], [1.0000], [0.6016], [1.0000], [0.6680], [0.2500], [0.8320], [0.6016], [0.7500], [0.3340], [0.4004], [0.2002], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.0067138671875 loss: 0.005218505859375 loss: 0.0091552734375 loss: 0.00836181640625 predicted value: tensor([[0.1543], [0.3105], [0.1553], [0.2773], [0.6523], [0.1895], [0.1104], [0.4707], [0.1680], [0.2930], [0.2598], [0.2910], [0.3125], [0.2969], [0.1045], [0.2617]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3340], [0.4668], [0.3340], [0.4668], [0.8008], [0.3340], [0.2500], [0.6016], [0.5000], [0.4004], [0.4004], [0.5000], [0.4004], [0.5000], [0.2500], [0.4004]], device='cuda:0', dtype=torch.bfloat16) loss: 0.007476806640625loss: 0.004364013671875 loss: 0.0059814453125 loss: 0.006072998046875 98%|█████████▊| 480/492 [4:23:06<06:44, 33.73s/it] {'loss': 0.0277, 'learning_rate': 1e-05, 'epoch': 0.98} 98%|█████████▊| 480/492 [4:23:06<06:44, 33.73s/it]predicted value: tensor([[0.3516], [0.9023], [0.8867], [0.3926], [0.6250], [0.1680], [0.9102], [0.9023], [0.5234], [0.3398], [0.3320], [0.4004], [0.2754], [0.4941], [0.1445], [0.1079]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [1.0000], [0.5547], [0.8008], [0.2500], [1.0000], [1.0000], [0.6016], [0.5000], [0.5000], [0.5000], [0.2500], [0.7500], [0.2500], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005706787109375 loss: 0.00396728515625 loss: 0.004180908203125 loss: 0.0047607421875 predicted value: tensor([[0.0942], [0.3984], [0.6523], [0.9258], [0.8633], [0.5234], [0.6094], [0.2910], [0.1553], [0.2637], [0.5859], [0.2715], [0.2520], [0.3516], [0.1226], [0.1367]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.5547], [0.6680], [1.0000], [1.0000], [0.6680], [0.7148], [0.3145], [0.2500], [0.6016], [0.6016], [0.3750], [0.3340], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003753662109375 loss: 0.0040283203125 loss: 0.005035400390625 loss: 0.0032958984375 predicted value: tensor([[0.4121], [0.8867], [0.4395], [0.6875], [0.7305], [0.1055], [0.9062], [0.1367], 
[0.6367], [0.9141], [0.8828], [0.2412], [0.1670], [0.1108], [0.1152], [0.1270]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.6016], [0.8008], [0.8008], [0.1670], [1.0000], [0.2500], [0.6016], [1.0000], [1.0000], [0.3340], [0.0278], [0.1426], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004425048828125 loss: 0.002593994140625loss: 0.003448486328125 loss: 0.006622314453125 predicted value: tensor([[0.9062], [0.8789], [0.1680], [0.8672], [0.2275], [0.8945], [0.6172], [0.3379], [0.5820], [0.4883], [0.3340], [0.3535], [0.5820], [0.3125], [0.1138], [0.1318]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.3340], [1.0000], [0.3340], [1.0000], [0.6680], [0.3340], [0.7500], [0.6680], [0.4668], [0.5000], [0.6016], [0.4004], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0040283203125 loss: 0.003265380859375 loss: 0.00311279296875 loss: 0.00286865234375 98%|█████████▊| 481/492 [4:23:39<06:09, 33.63s/it] {'loss': 0.0163, 'learning_rate': 1e-05, 'epoch': 0.98} 98%|█████████▊| 481/492 [4:23:39<06:09, 33.63s/it]predicted value: tensor([[0.7383], [0.4453], [0.2832], [0.9922], [0.5039], [0.9883], [0.4766], [0.2773], [0.4531], [0.3496], [0.3945], [0.6328], [0.3320], [0.2578], [0.5508], [0.2168]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.4668], [0.2500], [1.0000], [0.4277], [1.0000], [0.8008], [0.2500], [0.4668], [0.3340], [0.4004], [0.6016], [0.5000], [0.2500], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00159454345703125 loss: 0.0038604736328125 loss: 0.000766754150390625 loss: 0.0007781982421875 predicted value: tensor([[0.5234], [0.2969], [0.7891], [0.2949], [0.4355], [0.3730], [0.5898], [0.4492], [0.6367], [0.9922], [0.3438], [0.4082], [0.2100], [0.3516], [0.1758], [0.2129]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.2500], [0.8320], 
[0.2500], [0.4668], [0.3145], [0.6016], [0.2500], [0.7500], [1.0000], [0.4004], [0.4668], [0.2002], [0.4004], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0027923583984375loss: 0.0005950927734375 loss: 0.001983642578125 loss: 0.00250244140625 predicted value: tensor([[0.5156], [0.3887], [0.2305], [0.9570], [0.4160], [0.2305], [0.5312], [0.7539], [0.3945], [0.9766], [0.5781], [0.2949], [0.6758], [0.3867], [0.2451], [0.1875]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.3750], [0.2002], [1.0000], [0.4668], [0.2500], [0.6016], [0.8008], [0.4668], [1.0000], [0.6016], [0.2500], [0.7500], [0.4004], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002044677734375 loss: 0.0010833740234375 loss: 0.00168609619140625 loss: 0.00075531005859375 predicted value: tensor([[0.4160], [0.3730], [0.7305], [0.3906], [0.3027], [0.4082], [0.4160], [0.3066], [0.6562], [0.5547], [0.5938], [0.9727], [0.2236], [0.3516], [0.2061], [0.2656]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.8008], [0.3750], [0.2500], [0.4668], [0.3750], [0.3340], [0.6016], [0.6016], [0.7500], [1.0000], [0.2002], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0029144287109375 loss: 0.000843048095703125loss: 0.000957489013671875 loss: 0.000942230224609375 98%|█████████▊| 482/492 [4:24:13<05:35, 33.54s/it] {'loss': 0.0065, 'learning_rate': 1e-05, 'epoch': 0.98} 98%|█████████▊| 482/492 [4:24:13<05:35, 33.54s/it]predicted value: tensor([[0.5508], [0.5977], [0.4121], [0.9141], [0.5508], [1.1094], [0.9375], [1.1016], [0.6914], [0.8047], [0.3965], [0.5117], [0.8203], [0.5938], [0.3438], [0.3145]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.5547], [0.3340], [0.8008], [0.4668], [1.0000], [0.8008], [1.0000], [0.6016], [0.7500], [0.2002], [0.3340], [0.7500], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) 
loss: 0.006103515625 loss: 0.0036468505859375 loss: 0.00494384765625 loss: 0.004425048828125 predicted value: tensor([[1.0781], [0.5586], [0.8789], [0.5234], [0.7578], [0.7148], [0.7188], [0.4824], [0.5703], [0.5664], [1.0859], [0.4922], [0.6211], [0.4941], [0.3535], [0.3340]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.8008], [0.3750], [0.6016], [0.6016], [0.6016], [0.4004], [0.3750], [0.4004], [1.0000], [0.4004], [0.6680], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.003936767578125loss: 0.0052490234375 loss: 0.0035552978515625 loss: 0.007476806640625 predicted value: tensor([[0.5938], [0.5273], [0.9492], [1.1094], [0.9531], [1.1094], [0.5273], [0.5742], [1.0703], [0.6914], [0.4707], [0.5312], [0.5703], [0.5156], [0.3633], [0.3535]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.4668], [0.8320], [1.0000], [0.8008], [1.0000], [0.4668], [0.6680], [1.0000], [0.6016], [0.3340], [0.4004], [0.5000], [0.4004], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.004119873046875 loss: 0.0026702880859375 loss: 0.006866455078125 loss: 0.0037841796875 predicted value: tensor([[0.6055], [1.0781], [0.5391], [1.1016], [0.5859], [1.0938], [0.5508], [0.8359], [0.5469], [0.6758], [0.4531], [0.3711], [0.5625], [0.5625], [0.3203], [0.3359]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [1.0000], [0.3750], [1.0000], [0.3750], [1.0000], [0.6680], [0.4668], [0.4668], [0.4277], [0.2002], [0.2500], [0.4004], [0.4004], [0.1426], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037689208984375 loss: 0.008056640625 loss: 0.00457763671875 loss: 0.004241943359375 98%|█████████▊| 483/492 [4:24:47<05:02, 33.60s/it] {'loss': 0.0194, 'learning_rate': 1e-05, 'epoch': 0.98} 98%|█████████▊| 483/492 [4:24:47<05:02, 33.60s/it]predicted value: tensor([[0.6875], [1.0078], [0.9883], [0.9414], [1.0156], [1.1328], [0.9102], 
[0.5273], [0.7656], [0.5312], [1.1484], [0.8086], [0.7930], [0.3945], [0.4180], [0.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.8008], [0.8008], [0.8320], [1.0000], [0.8008], [0.2500], [0.5000], [0.3340], [1.0000], [0.7500], [0.6016], [0.1670], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00994873046875 loss: 0.00848388671875loss: 0.010986328125 loss: 0.0133056640625 predicted value: tensor([[0.8711], [0.9922], [0.6680], [0.6836], [0.7969], [0.6523], [0.8398], [1.1562], [0.6211], [0.3145], [0.5664], [0.7852], [0.6797], [0.6211], [0.4160], [0.4336]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.8320], [0.5547], [0.5547], [0.6680], [0.4668], [0.7500], [1.0000], [0.3750], [0.0400], [0.3340], [0.5000], [0.4004], [0.2500], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00701904296875 loss: 0.01080322265625loss: 0.00897216796875 loss: 0.008056640625 predicted value: tensor([[0.6953], [0.9336], [0.5586], [1.1641], [0.6055], [0.4434], [0.5781], [0.8984], [0.6680], [0.6406], [1.1328], [0.8516], [0.7031], [0.6289], [0.4062], [0.4043]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8008], [0.3750], [1.0000], [0.3750], [0.2500], [0.2002], [0.6680], [0.4668], [0.5000], [1.0000], [0.5000], [0.5000], [0.4004], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.008544921875 loss: 0.01165771484375 loss: 0.0084228515625 loss: 0.00927734375 predicted value: tensor([[0.6094], [0.8789], [0.4473], [1.1641], [0.6172], [0.9570], [0.5195], [0.8242], [0.4609], [0.9727], [0.6172], [0.7930], [0.6016], [0.6680], [0.4258], [0.4043]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.7148], [0.3340], [1.0000], [0.3750], [0.6680], [0.2500], [0.7500], [0.2500], [0.8008], [0.3750], [0.4277], [0.4004], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.00994873046875 loss: 0.01171875 loss: 0.00946044921875 loss: 0.01092529296875 98%|█████████▊| 484/492 [4:25:21<04:29, 33.75s/it] {'loss': 0.0394, 'learning_rate': 1e-05, 'epoch': 0.98} 98%|█████████▊| 484/492 [4:25:21<04:29, 33.75s/it]predicted value: tensor([[1.1562], [0.9570], [0.9141], [1.1484], [0.9062], [0.9883], [1.1484], [0.9023], [0.8047], [1.1484], [0.4316], [0.5742], [0.5742], [0.2285], [0.3789], [0.3867]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8008], [0.8008], [1.0000], [0.8008], [0.8008], [1.0000], [0.8008], [0.7500], [1.0000], [0.2002], [0.4004], [0.4004], [0.0400], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00982666015625 loss: 0.006317138671875 loss: 0.007049560546875 loss: 0.01055908203125 predicted value: tensor([[0.5273], [0.4473], [0.7852], [0.6133], [0.7070], [0.8438], [0.8086], [0.8672], [0.8516], [0.8438], [0.7695], [0.6602], [0.7812], [0.6055], [0.3770], [0.4219]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.2002], [0.8008], [0.4668], [0.4668], [0.6016], [0.6680], [0.7500], [0.7500], [0.6680], [0.6016], [0.5000], [0.3750], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00921630859375 loss: 0.01043701171875 loss: 0.0106201171875 loss: 0.01226806640625 predicted value: tensor([[0.6875], [0.6797], [0.6367], [0.6172], [1.1406], [1.1484], [0.5000], [0.4395], [0.8672], [0.6445], [0.5898], [0.5508], [0.6250], [0.4102], [0.3730], [0.4062]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [0.4668], [0.4668], [1.0000], [1.0000], [0.2500], [0.2500], [0.7500], [0.4668], [0.5000], [0.5000], [0.5000], [0.2500], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00823974609375 loss: 0.00933837890625 loss: 0.006072998046875 loss: 0.01165771484375 predicted value: tensor([[1.1250], [1.1172], [0.6250], [0.9648], [1.1562], [0.5312], [1.1406], [0.9531], [1.1406], 
[0.4277], [0.6406], [0.6094], [0.5117], [0.3750], [0.3965], [0.4160]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.4668], [0.8008], [1.0000], [0.3340], [1.0000], [0.8008], [1.0000], [0.2002], [0.5000], [0.5000], [0.3340], [0.2002], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00653076171875 loss: 0.0091552734375 loss: 0.00830078125 loss: 0.0106201171875 99%|█████████▊| 485/492 [4:25:55<03:56, 33.78s/it] {'loss': 0.0366, 'learning_rate': 1e-05, 'epoch': 0.99} 99%|█████████▊| 485/492 [4:25:55<03:56, 33.78s/it]predicted value: tensor([[0.9570], [0.5508], [0.9297], [1.0859], [0.4023], [0.5469], [0.4199], [0.7422], [0.7305], [0.4180], [0.7383], [0.5156], [0.2520], [0.3398], [0.5156], [0.3398]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3750], [0.8320], [1.0000], [0.2002], [0.3750], [0.3340], [0.7500], [0.6016], [0.2500], [0.6016], [0.4004], [0.0400], [0.2002], [0.4004], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005096435546875 loss: 0.004852294921875 loss: 0.003448486328125 loss: 0.004974365234375 predicted value: tensor([[1.0781], [0.5547], [0.6172], [0.8203], [0.6523], [1.0859], [0.9258], [0.6953], [0.4004], [0.5781], [0.4590], [0.5508], [0.4414], [0.6406], [0.5312], [0.3906]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.5547], [0.8008], [0.4668], [1.0000], [0.8320], [0.5000], [0.2500], [0.4668], [0.2500], [0.5000], [0.2500], [0.3340], [0.3340], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00408935546875 loss: 0.005889892578125 loss: 0.00408935546875 loss: 0.0040283203125 predicted value: tensor([[0.9336], [0.5469], [1.0781], [0.5664], [0.5938], [0.6289], [0.8398], [0.4551], [0.5000], [0.6094], [0.7578], [0.5859], [0.5469], [0.5625], [0.3105], [0.3574]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8555], [0.3750], [1.0000], [0.4668], [0.4668], 
[0.5547], [0.6680], [0.3340], [0.7500], [0.3340], [0.6016], [0.8008], [0.4668], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0059814453125loss: 0.003265380859375 loss: 0.0032958984375 loss: 0.005706787109375 predicted value: tensor([[0.5430], [0.5000], [0.5938], [0.5234], [0.9727], [0.3418], [0.8242], [1.0859], [0.4727], [0.6680], [0.7305], [0.4570], [0.4727], [0.5312], [0.3555], [0.3730]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.3750], [0.3750], [0.4668], [0.8320], [0.2002], [0.6680], [1.0000], [0.3340], [0.4277], [0.6016], [0.2002], [0.3340], [0.3340], [0.2002], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00628662109375loss: 0.0032806396484375 loss: 0.0037994384765625 loss: 0.00506591796875 99%|█████████▉| 486/492 [4:26:28<03:21, 33.63s/it] {'loss': 0.0183, 'learning_rate': 1e-05, 'epoch': 0.99} 99%|█████████▉| 486/492 [4:26:28<03:21, 33.63s/it]predicted value: tensor([[0.4844], [1.0000], [0.9609], [0.5586], [0.5156], [0.7188], [0.4902], [0.4355], [0.4707], [0.4746], [0.5156], [0.5938], [0.2324], [0.5430], [0.2461], [0.2520]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [1.0000], [0.6016], [0.6172], [0.7500], [0.4668], [0.4668], [0.5000], [0.5000], [0.6016], [0.6016], [0.1670], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00099945068359375 loss: 0.00171661376953125 loss: 0.00083160400390625 loss: 0.00016117095947265625 predicted value: tensor([[0.5000], [0.7344], [0.5156], [0.6914], [0.7344], [0.2617], [0.6172], [0.9727], [0.2305], [0.2500], [0.4453], [0.4238], [0.2578], [0.2402], [0.2402], [0.2344]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.8320], [0.5547], [0.8008], [0.6680], [0.2500], [0.8008], [1.0000], [0.0713], [0.2852], [0.5000], [0.4004], [0.2002], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00160980224609375loss: 
0.00159454345703125 loss: 0.00140380859375 loss: 0.000518798828125 predicted value: tensor([[0.2490], [0.4355], [0.5156], [0.5078], [0.4297], [0.4648], [0.8125], [0.6289], [0.6836], [0.3789], [0.9727], [0.0513], [0.2451], [0.1885], [0.2363], [0.2109]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.2500], [0.4668], [0.5547], [0.5547], [0.4668], [0.4668], [0.8320], [0.4668], [0.7500], [0.3340], [1.0000], [0.0400], [0.2500], [0.1670], [0.2500], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0019073486328125 loss: 0.000911712646484375 loss: 0.000659942626953125 loss: 0.004058837890625 predicted value: tensor([[0.9922], [0.2598], [0.9844], [0.7500], [0.9961], [0.6602], [0.6719], [0.7539], [0.9648], [0.4297], [0.9805], [0.4141], [0.4277], [0.3516], [0.2373], [0.2412]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.2002], [1.0000], [0.8008], [1.0000], [0.7500], [0.6016], [0.8008], [1.0000], [0.4004], [1.0000], [0.4004], [0.4004], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.000885009765625 loss: 0.00067138671875 loss: 0.00121307373046875 loss: 0.000762939453125 99%|█████████▉| 487/492 [4:27:01<02:48, 33.64s/it] {'loss': 0.005, 'learning_rate': 1e-05, 'epoch': 0.99} 99%|█████████▉| 487/492 [4:27:01<02:48, 33.64s/it]predicted value: tensor([[ 0.8594], [ 0.3359], [ 0.5078], [ 0.1543], [ 0.6289], [ 0.4766], [ 0.2910], [ 0.8945], [ 0.3633], [ 0.2715], [ 0.2637], [ 0.4922], [ 0.2559], [-0.0179], [ 0.0053], [ 0.0596]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [0.6680], [0.2500], [0.8320], [0.7500], [0.4668], [1.0000], [0.6680], [0.3750], [0.5000], [0.7500], [0.4004], [0.0400], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0052490234375 loss: 0.00823974609375 loss: 0.007720947265625 loss: 0.006378173828125 predicted value: tensor([[0.8633], [0.5312], [0.5273], [0.8945], [0.3125], [0.2793], [0.3887], 
[0.1611], [0.5898], [0.3301], [0.3633], [0.3770], [0.2090], [0.0344], [0.0454], [0.0796]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.7500], [0.6680], [1.0000], [0.4668], [0.3750], [0.5547], [0.2500], [0.8008], [0.4668], [0.5000], [0.6016], [0.4004], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.005157470703125 loss: 0.0106201171875 loss: 0.005462646484375 loss: 0.006561279296875 predicted value: tensor([[0.3594], [0.8672], [0.5742], [0.3965], [0.8906], [0.2051], [0.4512], [0.6016], [0.2090], [0.5625], [0.1592], [0.1807], [0.2480], [0.2188], [0.3320], [0.0654]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4004], [1.0000], [0.6680], [0.5000], [1.0000], [0.3340], [0.6016], [0.8008], [0.3340], [0.7500], [0.3340], [0.4004], [0.5000], [0.5000], [0.5000], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00701904296875loss: 0.005859375 loss: 0.00506591796875 loss: 0.004486083984375 predicted value: tensor([[0.8711], [0.2812], [0.8672], [0.7266], [0.3652], [0.1118], [0.0806], [0.1553], [0.4395], [0.6641], [0.2100], [0.2715], [0.1758], [0.0330], [0.0315], [0.0664]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.4668], [1.0000], [0.8320], [0.4668], [0.2500], [0.2500], [0.3340], [0.7500], [0.8008], [0.2002], [0.5000], [0.5000], [0.2002], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006988525390625 loss: 0.006500244140625 loss: 0.008056640625loss: 0.0072021484375 99%|█████████▉| 488/492 [4:27:34<02:13, 33.42s/it] {'loss': 0.0266, 'learning_rate': 1e-05, 'epoch': 0.99} 99%|█████████▉| 488/492 [4:27:34<02:13, 33.42s/it]predicted value: tensor([[ 0.8477], [ 0.4316], [ 0.4531], [ 0.4277], [ 0.5156], [ 0.1543], [ 0.8516], [ 0.3984], [ 0.3066], [ 0.1035], [ 0.4121], [-0.1123], [ 0.1357], [-0.1021], [-0.0791], [-0.0630]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.6680], 
[0.6680], [0.6016], [0.7500], [0.3145], [1.0000], [0.5000], [0.5000], [0.4004], [0.7500], [0.2002], [0.4004], [0.2002], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01214599609375 loss: 0.0140380859375 loss: 0.012939453125 loss: 0.01287841796875 predicted value: tensor([[ 0.2793], [ 0.3242], [ 0.8438], [ 0.4551], [ 0.3555], [ 0.0640], [ 0.3809], [ 0.2930], [ 0.3691], [ 0.3535], [ 0.1465], [ 0.1152], [ 0.2256], [ 0.1992], [-0.0781], [-0.0811]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.5547], [1.0000], [0.5703], [0.5547], [0.3340], [0.5000], [0.5000], [0.6016], [0.4668], [0.4004], [0.4004], [0.5000], [0.5000], [0.1670], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01214599609375 loss: 0.0133056640625 loss: 0.02197265625 loss: 0.0166015625 predicted value: tensor([[ 0.6484], [ 0.7969], [ 0.1846], [ 0.2002], [ 0.2197], [ 0.8477], [ 0.2754], [ 0.2598], [ 0.1030], [ 0.4004], [ 0.8203], [ 0.1963], [ 0.1279], [ 0.0796], [-0.0967], [-0.0366]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [1.0000], [0.3750], [0.4668], [0.4668], [1.0000], [0.8008], [0.5000], [0.2852], [0.6680], [1.0000], [0.4004], [0.3340], [0.4004], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01055908203125 loss: 0.0166015625 loss: 0.01556396484375 loss: 0.01385498046875 predicted value: tensor([[ 0.8516], [ 0.5742], [ 0.5234], [ 0.2148], [ 0.8359], [ 0.5625], [ 0.0601], [ 0.1436], [ 0.0386], [ 0.5430], [ 0.2988], [ 0.5625], [ 0.1465], [ 0.0723], [-0.0952], [-0.0588]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.8555], [0.8008], [0.4668], [1.0000], [0.8008], [0.3340], [0.4004], [0.3340], [0.8008], [0.6016], [0.6680], [0.4004], [0.4004], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01556396484375 loss: 0.01611328125 loss: 0.01531982421875 loss: 0.01458740234375 99%|█████████▉| 489/492 [4:28:09<01:41, 33.68s/it] 
{'loss': 0.0585, 'learning_rate': 1e-05, 'epoch': 0.99} 99%|█████████▉| 489/492 [4:28:09<01:41, 33.68s/it]predicted value: tensor([[ 3.4570e-01], [ 5.6250e-01], [ 3.1836e-01], [ 2.9883e-01], [ 2.5391e-01], [ 5.9766e-01], [ 6.2891e-01], [ 4.2578e-01], [ 6.3965e-02], [ 1.4258e-01], [ 1.4844e-01], [ 8.5938e-01], [ 1.2158e-01], [-8.2397e-04], [-1.2451e-01], [-8.4473e-02]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.6680], [0.5547], [0.4668], [0.4668], [0.8008], [0.8008], [0.6016], [0.3340], [0.4004], [0.5000], [1.0000], [0.7500], [0.2500], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.015380859375 loss: 0.0189208984375 loss: 0.0147705078125 loss: 0.0157470703125 predicted value: tensor([[ 0.2930], [ 0.0160], [ 0.5312], [ 0.2188], [ 0.4609], [ 0.6328], [ 0.5625], [ 0.5664], [ 0.2246], [ 0.5820], [ 0.1650], [ 0.8945], [ 0.3574], [ 0.1006], [-0.1094], [-0.0718]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [0.3340], [0.8320], [0.4668], [0.8008], [0.8320], [0.8008], [0.8008], [0.3750], [0.8008], [0.2500], [1.0000], [0.6016], [0.4004], [0.1670], [0.2500]], device='cuda:0', dtype=torch.bfloat16) loss: 0.016845703125 loss: 0.0157470703125 loss: 0.017333984375 loss: 0.0198974609375 predicted value: tensor([[ 0.5781], [ 0.2217], [-0.0127], [ 0.5898], [ 0.8711], [ 0.8789], [ 0.8594], [ 0.4785], [ 0.4414], [ 0.5469], [ 0.3770], [ 0.2734], [ 0.1445], [ 0.0947], [-0.1123], [-0.0796]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8320], [0.4668], [0.2500], [0.8008], [1.0000], [1.0000], [1.0000], [0.7500], [0.7500], [0.4668], [0.7500], [0.6016], [0.5000], [0.5000], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.012451171875 loss: 0.0189208984375 loss: 0.02001953125 loss: 0.01556396484375 predicted value: tensor([[ 0.3301], [ 0.8789], [ 0.2754], [ 0.6680], [ 0.5078], [ 0.9102], [ 0.4102], [ 0.2041], [ 0.8750], [ 0.2402], [ 0.1553], [ 
0.4531], [ 0.0571], [-0.0908], [-0.1177], [-0.1108]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.5547], [1.0000], [0.4668], [0.8320], [0.8008], [1.0000], [0.5703], [0.3750], [1.0000], [0.3340], [0.3750], [0.7500], [0.4004], [0.2002], [0.1670], [0.1426]], device='cuda:0', dtype=torch.bfloat16) loss: 0.01226806640625loss: 0.0186767578125 loss: 0.0185546875 loss: 0.015869140625 100%|█████████▉| 490/492 [4:28:42<01:07, 33.57s/it] {'loss': 0.0667, 'learning_rate': 1e-05, 'epoch': 1.0} 100%|█████████▉| 490/492 [4:28:42<01:07, 33.57s/it]predicted value: tensor([[ 0.6328], [ 0.1289], [ 0.6250], [ 0.3477], [ 0.0046], [ 0.3672], [ 0.0098], [ 0.4590], [ 0.6172], [ 0.1797], [ 0.1904], [ 0.2314], [ 0.1050], [-0.2559], [-0.0125], [-0.0532]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7148], [0.3340], [0.8008], [0.4668], [0.2500], [0.6016], [0.2002], [0.3340], [0.8008], [0.4004], [0.5000], [0.5000], [0.3340], [0.0625], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.006927490234375 loss: 0.0107421875 loss: 0.012451171875 loss: 0.00860595703125 predicted value: tensor([[ 0.9492], [ 0.1221], [ 0.7422], [ 0.3145], [ 0.1426], [ 0.2617], [ 0.4531], [-0.0198], [-0.0459], [ 0.9492], [ 0.0996], [ 0.1562], [ 0.1572], [ 0.1748], [-0.0540], [-0.0557]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [0.3340], [0.8320], [0.4668], [0.3340], [0.3750], [0.6016], [0.2500], [0.2002], [1.0000], [0.2500], [0.4004], [0.4004], [0.3340], [0.2002], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.00927734375 loss: 0.00909423828125loss: 0.0079345703125 loss: 0.01397705078125 predicted value: tensor([[ 0.2988], [ 0.6992], [ 0.3828], [ 0.6250], [ 0.3027], [ 0.0967], [ 0.3457], [ 0.6133], [ 0.3691], [ 0.3320], [ 0.1953], [ 0.4785], [ 0.1826], [-0.0349], [-0.0688], [-0.0583]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.3750], [0.8008], [0.3750], [0.5312], 
[0.3750], [0.2002], [0.3750], [0.8008], [0.5547], [0.4668], [0.4004], [0.6016], [0.4004], [0.2500], [0.2002], [0.1670]], device='cuda:0', dtype=torch.bfloat16) loss: 0.010986328125 loss: 0.006866455078125 loss: 0.00811767578125 loss: 0.0093994140625 predicted value: tensor([[ 0.2891], [ 0.6875], [ 0.2930], [ 0.2598], [ 0.5625], [ 0.9297], [ 0.9609], [ 0.6914], [ 0.0391], [ 0.9648], [ 0.2715], [ 0.1982], [ 0.9688], [ 0.1836], [-0.0593], [-0.0522]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.4668], [0.8320], [0.3750], [0.4668], [0.6680], [1.0000], [1.0000], [0.8008], [0.2002], [1.0000], [0.5000], [0.4004], [1.0000], [0.4004], [0.2500], [0.2002]], device='cuda:0', dtype=torch.bfloat16) loss: 0.013427734375 loss: 0.00860595703125 loss: 0.008056640625 loss: 0.007171630859375 100%|█████████▉| 491/492 [4:29:16<00:33, 33.56s/it] {'loss': 0.0379, 'learning_rate': 1e-05, 'epoch': 1.0} 100%|█████████▉| 491/492 [4:29:16<00:33, 33.56s/it]predicted value: tensor([[ 1.0547], [ 1.0625], [ 0.5781], [ 0.4648], [ 0.2275], [ 0.4629], [ 0.4922], [-0.1543], [ 0.4609], [ 0.1973], [ 0.3281], [ 0.2363], [ 0.0708], [ 0.0383], [ 0.4062], [ 0.7969]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[1.0000], [1.0000], [0.6172], [0.4668], [0.2500], [0.4668], [0.5000], [0.0625], [0.5000], [0.3340], [0.4004], [0.3340], [0.2002], [0.2002], [0.3750], [0.8008]], device='cuda:0', dtype=torch.bfloat16) loss: 0.002838134765625 loss: 0.003936767578125 loss: 0.002105712890625 loss: 0.0029449462890625 predicted value: tensor([[0.6289], [0.5625], [0.1079], [0.0684], [0.0645], [0.0889], [0.5508], [1.0703], [0.4746], [1.0547], [0.7969], [0.6562], [1.0781], [1.0859], [0.1504], [0.2148]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.7500], [0.3750], [0.2002], [0.2500], [0.2002], [0.2500], [0.4648], [1.0000], [0.4668], [1.0000], [0.8008], [0.6680], [1.0000], [1.0000], [0.2500], [0.2852]], device='cuda:0', dtype=torch.bfloat16) loss: 
0.0030670166015625 loss: 0.0028076171875 loss: 0.0036773681640625 loss: 0.00286865234375 predicted value: tensor([[0.8164], [0.7812], [0.4570], [0.5391], [0.5625], [0.2539], [0.6562], [0.1797], [0.3047], [0.2773], [0.0491], [0.0179], [0.2314], [0.4863], [0.4238], [0.4570]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.8008], [0.8008], [0.4668], [0.4668], [0.6016], [0.3340], [0.7500], [0.3340], [0.5000], [0.4004], [0.2002], [0.1670], [0.2500], [0.4668], [0.4668], [0.4668]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0020751953125 loss: 0.0023040771484375 loss: 0.0027313232421875 loss: 0.0032958984375 predicted value: tensor([[ 0.6133], [ 0.0444], [ 0.0938], [ 1.0625], [ 1.0938], [ 0.2021], [ 0.1338], [ 0.6406], [ 0.7617], [ 1.1016], [ 0.5664], [ 1.0469], [-0.1260], [ 0.2695], [ 0.0654], [ 0.1768]], device='cuda:0', dtype=torch.bfloat16, grad_fn=) value label: tensor([[0.6016], [0.1670], [0.2500], [1.0000], [1.0000], [0.2500], [0.2500], [0.6680], [0.8008], [1.0000], [0.5000], [1.0000], [0.0400], [0.3340], [0.2002], [0.5000]], device='cuda:0', dtype=torch.bfloat16) loss: 0.0037841796875 loss: 0.00225830078125 loss: 0.001434326171875 loss: 0.005584716796875 100%|██████████| 492/492 [4:29:52<00:00, 34.46s/it] {'loss': 0.0119, 'learning_rate': 1e-05, 'epoch': 1.0} 100%|██████████| 492/492 [4:29:52<00:00, 34.46s/it] {'train_runtime': 16192.5824, 'train_samples_per_second': 7.776, 'train_steps_per_second': 0.03, 'train_loss': 0.08519280441408235, 'epoch': 1.0} 100%|██████████| 492/492 [4:29:52<00:00, 34.46s/it] 100%|██████████| 492/492 [4:29:52<00:00, 32.91s/it] Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41. 
Non-default generation parameters: {'max_length': 4096}
[2024-12-20 00:06:30,472] [INFO] [launch.py:347:main] Process 702174 exits successfully.
[2024-12-20 00:06:35,479] [INFO] [launch.py:347:main] Process 702175 exits successfully.
[2024-12-20 00:06:35,480] [INFO] [launch.py:347:main] Process 702176 exits successfully.
wandb: 🚀 View run upbeat-violet-6 at: https://wandb.ai/894699297roy-wuhan-university/llava_prm_sft/runs/wrrxxq6h
wandb: Find logs at: wandb/run-20241219_193455-wrrxxq6h/logs
[2024-12-20 00:07:09,515] [INFO] [launch.py:347:main] Process 702173 exits successfully.