|
[2025-01-09 15:54:48,830] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
df: /root/.triton/autotune: No such file or directory |
|
git root error: Cmd('git') failed due to: exit code(128) |
|
cmdline: git rev-parse --show-toplevel |
|
stderr: 'fatal: detected dubious ownership in repository at '/workspace' |
|
To add an exception for this directory, call: |
|
|
|
git config --global --add safe.directory /workspace' |
|
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information. |
|
wandb: Currently logged in as: nguyenducphu201101. Use `wandb login --relogin` to force relogin |
|
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/pydantic/main.py:314: UserWarning: Pydantic serializer warnings: |
|
Expected `list[str]` but got `tuple` - serialized value may not be as expected |
|
return self.__pydantic_serializer__.to_python( |
|
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/pydantic/main.py:314: UserWarning: Pydantic serializer warnings: |
|
Expected `list[str]` but got `tuple` - serialized value may not be as expected |
|
return self.__pydantic_serializer__.to_python( |
|
wandb: Tracking run with wandb version 0.19.1 |
|
wandb: Run data is saved locally in /workspace/wandb/run-20250109_155451-37f2bdb2-2552-4958-b0be-7186fa7cfbe6 |
|
wandb: Run `wandb offline` to turn off syncing. |
|
wandb: Syncing run test-dpo |
|
wandb: ⭐️ View project at https://wandb.ai/nguyenducphu201101/llm-training-platform |
|
wandb: 🚀 View run at https://wandb.ai/nguyenducphu201101/llm-training-platform/runs/37f2bdb2-2552-4958-b0be-7186fa7cfbe6 |
|
Generating train split: 0%| | 0/1545 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 1545/1545 [00:00<00:00, 119090.67 examples/s] |
|
Generating test split: 0%| | 0/89 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 89/89 [00:00<00:00, 37679.73 examples/s] |
|
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, ref_model_init_kwargs. Will not be supported from version '0.13.0'. |
|
|
|
Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead. |
|
warnings.warn(message, FutureWarning) |
|
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:262: UserWarning: You passed `model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`. |
|
warnings.warn( |
|
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:287: UserWarning: You passed `ref_model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`. |
|
warnings.warn( |
|
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:312: UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you. |
|
warnings.warn( |
|
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. |
|
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:319: UserWarning: You passed a ref model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` |
|
warnings.warn( |
|
Extracting prompt from train dataset: 0%| | 0/1545 [00:00<?, ? examples/s]
Extracting prompt from train dataset: 65%|██████▍ | 1000/1545 [00:00<00:00, 9015.34 examples/s]
Extracting prompt from train dataset: 100%|██████████| 1545/1545 [00:00<00:00, 9401.17 examples/s] |
|
Applying chat template to train dataset: 0%| | 0/1545 [00:00<?, ? examples/s]
Applying chat template to train dataset: 25%|██▍ | 379/1545 [00:00<00:00, 3744.19 examples/s]
Applying chat template to train dataset: 53%|█████▎ | 812/1545 [00:00<00:00, 4082.95 examples/s]
Applying chat template to train dataset: 92%|█████████▏| 1427/1545 [00:00<00:00, 4085.34 examples/s]
Applying chat template to train dataset: 100%|██████████| 1545/1545 [00:00<00:00, 4014.06 examples/s] |
|
Tokenizing train dataset: 0%| | 0/1545 [00:00<?, ? examples/s]
Tokenizing train dataset: 7%|▋ | 110/1545 [00:00<00:01, 1087.62 examples/s]
Tokenizing train dataset: 15%|█▌ | 232/1545 [00:00<00:01, 1153.16 examples/s]
Tokenizing train dataset: 23%|██▎ | 353/1545 [00:00<00:01, 1170.38 examples/s]
Tokenizing train dataset: 31%|███ | 472/1545 [00:00<00:00, 1172.74 examples/s]
Tokenizing train dataset: 41%|████▏ | 639/1545 [00:00<00:00, 1140.62 examples/s]
Tokenizing train dataset: 52%|█████▏ | 810/1545 [00:00<00:00, 1133.80 examples/s]
Tokenizing train dataset: 60%|██████ | 929/1545 [00:00<00:00, 1143.19 examples/s]
Tokenizing train dataset: 68%|██████▊ | 1051/1545 [00:00<00:00, 1161.64 examples/s]
Tokenizing train dataset: 76%|███████▌ | 1176/1545 [00:01<00:00, 1183.67 examples/s]
Tokenizing train dataset: 88%|████████▊ | 1354/1545 [00:01<00:00, 1182.12 examples/s]
Tokenizing train dataset: 99%|█████████▉| 1530/1545 [00:01<00:00, 1173.45 examples/s]
Tokenizing train dataset: 100%|██████████| 1545/1545 [00:01<00:00, 1158.17 examples/s] |
|
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter. |
|
0%| | 0/1545 [00:00<?, ?it/s]Could not estimate the number of tokens of the input, floating-point operations will not be computed |
|
0%| | 1/1545 [00:01<30:08, 1.17s/it]
0%| | 2/1545 [00:02<25:15, 1.02it/s]
0%| | 3/1545 [00:02<17:59, 1.43it/s]
0%| | 4/1545 [00:02<16:30, 1.56it/s]
0%| | 5/1545 [00:03<14:43, 1.74it/s]
0%| | 6/1545 [00:03<14:45, 1.74it/s]
0%| | 7/1545 [00:04<14:57, 1.71it/s]
1%| | 8/1545 [00:05<15:12, 1.68it/s]
1%| | 9/1545 [00:05<15:30, 1.65it/s]
1%| | 10/1545 [00:06<15:11, 1.68it/s]
{'loss': 2.7692, 'grad_norm': 18.5, 'learning_rate': 9.935275080906149e-06, 'rewards/chosen': -13.267723083496094, 'rewards/rejected': -12.376993179321289, 'rewards/accuracies': 0.5, 'rewards/margins': -0.8907286524772644, 'logps/chosen': -272.36419677734375, 'logps/rejected': -224.4181365966797, 'logits/chosen': -0.7734757661819458, 'logits/rejected': -0.8499571084976196, 'epoch': 0.01} |
|
1%| | 10/1545 [00:06<15:11, 1.68it/s]
1%| | 11/1545 [00:06<14:49, 1.72it/s]
1%| | 12/1545 [00:07<14:17, 1.79it/s]
1%| | 13/1545 [00:08<14:34, 1.75it/s]
1%| | 14/1545 [00:08<14:24, 1.77it/s]
1%| | 15/1545 [00:09<13:25, 1.90it/s]
1%| | 16/1545 [00:09<13:46, 1.85it/s]
1%| | 17/1545 [00:10<13:56, 1.83it/s]
1%| | 18/1545 [00:10<14:00, 1.82it/s]
1%| | 19/1545 [00:11<13:08, 1.94it/s]
1%|▏ | 20/1545 [00:11<13:36, 1.87it/s]
{'loss': 0.2762, 'grad_norm': 0.8828125, 'learning_rate': 9.870550161812299e-06, 'rewards/chosen': -16.860511779785156, 'rewards/rejected': -22.893171310424805, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 6.03265905380249, 'logps/chosen': -311.78521728515625, 'logps/rejected': -344.44610595703125, 'logits/chosen': -0.5504701137542725, 'logits/rejected': -0.5987938642501831, 'epoch': 0.01} |
|
1%|▏ | 20/1545 [00:11<13:36, 1.87it/s]
1%|▏ | 21/1545 [00:12<14:05, 1.80it/s]
1%|▏ | 22/1545 [00:12<13:28, 1.88it/s]
1%|▏ | 23/1545 [00:13<13:57, 1.82it/s]
2%|▏ | 24/1545 [00:14<14:16, 1.78it/s]
2%|▏ | 25/1545 [00:14<14:12, 1.78it/s]
2%|▏ | 26/1545 [00:15<13:29, 1.88it/s]
2%|▏ | 27/1545 [00:15<13:49, 1.83it/s]
2%|▏ | 28/1545 [00:16<13:54, 1.82it/s]
2%|▏ | 29/1545 [00:16<13:13, 1.91it/s]
2%|▏ | 30/1545 [00:17<13:19, 1.89it/s]
{'loss': 4.4466, 'grad_norm': 7.82310962677002e-08, 'learning_rate': 9.805825242718447e-06, 'rewards/chosen': -35.80329132080078, 'rewards/rejected': -39.130760192871094, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 3.3274664878845215, 'logps/chosen': -513.8514404296875, 'logps/rejected': -501.32061767578125, 'logits/chosen': -2.7659506797790527, 'logits/rejected': -3.2069015502929688, 'epoch': 0.02} |
|
2%|▏ | 30/1545 [00:17<13:19, 1.89it/s]
2%|▏ | 31/1545 [00:17<13:40, 1.84it/s]
2%|▏ | 32/1545 [00:18<13:52, 1.82it/s]
2%|▏ | 33/1545 [00:18<12:56, 1.95it/s]
2%|▏ | 34/1545 [00:19<11:55, 2.11it/s]
2%|▏ | 35/1545 [00:19<12:51, 1.96it/s]
2%|▏ | 36/1545 [00:20<13:24, 1.88it/s]
2%|▏ | 37/1545 [00:20<12:32, 2.00it/s]
2%|▏ | 38/1545 [00:21<13:11, 1.90it/s]
3%|▎ | 39/1545 [00:21<13:24, 1.87it/s]
3%|▎ | 40/1545 [00:22<13:16, 1.89it/s]
{'loss': 1.966, 'grad_norm': 1616.0, 'learning_rate': 9.741100323624596e-06, 'rewards/chosen': -40.694602966308594, 'rewards/rejected': -51.3997802734375, 'rewards/accuracies': 0.6000000238418579, 'rewards/margins': 10.705179214477539, 'logps/chosen': -561.7604370117188, 'logps/rejected': -636.379638671875, 'logits/chosen': -2.4068899154663086, 'logits/rejected': -2.463327407836914, 'epoch': 0.03} |
|
3%|▎ | 40/1545 [00:22<13:16, 1.89it/s]
3%|▎ | 41/1545 [00:22<13:18, 1.88it/s]
3%|▎ | 42/1545 [00:23<13:33, 1.85it/s]
3%|▎ | 43/1545 [00:23<12:27, 2.01it/s]
3%|▎ | 44/1545 [00:24<12:09, 2.06it/s]
3%|▎ | 45/1545 [00:24<12:41, 1.97it/s]
3%|▎ | 46/1545 [00:25<13:09, 1.90it/s]
3%|▎ | 47/1545 [00:26<13:23, 1.86it/s]
3%|▎ | 48/1545 [00:26<12:13, 2.04it/s]
3%|▎ | 49/1545 [00:27<12:59, 1.92it/s]
3%|▎ | 50/1545 [00:27<13:16, 1.88it/s]
{'loss': 2.376, 'grad_norm': 26.75, 'learning_rate': 9.676375404530746e-06, 'rewards/chosen': -23.04972267150879, 'rewards/rejected': -31.14634132385254, 'rewards/accuracies': 0.6000000238418579, 'rewards/margins': 8.09661865234375, 'logps/chosen': -366.2521057128906, 'logps/rejected': -414.253173828125, 'logits/chosen': -1.4296232461929321, 'logits/rejected': -1.88992440700531, 'epoch': 0.03} |
|
3%|▎ | 50/1545 [00:27<13:16, 1.88it/s]
3%|▎ | 51/1545 [00:28<13:31, 1.84it/s]
3%|▎ | 52/1545 [00:28<13:15, 1.88it/s]
3%|▎ | 53/1545 [00:29<13:37, 1.82it/s]
3%|▎ | 54/1545 [00:29<13:40, 1.82it/s]
4%|▎ | 55/1545 [00:30<12:53, 1.93it/s]
4%|▎ | 56/1545 [00:30<13:25, 1.85it/s]
4%|▎ | 57/1545 [00:31<13:30, 1.84it/s]
4%|▍ | 58/1545 [00:31<13:25, 1.85it/s]
4%|▍ | 59/1545 [00:32<13:04, 1.89it/s]
4%|▍ | 60/1545 [00:32<13:26, 1.84it/s]
{'loss': 3.5255, 'grad_norm': 1.2620790914089075e-19, 'learning_rate': 9.611650485436894e-06, 'rewards/chosen': -41.635231018066406, 'rewards/rejected': -57.55849075317383, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 15.923251152038574, 'logps/chosen': -561.2437744140625, 'logps/rejected': -689.3187255859375, 'logits/chosen': -3.666018009185791, 'logits/rejected': -3.4739983081817627, 'epoch': 0.04} |
|
4%|▍ | 60/1545 [00:33<13:26, 1.84it/s]
4%|▍ | 61/1545 [00:33<13:25, 1.84it/s]
4%|▍ | 62/1545 [00:33<12:48, 1.93it/s]
4%|▍ | 63/1545 [00:34<13:15, 1.86it/s]
4%|▍ | 64/1545 [00:36<21:48, 1.13it/s]
4%|▍ | 65/1545 [00:36<19:19, 1.28it/s]
4%|▍ | 66/1545 [00:37<17:18, 1.42it/s]
4%|▍ | 67/1545 [00:37<16:28, 1.50it/s]
4%|▍ | 68/1545 [00:38<15:40, 1.57it/s]
4%|▍ | 69/1545 [00:38<14:16, 1.72it/s]
5%|▍ | 70/1545 [00:39<14:19, 1.72it/s]
{'loss': 0.4333, 'grad_norm': 4000.0, 'learning_rate': 9.546925566343042e-06, 'rewards/chosen': -40.17411422729492, 'rewards/rejected': -62.64879608154297, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 22.47468376159668, 'logps/chosen': -535.5774536132812, 'logps/rejected': -740.8128662109375, 'logits/chosen': -3.367959976196289, 'logits/rejected': -3.1139683723449707, 'epoch': 0.05} |
|
5%|▍ | 70/1545 [00:39<14:19, 1.72it/s]
5%|▍ | 71/1545 [00:40<14:10, 1.73it/s]
5%|▍ | 72/1545 [00:40<13:52, 1.77it/s]
5%|▍ | 73/1545 [00:40<12:05, 2.03it/s]
5%|▍ | 74/1545 [00:41<12:40, 1.93it/s]
5%|▍ | 75/1545 [00:42<13:09, 1.86it/s]
5%|▍ | 76/1545 [00:42<13:07, 1.87it/s]
5%|▍ | 77/1545 [00:43<13:09, 1.86it/s]
5%|▌ | 78/1545 [00:43<13:23, 1.83it/s]
5%|▌ | 79/1545 [00:44<13:44, 1.78it/s]
5%|▌ | 80/1545 [00:44<12:42, 1.92it/s]
{'loss': 0.4326, 'grad_norm': 13824.0, 'learning_rate': 9.482200647249192e-06, 'rewards/chosen': -48.43840789794922, 'rewards/rejected': -76.73748779296875, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 28.299081802368164, 'logps/chosen': -626.9752807617188, 'logps/rejected': -873.7962646484375, 'logits/chosen': -2.483962297439575, 'logits/rejected': -2.444535493850708, 'epoch': 0.05} |
|
5%|▌ | 80/1545 [00:44<12:42, 1.92it/s]
5%|▌ | 81/1545 [00:45<13:20, 1.83it/s]
5%|▌ | 82/1545 [00:45<13:23, 1.82it/s]
5%|▌ | 83/1545 [00:46<13:11, 1.85it/s]
5%|▌ | 84/1545 [00:46<12:54, 1.89it/s]
6%|▌ | 85/1545 [00:47<13:07, 1.85it/s]
6%|▌ | 86/1545 [00:48<13:24, 1.81it/s]
6%|▌ | 87/1545 [00:48<12:57, 1.87it/s]
6%|▌ | 88/1545 [00:49<13:13, 1.84it/s]
6%|▌ | 89/1545 [00:49<13:20, 1.82it/s]
6%|▌ | 90/1545 [00:50<13:17, 1.82it/s]
{'loss': 2.8932, 'grad_norm': 0.001190185546875, 'learning_rate': 9.41747572815534e-06, 'rewards/chosen': -35.804256439208984, 'rewards/rejected': -56.7745361328125, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 20.97027015686035, 'logps/chosen': -504.11846923828125, 'logps/rejected': -688.4282836914062, 'logits/chosen': -1.8350107669830322, 'logits/rejected': -1.7690149545669556, 'epoch': 0.06} |
|
6%|▌ | 90/1545 [00:50<13:17, 1.82it/s]
6%|▌ | 91/1545 [00:50<13:03, 1.86it/s]
6%|▌ | 92/1545 [00:51<13:32, 1.79it/s]
6%|▌ | 93/1545 [00:51<13:47, 1.76it/s]
6%|▌ | 94/1545 [00:52<12:27, 1.94it/s]
6%|▌ | 95/1545 [00:52<11:45, 2.06it/s]
6%|▌ | 96/1545 [00:53<12:24, 1.95it/s]
6%|▋ | 97/1545 [00:53<12:46, 1.89it/s]
6%|▋ | 98/1545 [00:54<12:21, 1.95it/s]
6%|▋ | 99/1545 [00:54<12:47, 1.88it/s]
6%|▋ | 100/1545 [00:55<13:04, 1.84it/s]
{'loss': 1.493, 'grad_norm': 0.055908203125, 'learning_rate': 9.35275080906149e-06, 'rewards/chosen': -16.02591896057129, 'rewards/rejected': -22.183359146118164, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 6.157437801361084, 'logps/chosen': -340.04986572265625, 'logps/rejected': -332.9122314453125, 'logits/chosen': -1.376056432723999, 'logits/rejected': -1.7216821908950806, 'epoch': 0.06} |
|
6%|▋ | 100/1545 [00:55<13:04, 1.84it/s]
7%|▋ | 101/1545 [00:56<13:28, 1.79it/s]
7%|▋ | 102/1545 [00:56<12:21, 1.95it/s]
7%|▋ | 103/1545 [00:57<12:58, 1.85it/s]
7%|▋ | 104/1545 [00:57<13:06, 1.83it/s]
7%|▋ | 105/1545 [00:58<11:49, 2.03it/s]
7%|▋ | 106/1545 [00:58<11:20, 2.11it/s]
7%|▋ | 107/1545 [00:59<12:10, 1.97it/s]
7%|▋ | 108/1545 [00:59<11:12, 2.14it/s]
7%|▋ | 109/1545 [01:00<11:50, 2.02it/s]
7%|▋ | 110/1545 [01:00<11:33, 2.07it/s]
{'loss': 1.7504, 'grad_norm': 68.5, 'learning_rate': 9.288025889967638e-06, 'rewards/chosen': -17.503616333007812, 'rewards/rejected': -20.149761199951172, 'rewards/accuracies': 0.5, 'rewards/margins': 2.6461453437805176, 'logps/chosen': -324.0485534667969, 'logps/rejected': -315.428955078125, 'logits/chosen': -1.0650968551635742, 'logits/rejected': -1.4754924774169922, 'epoch': 0.07} |
|
7%|▋ | 110/1545 [01:00<11:33, 2.07it/s]
7%|▋ | 111/1545 [01:01<12:15, 1.95it/s]
7%|▋ | 112/1545 [01:01<12:51, 1.86it/s]
7%|▋ | 113/1545 [01:02<12:17, 1.94it/s]
7%|▋ | 114/1545 [01:02<11:17, 2.11it/s]
7%|▋ | 115/1545 [01:03<12:05, 1.97it/s]
8%|▊ | 116/1545 [01:03<12:24, 1.92it/s]
8%|▊ | 117/1545 [01:03<11:06, 2.14it/s]
8%|▊ | 118/1545 [01:04<11:36, 2.05it/s]
8%|▊ | 119/1545 [01:05<12:12, 1.95it/s]
8%|▊ | 120/1545 [01:05<11:08, 2.13it/s]
{'loss': 0.0812, 'grad_norm': 322.0, 'learning_rate': 9.223300970873788e-06, 'rewards/chosen': -22.525089263916016, 'rewards/rejected': -34.67645263671875, 'rewards/accuracies': 1.0, 'rewards/margins': 12.151366233825684, 'logps/chosen': -371.64007568359375, 'logps/rejected': -470.7439880371094, 'logits/chosen': -3.351106643676758, 'logits/rejected': -3.953688144683838, 'epoch': 0.08} |
|
8%|▊ | 120/1545 [01:05<11:08, 2.13it/s]
8%|▊ | 121/1545 [01:05<10:27, 2.27it/s]
8%|▊ | 122/1545 [01:06<10:18, 2.30it/s]
8%|▊ | 123/1545 [01:06<11:14, 2.11it/s]
8%|▊ | 124/1545 [01:07<11:40, 2.03it/s]
8%|▊ | 125/1545 [01:07<12:05, 1.96it/s]
8%|▊ | 126/1545 [01:08<11:53, 1.99it/s]
8%|▊ | 127/1545 [01:08<12:14, 1.93it/s]
8%|▊ | 128/1545 [01:09<11:14, 2.10it/s]
8%|▊ | 129/1545 [01:09<11:30, 2.05it/s]
8%|▊ | 130/1545 [01:10<11:49, 1.99it/s]
{'loss': 8.072, 'grad_norm': 7.534027099609375e-05, 'learning_rate': 9.158576051779936e-06, 'rewards/chosen': -28.951553344726562, 'rewards/rejected': -31.89861488342285, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 2.9470643997192383, 'logps/chosen': -441.2911682128906, 'logps/rejected': -428.54656982421875, 'logits/chosen': -3.3049476146698, 'logits/rejected': -3.9077095985412598, 'epoch': 0.08} |
|
8%|▊ | 130/1545 [01:10<11:49, 1.99it/s]
8%|▊ | 131/1545 [01:10<12:16, 1.92it/s]
9%|▊ | 132/1545 [01:11<12:20, 1.91it/s]
9%|▊ | 133/1545 [01:11<11:28, 2.05it/s]
9%|▊ | 134/1545 [01:12<12:02, 1.95it/s]
9%|▊ | 135/1545 [01:13<12:24, 1.89it/s]
9%|▉ | 136/1545 [01:13<12:28, 1.88it/s]
9%|▉ | 137/1545 [01:13<11:41, 2.01it/s]
9%|▉ | 138/1545 [01:14<12:11, 1.92it/s]
9%|▉ | 139/1545 [01:15<12:24, 1.89it/s]
9%|▉ | 140/1545 [01:15<12:20, 1.90it/s]
{'loss': 0.9608, 'grad_norm': 2.086162567138672e-07, 'learning_rate': 9.093851132686085e-06, 'rewards/chosen': -19.53597640991211, 'rewards/rejected': -32.082984924316406, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 12.547004699707031, 'logps/chosen': -317.4342956542969, 'logps/rejected': -419.8863220214844, 'logits/chosen': -2.860546112060547, 'logits/rejected': -3.551792621612549, 'epoch': 0.09} |
|
9%|▉ | 140/1545 [01:15<12:20, 1.90it/s]
9%|▉ | 141/1545 [01:15<11:08, 2.10it/s]
9%|▉ | 142/1545 [01:16<11:58, 1.95it/s]
9%|▉ | 143/1545 [01:17<12:16, 1.90it/s]
9%|▉ | 144/1545 [01:17<12:07, 1.93it/s]
9%|▉ | 145/1545 [01:18<12:27, 1.87it/s]
9%|▉ | 146/1545 [01:18<12:40, 1.84it/s]
10%|▉ | 147/1545 [01:19<12:40, 1.84it/s]
10%|▉ | 148/1545 [01:19<11:53, 1.96it/s]
10%|▉ | 149/1545 [01:20<12:18, 1.89it/s]
10%|▉ | 150/1545 [01:20<12:33, 1.85it/s]
{'loss': 1.0182, 'grad_norm': 18.5, 'learning_rate': 9.029126213592233e-06, 'rewards/chosen': -17.807960510253906, 'rewards/rejected': -36.81556701660156, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 19.00760269165039, 'logps/chosen': -303.423095703125, 'logps/rejected': -467.78778076171875, 'logits/chosen': -3.5883376598358154, 'logits/rejected': -4.601079940795898, 'epoch': 0.1} |
|
10%|▉ | 150/1545 [01:20<12:33, 1.85it/s]
10%|▉ | 151/1545 [01:21<12:15, 1.89it/s]
10%|▉ | 152/1545 [01:21<12:18, 1.89it/s]
10%|▉ | 153/1545 [01:22<12:30, 1.85it/s]
10%|▉ | 154/1545 [01:23<12:41, 1.83it/s]
10%|█ | 155/1545 [01:23<12:00, 1.93it/s]
10%|█ | 156/1545 [01:24<12:31, 1.85it/s]
10%|█ | 157/1545 [01:24<11:22, 2.03it/s]
10%|█ | 158/1545 [01:25<11:47, 1.96it/s]
10%|█ | 159/1545 [01:25<11:12, 2.06it/s]
10%|█ | 160/1545 [01:26<11:57, 1.93it/s]
{'loss': 0.0122, 'grad_norm': 6.184563972055912e-10, 'learning_rate': 8.964401294498383e-06, 'rewards/chosen': -19.044435501098633, 'rewards/rejected': -41.322547912597656, 'rewards/accuracies': 1.0, 'rewards/margins': 22.278114318847656, 'logps/chosen': -330.6968994140625, 'logps/rejected': -523.6119384765625, 'logits/chosen': -3.189589023590088, 'logits/rejected': -4.832894325256348, 'epoch': 0.1} |
|
10%|█ | 160/1545 [01:26<11:57, 1.93it/s]
10%|█ | 161/1545 [01:26<12:13, 1.89it/s]
10%|█ | 162/1545 [01:27<12:12, 1.89it/s]
11%|█ | 163/1545 [01:27<12:04, 1.91it/s]
11%|█ | 164/1545 [01:28<12:25, 1.85it/s]
11%|█ | 165/1545 [01:28<12:29, 1.84it/s]
11%|█ | 166/1545 [01:29<11:56, 1.92it/s]
11%|█ | 167/1545 [01:29<12:14, 1.88it/s]
11%|█ | 168/1545 [01:30<12:20, 1.86it/s]
11%|█ | 169/1545 [01:30<12:19, 1.86it/s]
11%|█ | 170/1545 [01:31<12:00, 1.91it/s]
{'loss': 1.4695, 'grad_norm': 8.216219588366713e-20, 'learning_rate': 8.899676375404531e-06, 'rewards/chosen': -46.257057189941406, 'rewards/rejected': -82.32180786132812, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 36.06475067138672, 'logps/chosen': -610.6324462890625, 'logps/rejected': -940.2429809570312, 'logits/chosen': -4.580934047698975, 'logits/rejected': -5.433409214019775, 'epoch': 0.11} |
|
11%|█ | 170/1545 [01:31<12:00, 1.91it/s]
11%|█ | 171/1545 [01:31<12:26, 1.84it/s]
11%|█ | 172/1545 [01:32<12:13, 1.87it/s]
11%|█ | 173/1545 [01:32<11:43, 1.95it/s]
11%|█▏ | 174/1545 [01:33<12:17, 1.86it/s]
11%|█▏ | 175/1545 [01:34<12:30, 1.82it/s]
11%|█▏ | 176/1545 [01:34<12:26, 1.83it/s]
11%|█▏ | 177/1545 [01:34<10:43, 2.13it/s]
12%|█▏ | 178/1545 [01:35<11:25, 1.99it/s]
12%|█▏ | 179/1545 [01:36<11:43, 1.94it/s]
12%|█▏ | 180/1545 [01:37<16:58, 1.34it/s]
{'loss': 1.1198, 'grad_norm': 0.03271484375, 'learning_rate': 8.834951456310681e-06, 'rewards/chosen': -24.422029495239258, 'rewards/rejected': -46.21000289916992, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 21.787975311279297, 'logps/chosen': -383.09124755859375, 'logps/rejected': -577.4222412109375, 'logits/chosen': -4.082339286804199, 'logits/rejected': -5.3761396408081055, 'epoch': 0.12} |
|
12%|█▏ | 180/1545 [01:37<16:58, 1.34it/s]
12%|█▏ | 181/1545 [01:37<15:05, 1.51it/s]
12%|█▏ | 182/1545 [01:38<14:34, 1.56it/s]
12%|█▏ | 183/1545 [01:38<12:40, 1.79it/s]
12%|█▏ | 184/1545 [01:39<12:34, 1.80it/s]
12%|█▏ | 185/1545 [01:39<12:23, 1.83it/s]
12%|█▏ | 186/1545 [01:40<12:33, 1.80it/s]
12%|█▏ | 187/1545 [01:40<12:35, 1.80it/s]
12%|█▏ | 188/1545 [01:41<11:44, 1.93it/s]
12%|█▏ | 189/1545 [01:41<12:01, 1.88it/s]
12%|█▏ | 190/1545 [01:42<12:12, 1.85it/s]
{'loss': 0.0704, 'grad_norm': 0.00084686279296875, 'learning_rate': 8.770226537216829e-06, 'rewards/chosen': -19.553829193115234, 'rewards/rejected': -37.149200439453125, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 17.59537124633789, 'logps/chosen': -342.45489501953125, 'logps/rejected': -478.92120361328125, 'logits/chosen': -3.0818886756896973, 'logits/rejected': -4.275086402893066, 'epoch': 0.12} |
|
12%|█▏ | 190/1545 [01:42<12:12, 1.85it/s]
12%|█▏ | 191/1545 [01:43<12:16, 1.84it/s]
12%|█▏ | 192/1545 [01:43<11:25, 1.97it/s]
12%|█▏ | 193/1545 [01:44<11:49, 1.90it/s]
13%|█▎ | 194/1545 [01:44<12:07, 1.86it/s]
13%|█▎ | 195/1545 [01:44<10:58, 2.05it/s]
13%|█▎ | 196/1545 [01:45<10:29, 2.14it/s]
13%|█▎ | 197/1545 [01:45<11:07, 2.02it/s]
13%|█▎ | 198/1545 [01:46<11:23, 1.97it/s]
13%|█▎ | 199/1545 [01:47<11:34, 1.94it/s]
13%|█▎ | 200/1545 [01:47<11:37, 1.93it/s]
{'loss': 0.012, 'grad_norm': 6.0625, 'learning_rate': 8.705501618122979e-06, 'rewards/chosen': -33.0775146484375, 'rewards/rejected': -53.18238067626953, 'rewards/accuracies': 1.0, 'rewards/margins': 20.104867935180664, 'logps/chosen': -470.367919921875, 'logps/rejected': -644.5907592773438, 'logits/chosen': -3.6899101734161377, 'logits/rejected': -5.173361778259277, 'epoch': 0.13} |
|
13%|█▎ | 200/1545 [01:47<11:37, 1.93it/s]
13%|█▎ | 201/1545 [01:48<12:10, 1.84it/s]
13%|█▎ | 202/1545 [01:48<12:21, 1.81it/s]
13%|█▎ | 203/1545 [01:49<11:27, 1.95it/s]
13%|█▎ | 204/1545 [01:49<11:50, 1.89it/s]
13%|█▎ | 205/1545 [01:50<12:17, 1.82it/s]
13%|█▎ | 206/1545 [01:50<12:13, 1.83it/s]
13%|█▎ | 207/1545 [01:51<12:05, 1.84it/s]
13%|█▎ | 208/1545 [01:51<12:23, 1.80it/s]
14%|█▎ | 209/1545 [01:52<12:19, 1.81it/s]
14%|█▎ | 210/1545 [01:53<11:46, 1.89it/s]
{'loss': 0.1499, 'grad_norm': 0.0, 'learning_rate': 8.640776699029127e-06, 'rewards/chosen': -35.89064025878906, 'rewards/rejected': -85.90962219238281, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 50.01898956298828, 'logps/chosen': -512.2362670898438, 'logps/rejected': -987.8312377929688, 'logits/chosen': -3.8464527130126953, 'logits/rejected': -5.528945446014404, 'epoch': 0.14} |
|
14%|█▎ | 210/1545 [01:53<11:46, 1.89it/s]
14%|█▎ | 211/1545 [01:53<12:16, 1.81it/s]
14%|█▎ | 212/1545 [01:54<12:23, 1.79it/s]
14%|█▍ | 213/1545 [01:54<12:12, 1.82it/s]
14%|█▍ | 214/1545 [01:55<11:54, 1.86it/s]
14%|█▍ | 215/1545 [01:55<12:10, 1.82it/s]
14%|█▍ | 216/1545 [01:56<12:09, 1.82it/s]
14%|█▍ | 217/1545 [01:56<11:13, 1.97it/s]
14%|█▍ | 218/1545 [01:57<11:42, 1.89it/s]
14%|█▍ | 219/1545 [01:57<11:54, 1.86it/s]
14%|█▍ | 220/1545 [01:58<11:52, 1.86it/s]
{'loss': 0.6222, 'grad_norm': 720.0, 'learning_rate': 8.576051779935276e-06, 'rewards/chosen': -39.93156814575195, 'rewards/rejected': -58.58339309692383, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 18.65182876586914, 'logps/chosen': -554.5364990234375, 'logps/rejected': -703.7771606445312, 'logits/chosen': -3.3229453563690186, 'logits/rejected': -4.664990425109863, 'epoch': 0.14} |
|
14%|█▍ | 220/1545 [01:58<11:52, 1.86it/s]
14%|█▍ | 221/1545 [01:58<11:11, 1.97it/s]
14%|█▍ | 222/1545 [01:59<11:35, 1.90it/s]
14%|█▍ | 223/1545 [02:00<11:46, 1.87it/s]
14%|█▍ | 224/1545 [02:00<10:40, 2.06it/s]
15%|█▍ | 225/1545 [02:00<10:18, 2.13it/s]
15%|█▍ | 226/1545 [02:01<11:03, 1.99it/s]
15%|█▍ | 227/1545 [02:01<10:09, 2.16it/s]
15%|█▍ | 228/1545 [02:02<10:43, 2.05it/s]
15%|█▍ | 229/1545 [02:02<09:23, 2.33it/s]
15%|█▍ | 230/1545 [02:03<10:25, 2.10it/s]
{'loss': 0.0693, 'grad_norm': 0.0, 'learning_rate': 8.511326860841424e-06, 'rewards/chosen': -45.17763900756836, 'rewards/rejected': -84.54122924804688, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 39.363590240478516, 'logps/chosen': -602.5833740234375, 'logps/rejected': -968.5574951171875, 'logits/chosen': -3.7494091987609863, 'logits/rejected': -5.287755966186523, 'epoch': 0.15} |
|
15%|█▍ | 230/1545 [02:03<10:25, 2.10it/s]
15%|█▍ | 231/1545 [02:03<11:00, 1.99it/s]
15%|█▌ | 232/1545 [02:04<11:10, 1.96it/s]
15%|█▌ | 233/1545 [02:04<11:03, 1.98it/s]
15%|█▌ | 234/1545 [02:05<11:36, 1.88it/s]
15%|█▌ | 235/1545 [02:05<11:29, 1.90it/s]
15%|█▌ | 236/1545 [02:06<11:04, 1.97it/s]
15%|█▌ | 237/1545 [02:06<11:44, 1.86it/s]
15%|█▌ | 238/1545 [02:07<11:56, 1.82it/s]
15%|█▌ | 239/1545 [02:08<11:52, 1.83it/s]
16%|█▌ | 240/1545 [02:08<10:52, 2.00it/s]
{'loss': 0.542, 'grad_norm': 334.0, 'learning_rate': 8.446601941747573e-06, 'rewards/chosen': -34.770347595214844, 'rewards/rejected': -57.98395919799805, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 23.213611602783203, 'logps/chosen': -471.40185546875, 'logps/rejected': -686.8424682617188, 'logits/chosen': -3.1056551933288574, 'logits/rejected': -4.154791355133057, 'epoch': 0.16} |
|
16%|█▌ | 240/1545 [02:08<10:52, 2.00it/s]
16%|█▌ | 241/1545 [02:09<11:23, 1.91it/s]
16%|█▌ | 242/1545 [02:09<11:28, 1.89it/s]
16%|█▌ | 243/1545 [02:10<11:26, 1.90it/s]
16%|█▌ | 244/1545 [02:10<11:16, 1.92it/s]
16%|█▌ | 245/1545 [02:11<11:35, 1.87it/s]
16%|█▌ | 246/1545 [02:11<11:44, 1.84it/s]
16%|█▌ | 247/1545 [02:12<11:09, 1.94it/s]
16%|█▌ | 248/1545 [02:12<11:29, 1.88it/s]
16%|█▌ | 249/1545 [02:13<11:42, 1.85it/s]
16%|█▌ | 250/1545 [02:13<11:35, 1.86it/s]
{'loss': 0.0827, 'grad_norm': 0.00689697265625, 'learning_rate': 8.381877022653722e-06, 'rewards/chosen': -13.427328109741211, 'rewards/rejected': -41.39923095703125, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 27.97190284729004, 'logps/chosen': -279.3583068847656, 'logps/rejected': -531.146484375, 'logits/chosen': -1.5291231870651245, 'logits/rejected': -3.7414169311523438, 'epoch': 0.16} |
|
16%|█▌ | 250/1545 [02:13<11:35, 1.86it/s]
16%|█▌ | 251/1545 [02:14<11:04, 1.95it/s]
16%|█▋ | 252/1545 [02:14<11:31, 1.87it/s]
16%|█▋ | 253/1545 [02:15<10:22, 2.08it/s]
16%|█▋ | 254/1545 [02:15<10:44, 2.00it/s]
17%|█▋ | 255/1545 [02:16<10:06, 2.13it/s]
17%|█▋ | 256/1545 [02:16<10:46, 2.00it/s]
17%|█▋ | 257/1545 [02:17<11:12, 1.92it/s]
17%|█▋ | 258/1545 [02:17<11:10, 1.92it/s]
17%|█▋ | 259/1545 [02:18<10:05, 2.12it/s]
17%|█▋ | 260/1545 [02:18<10:55, 1.96it/s]
{'loss': 0.4419, 'grad_norm': 2.9558577807620168e-12, 'learning_rate': 8.317152103559872e-06, 'rewards/chosen': -17.75248146057129, 'rewards/rejected': -30.9813175201416, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 13.228837966918945, 'logps/chosen': -318.49920654296875, 'logps/rejected': -418.8551330566406, 'logits/chosen': -2.046048164367676, 'logits/rejected': -2.655272960662842, 'epoch': 0.17} |
|
17%|█▋ | 260/1545 [02:18<10:55, 1.96it/s]
17%|█▋ | 261/1545 [02:19<11:08, 1.92it/s]
17%|█▋ | 262/1545 [02:19<10:45, 1.99it/s]
17%|█▋ | 263/1545 [02:20<11:14, 1.90it/s]
17%|█▋ | 264/1545 [02:20<11:31, 1.85it/s]
17%|█▋ | 265/1545 [02:21<11:34, 1.84it/s]
17%|█▋ | 266/1545 [02:21<10:52, 1.96it/s]
17%|█▋ | 267/1545 [02:22<11:28, 1.86it/s]
17%|█▋ | 268/1545 [02:23<11:40, 1.82it/s]
17%|█▋ | 269/1545 [02:23<11:10, 1.90it/s]
17%|█▋ | 270/1545 [02:24<11:41, 1.82it/s]
{'loss': 0.0115, 'grad_norm': 4.363059997558594e-05, 'learning_rate': 8.25242718446602e-06, 'rewards/chosen': -12.999127388000488, 'rewards/rejected': -31.61448097229004, 'rewards/accuracies': 1.0, 'rewards/margins': 18.61534881591797, 'logps/chosen': -296.9534606933594, 'logps/rejected': -435.48883056640625, 'logits/chosen': -1.737764596939087, 'logits/rejected': -3.4163742065429688, 'epoch': 0.17} |
|
17%|█▋ | 270/1545 [02:24<11:41, 1.82it/s]
18%|█▊ | 271/1545 [02:24<12:05, 1.76it/s]
18%|█▊ | 272/1545 [02:25<11:49, 1.79it/s]
18%|█▊ | 273/1545 [02:25<11:08, 1.90it/s]
18%|█▊ | 274/1545 [02:26<11:28, 1.85it/s]
18%|█▊ | 275/1545 [02:26<10:19, 2.05it/s]
18%|█▊ | 276/1545 [02:27<10:39, 1.99it/s]
18%|█▊ | 277/1545 [02:27<10:04, 2.10it/s]
18%|█▊ | 278/1545 [02:28<10:43, 1.97it/s]
18%|█▊ | 279/1545 [02:28<10:56, 1.93it/s]
18%|█▊ | 280/1545 [02:29<10:56, 1.93it/s]
{'loss': 0.0759, 'grad_norm': 1.4028046280145645e-08, 'learning_rate': 8.18770226537217e-06, 'rewards/chosen': -28.01290512084961, 'rewards/rejected': -43.24602508544922, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 15.233118057250977, 'logps/chosen': -442.2579040527344, 'logps/rejected': -560.4046630859375, 'logits/chosen': -2.9247617721557617, 'logits/rejected': -3.853868007659912, 'epoch': 0.18} |
|
18%|█▊ | 280/1545 [02:29<10:56, 1.93it/s]
18%|█▊ | 281/1545 [02:29<09:50, 2.14it/s]
18%|█▊ | 282/1545 [02:30<10:36, 1.98it/s]
18%|█▊ | 283/1545 [02:30<11:00, 1.91it/s]
18%|█▊ | 284/1545 [02:31<10:33, 1.99it/s]
18%|█▊ | 285/1545 [02:31<11:11, 1.88it/s]
19%|█▊ | 286/1545 [02:32<11:30, 1.82it/s]
19%|█▊ | 287/1545 [02:33<11:30, 1.82it/s]
19%|█▊ | 288/1545 [02:33<10:37, 1.97it/s]
19%|█▊ | 289/1545 [02:34<11:15, 1.86it/s]
19%|█▉ | 290/1545 [02:34<11:23, 1.84it/s]
{'loss': 3.3436, 'grad_norm': 5.0090140368830305e-17, 'learning_rate': 8.122977346278318e-06, 'rewards/chosen': -25.31455421447754, 'rewards/rejected': -46.47494888305664, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 21.16039276123047, 'logps/chosen': -403.2884216308594, 'logps/rejected': -592.1914672851562, 'logits/chosen': -2.8424019813537598, 'logits/rejected': -3.816692352294922, 'epoch': 0.19} |
|
19%|█▉ | 290/1545 [02:34<11:23, 1.84it/s]
19%|█▉ | 291/1545 [02:35<11:07, 1.88it/s]
19%|█▉ | 292/1545 [02:35<11:27, 1.82it/s]
19%|█▉ | 293/1545 [02:36<11:42, 1.78it/s]
19%|█▉ | 294/1545 [02:36<11:37, 1.79it/s]
19%|█▉ | 295/1545 [02:37<10:51, 1.92it/s]
19%|█▉ | 296/1545 [02:37<11:12, 1.86it/s]
19%|█▉ | 297/1545 [02:39<16:07, 1.29it/s]
19%|█▉ | 298/1545 [02:39<14:14, 1.46it/s]
19%|█▉ | 299/1545 [02:40<13:15, 1.57it/s]
19%|█▉ | 300/1545 [02:40<12:50, 1.62it/s]
{'loss': 0.0003, 'grad_norm': 3.92901711165905e-09, 'learning_rate': 8.058252427184466e-06, 'rewards/chosen': -10.483701705932617, 'rewards/rejected': -42.29178237915039, 'rewards/accuracies': 1.0, 'rewards/margins': 31.808080673217773, 'logps/chosen': -243.2613983154297, 'logps/rejected': -532.6328735351562, 'logits/chosen': -1.451924443244934, 'logits/rejected': -4.144944190979004, 'epoch': 0.19} |
|
19%|█▉ | 300/1545 [02:40<12:50, 1.62it/s]
19%|█▉ | 301/1545 [02:41<12:28, 1.66it/s]
20%|█▉ | 302/1545 [02:41<11:27, 1.81it/s]
20%|█▉ | 303/1545 [02:42<11:37, 1.78it/s]
20%|█▉ | 304/1545 [02:42<11:35, 1.78it/s]
20%|█▉ | 305/1545 [02:43<11:32, 1.79it/s]
20%|█▉ | 306/1545 [02:44<11:30, 1.80it/s]
20%|█▉ | 307/1545 [02:44<11:37, 1.78it/s]
20%|█▉ | 308/1545 [02:45<11:41, 1.76it/s]
20%|██ | 309/1545 [02:45<10:48, 1.91it/s]
20%|██ | 310/1545 [02:46<11:14, 1.83it/s]
{'loss': 0.4085, 'grad_norm': 520.0, 'learning_rate': 7.993527508090616e-06, 'rewards/chosen': -10.010625839233398, 'rewards/rejected': -25.10512924194336, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 15.094502449035645, 'logps/chosen': -236.5189971923828, 'logps/rejected': -370.3948059082031, 'logits/chosen': -2.0673410892486572, 'logits/rejected': -3.2709903717041016, 'epoch': 0.2} |
|
20%|ββ | 311/1545 [02:46<11:20, 1.81it/s]
20%|ββ | 312/1545 [02:47<11:16, 1.82it/s]
20%|ββ | 313/1545 [02:47<10:59, 1.87it/s]
20%|ββ | 314/1545 [02:48<11:09, 1.84it/s]
20%|ββ | 315/1545 [02:48<11:15, 1.82it/s]
20%|ββ | 316/1545 [02:49<10:30, 1.95it/s]
21%|ββ | 317/1545 [02:49<11:07, 1.84it/s]
21%|ββ | 318/1545 [02:50<11:17, 1.81it/s]
21%|ββ | 319/1545 [02:51<11:14, 1.82it/s]
21%|ββ | 320/1545 [02:51<11:03, 1.85it/s]
{'loss': 0.0186, 'grad_norm': 348.0, 'learning_rate': 7.928802588996765e-06, 'rewards/chosen': -7.756754398345947, 'rewards/rejected': -28.2783260345459, 'rewards/accuracies': 1.0, 'rewards/margins': 20.52157211303711, 'logps/chosen': -252.49423217773438, 'logps/rejected': -395.2229919433594, 'logits/chosen': -1.2267696857452393, 'logits/rejected': -2.0097601413726807, 'epoch': 0.21} |
|
21%|ββ | 321/1545 [02:52<11:35, 1.76it/s]
21%|ββ | 322/1545 [02:52<10:21, 1.97it/s]
21%|ββ | 323/1545 [02:53<10:10, 2.00it/s]
21%|ββ | 324/1545 [02:53<10:31, 1.93it/s]
21%|ββ | 325/1545 [02:54<10:45, 1.89it/s]
21%|ββ | 326/1545 [02:54<11:08, 1.82it/s]
21%|ββ | 327/1545 [02:55<10:09, 2.00it/s]
21%|ββ | 328/1545 [02:55<10:38, 1.91it/s]
21%|βββ | 329/1545 [02:56<10:51, 1.87it/s]
21%|βββ | 330/1545 [02:56<10:48, 1.87it/s]
{'loss': 0.6079, 'grad_norm': 3.0184188481996443e-16, 'learning_rate': 7.864077669902913e-06, 'rewards/chosen': -10.482062339782715, 'rewards/rejected': -27.2606201171875, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 16.7785587310791, 'logps/chosen': -251.0252227783203, 'logps/rejected': -369.49072265625, 'logits/chosen': -1.2209112644195557, 'logits/rejected': -2.5535261631011963, 'epoch': 0.21} |
|
21%|βββ | 331/1545 [02:57<10:39, 1.90it/s]
21%|βββ | 332/1545 [02:58<11:21, 1.78it/s]
22%|βββ | 333/1545 [02:58<11:12, 1.80it/s]
22%|βββ | 334/1545 [02:58<10:30, 1.92it/s]
22%|βββ | 335/1545 [02:59<10:59, 1.83it/s]
22%|βββ | 336/1545 [03:00<11:08, 1.81it/s]
22%|βββ | 337/1545 [03:00<11:01, 1.82it/s]
22%|βββ | 338/1545 [03:01<10:47, 1.86it/s]
22%|βββ | 339/1545 [03:01<11:00, 1.83it/s]
22%|βββ | 340/1545 [03:02<11:12, 1.79it/s]
{'loss': 0.6174, 'grad_norm': 2.7755575615628914e-17, 'learning_rate': 7.799352750809061e-06, 'rewards/chosen': -15.824236869812012, 'rewards/rejected': -32.17787170410156, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 16.35363006591797, 'logps/chosen': -278.61468505859375, 'logps/rejected': -427.072021484375, 'logits/chosen': -2.014265537261963, 'logits/rejected': -2.9095723628997803, 'epoch': 0.22} |
|
22%|βββ | 341/1545 [03:02<10:38, 1.89it/s]
22%|βββ | 342/1545 [03:03<11:00, 1.82it/s]
22%|βββ | 343/1545 [03:03<09:57, 2.01it/s]
22%|βββ | 344/1545 [03:04<10:23, 1.93it/s]
22%|βββ | 345/1545 [03:04<09:48, 2.04it/s]
22%|βββ | 346/1545 [03:05<10:28, 1.91it/s]
22%|βββ | 347/1545 [03:05<10:40, 1.87it/s]
23%|βββ | 348/1545 [03:06<10:37, 1.88it/s]
23%|βββ | 349/1545 [03:07<10:33, 1.89it/s]
23%|βββ | 350/1545 [03:07<10:42, 1.86it/s]
{'loss': 0.0, 'grad_norm': 4.607859233063394e-19, 'learning_rate': 7.734627831715211e-06, 'rewards/chosen': -10.83470344543457, 'rewards/rejected': -41.98002624511719, 'rewards/accuracies': 1.0, 'rewards/margins': 31.145328521728516, 'logps/chosen': -240.7436065673828, 'logps/rejected': -537.6058349609375, 'logits/chosen': -1.7035901546478271, 'logits/rejected': -3.3107447624206543, 'epoch': 0.23} |
|
23%|βββ | 351/1545 [03:08<10:46, 1.85it/s]
23%|βββ | 352/1545 [03:08<09:58, 1.99it/s]
23%|βββ | 353/1545 [03:09<10:20, 1.92it/s]
23%|βββ | 354/1545 [03:09<10:31, 1.89it/s]
23%|βββ | 355/1545 [03:10<10:33, 1.88it/s]
23%|βββ | 356/1545 [03:10<10:00, 1.98it/s]
23%|βββ | 357/1545 [03:11<10:30, 1.88it/s]
23%|βββ | 358/1545 [03:11<09:28, 2.09it/s]
23%|βββ | 359/1545 [03:12<09:52, 2.00it/s]
23%|βββ | 360/1545 [03:12<09:29, 2.08it/s]
{'loss': 4.1093, 'grad_norm': 7.44648787076585e-12, 'learning_rate': 7.66990291262136e-06, 'rewards/chosen': -18.4683837890625, 'rewards/rejected': -34.16698455810547, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 15.698600769042969, 'logps/chosen': -337.5670166015625, 'logps/rejected': -447.79180908203125, 'logits/chosen': -2.0399553775787354, 'logits/rejected': -3.212721347808838, 'epoch': 0.23} |
|
23%|βββ | 361/1545 [03:13<10:10, 1.94it/s]
23%|βββ | 362/1545 [03:13<10:25, 1.89it/s]
23%|βββ | 363/1545 [03:14<10:11, 1.93it/s]
24%|βββ | 364/1545 [03:14<10:17, 1.91it/s]
24%|βββ | 365/1545 [03:15<10:34, 1.86it/s]
24%|βββ | 366/1545 [03:15<10:50, 1.81it/s]
24%|βββ | 367/1545 [03:16<10:09, 1.93it/s]
24%|βββ | 368/1545 [03:16<10:43, 1.83it/s]
24%|βββ | 369/1545 [03:17<10:49, 1.81it/s]
24%|βββ | 370/1545 [03:18<10:31, 1.86it/s]
{'loss': 1.606, 'grad_norm': 0.011474609375, 'learning_rate': 7.605177993527508e-06, 'rewards/chosen': -15.731010437011719, 'rewards/rejected': -35.59105682373047, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 19.86004638671875, 'logps/chosen': -323.3865966796875, 'logps/rejected': -465.696044921875, 'logits/chosen': -2.1124587059020996, 'logits/rejected': -3.7612509727478027, 'epoch': 0.24} |
|
24%|βββ | 371/1545 [03:18<10:42, 1.83it/s]
24%|βββ | 372/1545 [03:19<10:48, 1.81it/s]
24%|βββ | 373/1545 [03:19<10:53, 1.79it/s]
24%|βββ | 374/1545 [03:20<10:04, 1.94it/s]
24%|βββ | 375/1545 [03:20<10:27, 1.86it/s]
24%|βββ | 376/1545 [03:21<10:43, 1.82it/s]
24%|βββ | 377/1545 [03:21<10:30, 1.85it/s]
24%|βββ | 378/1545 [03:22<10:32, 1.85it/s]
25%|βββ | 379/1545 [03:22<10:46, 1.80it/s]
25%|βββ | 380/1545 [03:23<09:41, 2.00it/s]
{'loss': 0.0008, 'grad_norm': 2.7466739993542433e-10, 'learning_rate': 7.540453074433658e-06, 'rewards/chosen': -12.65916919708252, 'rewards/rejected': -49.6566047668457, 'rewards/accuracies': 1.0, 'rewards/margins': 36.99742889404297, 'logps/chosen': -260.13958740234375, 'logps/rejected': -610.1561279296875, 'logits/chosen': -1.6531693935394287, 'logits/rejected': -4.022229194641113, 'epoch': 0.25} |
|
25%|βββ | 381/1545 [03:23<09:32, 2.03it/s]
25%|βββ | 382/1545 [03:24<10:10, 1.90it/s]
25%|βββ | 383/1545 [03:24<10:24, 1.86it/s]
25%|βββ | 384/1545 [03:25<10:29, 1.85it/s]
25%|βββ | 385/1545 [03:25<09:47, 1.97it/s]
25%|βββ | 386/1545 [03:26<09:02, 2.14it/s]
25%|βββ | 387/1545 [03:26<09:38, 2.00it/s]
25%|βββ | 388/1545 [03:27<09:53, 1.95it/s]
25%|βββ | 389/1545 [03:27<09:30, 2.03it/s]
25%|βββ | 390/1545 [03:28<08:51, 2.17it/s]
{'loss': 0.0693, 'grad_norm': 0.205078125, 'learning_rate': 7.475728155339807e-06, 'rewards/chosen': -22.143047332763672, 'rewards/rejected': -71.9369888305664, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 49.793941497802734, 'logps/chosen': -388.7550964355469, 'logps/rejected': -835.4134521484375, 'logits/chosen': -2.400635242462158, 'logits/rejected': -5.395668029785156, 'epoch': 0.25} |
|
25%|βββ | 391/1545 [03:28<08:24, 2.29it/s]
25%|βββ | 392/1545 [03:29<08:08, 2.36it/s]
25%|βββ | 393/1545 [03:29<08:41, 2.21it/s]
26%|βββ | 394/1545 [03:29<08:25, 2.28it/s]
26%|βββ | 395/1545 [03:30<09:14, 2.08it/s]
26%|βββ | 396/1545 [03:31<09:45, 1.96it/s]
26%|βββ | 397/1545 [03:31<09:43, 1.97it/s]
26%|βββ | 398/1545 [03:31<08:49, 2.16it/s]
26%|βββ | 399/1545 [03:32<09:31, 2.01it/s]
26%|βββ | 400/1545 [03:33<09:58, 1.91it/s]
{'loss': 2.9874, 'grad_norm': 0.0002765655517578125, 'learning_rate': 7.411003236245955e-06, 'rewards/chosen': -18.571969985961914, 'rewards/rejected': -58.970985412597656, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 40.39901351928711, 'logps/chosen': -332.44952392578125, 'logps/rejected': -692.6171875, 'logits/chosen': -2.102853298187256, 'logits/rejected': -5.4109063148498535, 'epoch': 0.26} |
|
26%|βββ | 401/1545 [03:33<09:33, 1.99it/s]
26%|βββ | 402/1545 [03:34<10:00, 1.90it/s]
26%|βββ | 403/1545 [03:34<10:11, 1.87it/s]
26%|βββ | 404/1545 [03:35<10:16, 1.85it/s]
26%|βββ | 405/1545 [03:35<09:46, 1.94it/s]
26%|βββ | 406/1545 [03:36<10:07, 1.87it/s]
26%|βββ | 407/1545 [03:36<10:23, 1.83it/s]
26%|βββ | 408/1545 [03:37<10:05, 1.88it/s]
26%|βββ | 409/1545 [03:37<10:21, 1.83it/s]
27%|βββ | 410/1545 [03:38<10:25, 1.82it/s]
{'loss': 0.0823, 'grad_norm': 1.1188966420050406e-16, 'learning_rate': 7.3462783171521046e-06, 'rewards/chosen': -16.90290069580078, 'rewards/rejected': -36.86577224731445, 'rewards/accuracies': 1.0, 'rewards/margins': 19.962865829467773, 'logps/chosen': -327.2436828613281, 'logps/rejected': -486.8968811035156, 'logits/chosen': -1.703176498413086, 'logits/rejected': -2.9482998847961426, 'epoch': 0.27} |
|
27%|βββ | 411/1545 [03:39<10:34, 1.79it/s]
27%|βββ | 412/1545 [03:40<13:59, 1.35it/s]
27%|βββ | 413/1545 [03:40<13:05, 1.44it/s]
27%|βββ | 414/1545 [03:41<12:11, 1.55it/s]
27%|βββ | 415/1545 [03:41<11:01, 1.71it/s]
27%|βββ | 416/1545 [03:42<11:09, 1.69it/s]
27%|βββ | 417/1545 [03:43<10:52, 1.73it/s]
27%|βββ | 418/1545 [03:43<10:42, 1.75it/s]
27%|βββ | 419/1545 [03:43<09:55, 1.89it/s]
27%|βββ | 420/1545 [03:44<10:21, 1.81it/s]
{'loss': 1.0062, 'grad_norm': 0.0, 'learning_rate': 7.2815533980582534e-06, 'rewards/chosen': -29.327747344970703, 'rewards/rejected': -75.21330261230469, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 45.88555145263672, 'logps/chosen': -462.95111083984375, 'logps/rejected': -864.5911865234375, 'logits/chosen': -2.609936237335205, 'logits/rejected': -4.504573345184326, 'epoch': 0.27} |
|
27%|βββ | 421/1545 [03:45<10:26, 1.79it/s]
27%|βββ | 422/1545 [03:45<09:57, 1.88it/s]
27%|βββ | 423/1545 [03:46<10:24, 1.80it/s]
27%|βββ | 424/1545 [03:46<10:29, 1.78it/s]
28%|βββ | 425/1545 [03:47<10:22, 1.80it/s]
28%|βββ | 426/1545 [03:47<08:51, 2.11it/s]
28%|βββ | 427/1545 [03:48<09:37, 1.94it/s]
28%|βββ | 428/1545 [03:48<09:54, 1.88it/s]
28%|βββ | 429/1545 [03:49<09:56, 1.87it/s]
28%|βββ | 430/1545 [03:49<09:49, 1.89it/s]
{'loss': 0.972, 'grad_norm': 2.5011104298755527e-12, 'learning_rate': 7.2168284789644015e-06, 'rewards/chosen': -48.82561492919922, 'rewards/rejected': -107.1943359375, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 58.36871337890625, 'logps/chosen': -632.4086303710938, 'logps/rejected': -1192.110107421875, 'logits/chosen': -3.8830108642578125, 'logits/rejected': -4.851853370666504, 'epoch': 0.28} |
|
28%|βββ | 431/1545 [03:50<10:13, 1.82it/s]
28%|βββ | 432/1545 [03:51<10:21, 1.79it/s]
28%|βββ | 433/1545 [03:51<09:41, 1.91it/s]
28%|βββ | 434/1545 [03:52<10:04, 1.84it/s]
28%|βββ | 435/1545 [03:52<10:13, 1.81it/s]
28%|βββ | 436/1545 [03:53<09:14, 2.00it/s]
28%|βββ | 437/1545 [03:53<08:51, 2.08it/s]
28%|βββ | 438/1545 [03:53<08:17, 2.22it/s]
28%|βββ | 439/1545 [03:54<08:56, 2.06it/s]
28%|βββ | 440/1545 [03:54<08:16, 2.22it/s]
{'loss': 0.0, 'grad_norm': 6.054962083888163e-22, 'learning_rate': 7.152103559870551e-06, 'rewards/chosen': -39.671104431152344, 'rewards/rejected': -89.18429565429688, 'rewards/accuracies': 1.0, 'rewards/margins': 49.51319122314453, 'logps/chosen': -521.9129028320312, 'logps/rejected': -983.9114990234375, 'logits/chosen': -3.958955764770508, 'logits/rejected': -6.404815673828125, 'epoch': 0.28} |
|
29%|βββ | 441/1545 [03:55<08:23, 2.19it/s]
29%|βββ | 442/1545 [03:55<09:05, 2.02it/s]
29%|βββ | 443/1545 [03:56<08:28, 2.17it/s]
29%|βββ | 444/1545 [03:56<08:56, 2.05it/s]
29%|βββ | 445/1545 [03:57<08:45, 2.09it/s]
29%|βββ | 446/1545 [03:57<09:22, 1.95it/s]
29%|βββ | 447/1545 [03:58<09:39, 1.90it/s]
29%|βββ | 448/1545 [03:58<09:44, 1.88it/s]
29%|βββ | 449/1545 [03:59<09:15, 1.97it/s]
29%|βββ | 450/1545 [03:59<09:46, 1.87it/s]
{'loss': 3.3081, 'grad_norm': 6.103515625e-05, 'learning_rate': 7.0873786407767e-06, 'rewards/chosen': -54.06614303588867, 'rewards/rejected': -93.74281311035156, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 39.676658630371094, 'logps/chosen': -693.2251586914062, 'logps/rejected': -1041.663818359375, 'logits/chosen': -3.7713088989257812, 'logits/rejected': -5.899226665496826, 'epoch': 0.29} |
|
29%|βββ | 450/1545 [04:00<09:46, 1.87it/s]
29%|βββ | 451/1545 [04:00<10:07, 1.80it/s]
29%|βββ | 452/1545 [04:01<09:35, 1.90it/s]
29%|βββ | 453/1545 [04:01<09:48, 1.85it/s]
29%|βββ | 454/1545 [04:02<09:59, 1.82it/s]
29%|βββ | 455/1545 [04:02<10:01, 1.81it/s]
30%|βββ | 456/1545 [04:03<09:10, 1.98it/s]
30%|βββ | 457/1545 [04:03<08:34, 2.11it/s]
30%|βββ | 458/1545 [04:04<09:04, 2.00it/s]
30%|βββ | 459/1545 [04:04<09:22, 1.93it/s]
30%|βββ | 460/1545 [04:05<08:52, 2.04it/s]
{'loss': 2.1395, 'grad_norm': 4.376943252282217e-12, 'learning_rate': 7.022653721682848e-06, 'rewards/chosen': -43.0927619934082, 'rewards/rejected': -76.77229309082031, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 33.679527282714844, 'logps/chosen': -589.8778686523438, 'logps/rejected': -876.2001953125, 'logits/chosen': -3.6679959297180176, 'logits/rejected': -4.975624084472656, 'epoch': 0.3} |
|
30%|βββ | 461/1545 [04:05<09:21, 1.93it/s]
30%|βββ | 462/1545 [04:06<09:31, 1.90it/s]
30%|βββ | 463/1545 [04:06<09:36, 1.88it/s]
30%|βββ | 464/1545 [04:07<09:31, 1.89it/s]
30%|βββ | 465/1545 [04:07<09:55, 1.81it/s]
30%|βββ | 466/1545 [04:08<10:06, 1.78it/s]
30%|βββ | 467/1545 [04:08<09:24, 1.91it/s]
30%|βββ | 468/1545 [04:09<09:52, 1.82it/s]
30%|βββ | 469/1545 [04:10<10:04, 1.78it/s]
30%|βββ | 470/1545 [04:10<09:50, 1.82it/s]
{'loss': 0.0008, 'grad_norm': 0.0, 'learning_rate': 6.957928802588997e-06, 'rewards/chosen': -28.976566314697266, 'rewards/rejected': -84.09245300292969, 'rewards/accuracies': 1.0, 'rewards/margins': 55.11588668823242, 'logps/chosen': -440.51788330078125, 'logps/rejected': -952.6383056640625, 'logits/chosen': -2.842275619506836, 'logits/rejected': -4.066740989685059, 'epoch': 0.3} |
|
30%|βββ | 471/1545 [04:11<08:50, 2.02it/s]
31%|βββ | 472/1545 [04:11<09:23, 1.90it/s]
31%|βββ | 473/1545 [04:12<09:25, 1.90it/s]
31%|βββ | 474/1545 [04:12<09:18, 1.92it/s]
31%|βββ | 475/1545 [04:13<09:38, 1.85it/s]
31%|βββ | 476/1545 [04:13<09:45, 1.83it/s]
31%|βββ | 477/1545 [04:14<09:42, 1.83it/s]
31%|βββ | 478/1545 [04:14<08:53, 2.00it/s]
31%|βββ | 479/1545 [04:15<09:22, 1.90it/s]
31%|βββ | 480/1545 [04:15<09:25, 1.88it/s]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 6.893203883495147e-06, 'rewards/chosen': -20.838396072387695, 'rewards/rejected': -92.45664978027344, 'rewards/accuracies': 1.0, 'rewards/margins': 71.61825561523438, 'logps/chosen': -362.0668029785156, 'logps/rejected': -1039.4049072265625, 'logits/chosen': -2.1511878967285156, 'logits/rejected': -4.547484874725342, 'epoch': 0.31} |
|
31%|βββ | 481/1545 [04:16<09:25, 1.88it/s]
31%|βββ | 482/1545 [04:16<08:35, 2.06it/s]
31%|ββββ | 483/1545 [04:17<09:04, 1.95it/s]
31%|ββββ | 484/1545 [04:17<09:14, 1.91it/s]
31%|ββββ | 485/1545 [04:18<08:58, 1.97it/s]
31%|ββββ | 486/1545 [04:18<09:19, 1.89it/s]
32%|ββββ | 487/1545 [04:19<09:37, 1.83it/s]
32%|ββββ | 488/1545 [04:20<09:46, 1.80it/s]
32%|ββββ | 489/1545 [04:20<09:03, 1.94it/s]
32%|ββββ | 490/1545 [04:21<09:31, 1.85it/s]
{'loss': 0.2412, 'grad_norm': 152.0, 'learning_rate': 6.828478964401295e-06, 'rewards/chosen': -30.904491424560547, 'rewards/rejected': -65.34832763671875, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 34.44383239746094, 'logps/chosen': -486.5516052246094, 'logps/rejected': -770.8928833007812, 'logits/chosen': -2.9180967807769775, 'logits/rejected': -4.752321720123291, 'epoch': 0.32} |
|
32%|ββββ | 491/1545 [04:21<09:47, 1.79it/s]
32%|ββββ | 492/1545 [04:22<09:18, 1.89it/s]
32%|ββββ | 493/1545 [04:22<09:44, 1.80it/s]
32%|ββββ | 494/1545 [04:23<09:47, 1.79it/s]
32%|ββββ | 495/1545 [04:23<09:47, 1.79it/s]
32%|ββββ | 496/1545 [04:24<09:05, 1.92it/s]
32%|ββββ | 497/1545 [04:24<09:21, 1.86it/s]
32%|ββββ | 498/1545 [04:25<09:39, 1.81it/s]
32%|ββββ | 499/1545 [04:26<09:36, 1.81it/s]
32%|ββββ | 500/1545 [04:26<09:58, 1.75it/s]
{'loss': 1.302, 'grad_norm': 2.6072732907671432e-21, 'learning_rate': 6.763754045307444e-06, 'rewards/chosen': -14.026659965515137, 'rewards/rejected': -57.82221221923828, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 43.795555114746094, 'logps/chosen': -295.4187927246094, 'logps/rejected': -692.2587890625, 'logits/chosen': -1.8123325109481812, 'logits/rejected': -4.419227123260498, 'epoch': 0.32} |
|
32%|ββββ | 501/1545 [04:27<10:10, 1.71it/s]
32%|ββββ | 502/1545 [04:27<09:58, 1.74it/s]
33%|ββββ | 503/1545 [04:28<09:47, 1.77it/s]
33%|ββββ | 504/1545 [04:29<10:25, 1.66it/s]
33%|ββββ | 505/1545 [04:29<10:18, 1.68it/s]
33%|ββββ | 506/1545 [04:30<09:34, 1.81it/s]
33%|ββββ | 507/1545 [04:30<09:55, 1.74it/s]
33%|ββββ | 508/1545 [04:31<09:58, 1.73it/s]
33%|ββββ | 509/1545 [04:31<09:36, 1.80it/s]
33%|ββββ | 510/1545 [04:32<09:42, 1.78it/s]
{'loss': 0.7018, 'grad_norm': 1.9063008949160576e-09, 'learning_rate': 6.6990291262135935e-06, 'rewards/chosen': -18.401165008544922, 'rewards/rejected': -41.45631408691406, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 23.05514907836914, 'logps/chosen': -362.13482666015625, 'logps/rejected': -531.1427612304688, 'logits/chosen': -1.9105132818222046, 'logits/rejected': -3.823967695236206, 'epoch': 0.33} |
|
33%|ββββ | 511/1545 [04:33<09:52, 1.75it/s]
33%|ββββ | 512/1545 [04:33<09:40, 1.78it/s]
33%|ββββ | 513/1545 [04:33<09:00, 1.91it/s]
33%|ββββ | 514/1545 [04:34<09:16, 1.85it/s]
33%|ββββ | 515/1545 [04:35<09:22, 1.83it/s]
33%|ββββ | 516/1545 [04:35<09:13, 1.86it/s]
33%|ββββ | 517/1545 [04:36<09:30, 1.80it/s]
34%|ββββ | 518/1545 [04:36<09:33, 1.79it/s]
34%|ββββ | 519/1545 [04:37<09:31, 1.80it/s]
34%|ββββ | 520/1545 [04:37<08:48, 1.94it/s]
{'loss': 0.9755, 'grad_norm': 3.924811864397526e-17, 'learning_rate': 6.6343042071197415e-06, 'rewards/chosen': -7.66559362411499, 'rewards/rejected': -35.98261642456055, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 28.317020416259766, 'logps/chosen': -237.4998321533203, 'logps/rejected': -480.2190856933594, 'logits/chosen': -1.257673978805542, 'logits/rejected': -3.1603283882141113, 'epoch': 0.34} |
|
34%|ββββ | 521/1545 [04:38<09:07, 1.87it/s]
34%|ββββ | 522/1545 [04:38<09:11, 1.85it/s]
34%|ββββ | 523/1545 [04:39<09:10, 1.86it/s]
34%|ββββ | 524/1545 [04:39<09:03, 1.88it/s]
34%|ββββ | 525/1545 [04:40<09:23, 1.81it/s]
34%|ββββ | 526/1545 [04:41<13:50, 1.23it/s]
34%|ββββ | 527/1545 [04:42<11:38, 1.46it/s]
34%|ββββ | 528/1545 [04:42<10:59, 1.54it/s]
34%|ββββ | 529/1545 [04:43<10:37, 1.59it/s]
34%|ββββ | 530/1545 [04:44<10:09, 1.67it/s]
{'loss': 0.4228, 'grad_norm': 0.0, 'learning_rate': 6.56957928802589e-06, 'rewards/chosen': -30.713363647460938, 'rewards/rejected': -60.528602600097656, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 29.81524085998535, 'logps/chosen': -450.9524841308594, 'logps/rejected': -714.2261962890625, 'logits/chosen': -3.075618267059326, 'logits/rejected': -4.854428768157959, 'epoch': 0.34} |
|
34%|ββββ | 531/1545 [04:44<09:39, 1.75it/s]
34%|ββββ | 532/1545 [04:45<09:42, 1.74it/s]
34%|ββββ | 533/1545 [04:45<09:37, 1.75it/s]
35%|ββββ | 534/1545 [04:46<08:54, 1.89it/s]
35%|ββββ | 535/1545 [04:46<09:14, 1.82it/s]
35%|ββββ | 536/1545 [04:47<09:17, 1.81it/s]
35%|ββββ | 537/1545 [04:47<09:14, 1.82it/s]
35%|ββββ | 538/1545 [04:48<09:06, 1.84it/s]
35%|ββββ | 539/1545 [04:48<09:18, 1.80it/s]
35%|ββββ | 540/1545 [04:49<09:20, 1.79it/s]
{'loss': 0.723, 'grad_norm': 4.440892098500626e-15, 'learning_rate': 6.50485436893204e-06, 'rewards/chosen': -16.0915584564209, 'rewards/rejected': -53.74153518676758, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 37.64997482299805, 'logps/chosen': -319.9918518066406, 'logps/rejected': -653.3815307617188, 'logits/chosen': -1.6358880996704102, 'logits/rejected': -3.925060749053955, 'epoch': 0.35} |
|
35%|ββββ | 541/1545 [04:49<08:44, 1.91it/s]
35%|ββββ | 542/1545 [04:50<09:08, 1.83it/s]
35%|ββββ | 543/1545 [04:51<09:12, 1.81it/s]
35%|ββββ | 544/1545 [04:51<09:09, 1.82it/s]
35%|ββββ | 545/1545 [04:52<09:01, 1.85it/s]
35%|ββββ | 546/1545 [04:52<09:09, 1.82it/s]
35%|ββββ | 547/1545 [04:53<09:15, 1.80it/s]
35%|ββββ | 548/1545 [04:53<08:38, 1.92it/s]
36%|ββββ | 549/1545 [04:54<08:53, 1.87it/s]
36%|ββββ | 550/1545 [04:54<08:57, 1.85it/s]
{'loss': 0.0105, 'grad_norm': 3.9257486150745535e-13, 'learning_rate': 6.440129449838188e-06, 'rewards/chosen': -14.184236526489258, 'rewards/rejected': -45.90851593017578, 'rewards/accuracies': 1.0, 'rewards/margins': 31.72427749633789, 'logps/chosen': -261.56756591796875, 'logps/rejected': -570.337890625, 'logits/chosen': -1.7183490991592407, 'logits/rejected': -3.5872623920440674, 'epoch': 0.36} |
|
36%|ββββ | 551/1545 [04:55<09:07, 1.81it/s]
36%|ββββ | 552/1545 [04:55<08:51, 1.87it/s]
36%|ββββ | 553/1545 [04:56<09:04, 1.82it/s]
36%|ββββ | 554/1545 [04:57<09:02, 1.83it/s]
36%|ββββ | 555/1545 [04:57<08:34, 1.92it/s]
36%|ββββ | 556/1545 [04:58<08:50, 1.86it/s]
36%|ββββ | 557/1545 [04:58<08:55, 1.84it/s]
36%|ββββ | 558/1545 [04:59<09:02, 1.82it/s]
36%|ββββ | 559/1545 [04:59<08:16, 1.99it/s]
36%|ββββ | 560/1545 [04:59<07:39, 2.15it/s]
{'loss': 1.7896, 'grad_norm': 7.104873657226562e-05, 'learning_rate': 6.375404530744337e-06, 'rewards/chosen': -30.25199317932129, 'rewards/rejected': -56.004981994628906, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 25.75299072265625, 'logps/chosen': -435.36322021484375, 'logps/rejected': -665.7625732421875, 'logits/chosen': -1.7875016927719116, 'logits/rejected': -3.8524105548858643, 'epoch': 0.36} |
|
36%|ββββ | 560/1545 [05:00<07:39, 2.15it/s]
36%|ββββ | 561/1545 [05:00<08:08, 2.01it/s]
36%|ββββ | 562/1545 [05:01<08:29, 1.93it/s]
36%|ββββ | 563/1545 [05:01<07:58, 2.05it/s]
37%|ββββ | 564/1545 [05:02<08:22, 1.95it/s]
37%|ββββ | 565/1545 [05:02<08:33, 1.91it/s]
37%|ββββ | 566/1545 [05:03<08:37, 1.89it/s]
37%|ββββ | 567/1545 [05:03<08:25, 1.94it/s]
37%|ββββ | 568/1545 [05:04<08:39, 1.88it/s]
37%|ββββ | 569/1545 [05:04<08:46, 1.85it/s]
37%|ββββ | 570/1545 [05:05<08:16, 1.96it/s]
{'loss': 0.0057, 'grad_norm': 1.0302869668521453e-12, 'learning_rate': 6.310679611650487e-06, 'rewards/chosen': -9.948836326599121, 'rewards/rejected': -50.712440490722656, 'rewards/accuracies': 1.0, 'rewards/margins': 40.76360321044922, 'logps/chosen': -243.68978881835938, 'logps/rejected': -616.48486328125, 'logits/chosen': -1.0023655891418457, 'logits/rejected': -3.4231104850769043, 'epoch': 0.37} |
|
37%|ββββ | 571/1545 [05:05<08:46, 1.85it/s]
37%|ββββ | 572/1545 [05:06<08:48, 1.84it/s]
37%|ββββ | 573/1545 [05:06<08:45, 1.85it/s]
37%|ββββ | 574/1545 [05:07<08:32, 1.90it/s]
37%|ββββ | 575/1545 [05:08<08:43, 1.85it/s]
37%|ββββ | 576/1545 [05:08<08:45, 1.84it/s]
37%|ββββ | 577/1545 [05:09<08:19, 1.94it/s]
37%|ββββ | 578/1545 [05:09<08:36, 1.87it/s]
37%|ββββ | 579/1545 [05:10<08:48, 1.83it/s]
38%|ββββ | 580/1545 [05:10<08:46, 1.83it/s]
{'loss': 0.0143, 'grad_norm': 0.0, 'learning_rate': 6.245954692556635e-06, 'rewards/chosen': -18.19172477722168, 'rewards/rejected': -61.596435546875, 'rewards/accuracies': 1.0, 'rewards/margins': 43.40471267700195, 'logps/chosen': -331.9469909667969, 'logps/rejected': -743.6862182617188, 'logits/chosen': -1.4042354822158813, 'logits/rejected': -3.8643798828125, 'epoch': 0.38} |
|
38%|ββββ | 581/1545 [05:11<08:20, 1.93it/s]
38%|ββββ | 582/1545 [05:11<08:42, 1.84it/s]
38%|ββββ | 583/1545 [05:12<08:53, 1.80it/s]
38%|ββββ | 584/1545 [05:12<08:40, 1.85it/s]
38%|ββββ | 585/1545 [05:13<08:53, 1.80it/s]
38%|ββββ | 586/1545 [05:14<09:09, 1.74it/s]
38%|ββββ | 587/1545 [05:14<09:00, 1.77it/s]
38%|ββββ | 588/1545 [05:15<08:39, 1.84it/s]
38%|ββββ | 589/1545 [05:15<08:46, 1.82it/s]
38%|ββββ | 590/1545 [05:16<08:40, 1.83it/s]
{'loss': 0.0179, 'grad_norm': 77.5, 'learning_rate': 6.181229773462784e-06, 'rewards/chosen': -18.66933822631836, 'rewards/rejected': -50.0496826171875, 'rewards/accuracies': 1.0, 'rewards/margins': 31.380340576171875, 'logps/chosen': -321.9429626464844, 'logps/rejected': -626.2196655273438, 'logits/chosen': -1.854098916053772, 'logits/rejected': -3.260258436203003, 'epoch': 0.38} |
|
38%|ββββ | 591/1545 [05:16<08:12, 1.94it/s]
38%|ββββ | 592/1545 [05:17<08:37, 1.84it/s]
38%|ββββ | 593/1545 [05:17<08:51, 1.79it/s]
38%|ββββ | 594/1545 [05:18<07:55, 2.00it/s]
39%|ββββ | 595/1545 [05:18<07:44, 2.05it/s]
39%|ββββ | 596/1545 [05:19<08:24, 1.88it/s]
39%|ββββ | 597/1545 [05:19<08:37, 1.83it/s]
39%|ββββ | 598/1545 [05:20<07:48, 2.02it/s]
39%|ββββ | 599/1545 [05:20<07:27, 2.11it/s]
39%|ββββ | 600/1545 [05:21<07:53, 2.00it/s]
{'loss': 0.0001, 'grad_norm': 0.5234375, 'learning_rate': 6.116504854368932e-06, 'rewards/chosen': -20.77777099609375, 'rewards/rejected': -52.45641326904297, 'rewards/accuracies': 1.0, 'rewards/margins': 31.678646087646484, 'logps/chosen': -362.299072265625, 'logps/rejected': -631.9798583984375, 'logits/chosen': -1.6714366674423218, 'logits/rejected': -3.967179775238037, 'epoch': 0.39} |
|
39%|ββββ | 601/1545 [05:21<08:15, 1.90it/s]
39%|ββββ | 602/1545 [05:22<08:17, 1.90it/s]
39%|ββββ | 603/1545 [05:22<08:08, 1.93it/s]
39%|ββββ | 604/1545 [05:23<08:23, 1.87it/s]
39%|ββββ | 605/1545 [05:24<08:28, 1.85it/s]
39%|ββββ | 606/1545 [05:24<08:07, 1.92it/s]
39%|ββββ | 607/1545 [05:24<07:28, 2.09it/s]
39%|ββββ | 608/1545 [05:25<07:53, 1.98it/s]
39%|ββββ | 609/1545 [05:25<08:07, 1.92it/s]
39%|ββββ | 610/1545 [05:26<07:55, 1.97it/s]
{'loss': 0.0579, 'grad_norm': 0.0, 'learning_rate': 6.0517799352750815e-06, 'rewards/chosen': -18.156063079833984, 'rewards/rejected': -48.84654998779297, 'rewards/accuracies': 1.0, 'rewards/margins': 30.690486907958984, 'logps/chosen': -305.2608947753906, 'logps/rejected': -596.3084716796875, 'logits/chosen': -1.5014350414276123, 'logits/rejected': -3.3216071128845215, 'epoch': 0.39} |
|
40%|ββββ | 611/1545 [05:27<08:20, 1.86it/s]
40%|ββββ | 612/1545 [05:27<07:32, 2.06it/s]
40%|ββββ | 613/1545 [05:27<07:55, 1.96it/s]
40%|ββββ | 614/1545 [05:28<07:28, 2.08it/s]
40%|ββββ | 615/1545 [05:28<07:56, 1.95it/s]
40%|ββββ | 616/1545 [05:29<08:07, 1.91it/s]
40%|ββββ | 617/1545 [05:30<08:09, 1.90it/s]
40%|ββββ | 618/1545 [05:30<08:01, 1.92it/s]
40%|ββββ | 619/1545 [05:30<07:22, 2.09it/s]
40%|ββββ | 620/1545 [05:31<07:43, 1.99it/s]
{'loss': 0.0011, 'grad_norm': 12.3125, 'learning_rate': 5.9870550161812304e-06, 'rewards/chosen': -20.277729034423828, 'rewards/rejected': -44.97526931762695, 'rewards/accuracies': 1.0, 'rewards/margins': 24.69754409790039, 'logps/chosen': -371.6828308105469, 'logps/rejected': -576.7119750976562, 'logits/chosen': -1.2356998920440674, 'logits/rejected': -2.410062074661255, 'epoch': 0.4} |
|
40%|ββββ | 621/1545 [05:32<07:59, 1.93it/s]
40%|ββββ | 622/1545 [05:32<08:12, 1.88it/s]
40%|ββββ | 623/1545 [05:33<08:25, 1.82it/s]
40%|ββββ | 624/1545 [05:33<08:24, 1.83it/s]
40%|ββββ | 625/1545 [05:34<07:55, 1.94it/s]
41%|ββββ | 626/1545 [05:34<08:20, 1.84it/s]
41%|ββββ | 627/1545 [05:35<08:27, 1.81it/s]
41%|ββββ | 628/1545 [05:35<08:07, 1.88it/s]
41%|ββββ | 629/1545 [05:36<08:24, 1.81it/s]
41%|ββββ | 630/1545 [05:37<08:29, 1.80it/s]
{'loss': 0.0001, 'grad_norm': 6.606569513678551e-09, 'learning_rate': 5.9223300970873785e-06, 'rewards/chosen': -22.200729370117188, 'rewards/rejected': -52.896087646484375, 'rewards/accuracies': 1.0, 'rewards/margins': 30.695358276367188, 'logps/chosen': -359.304931640625, 'logps/rejected': -632.1669921875, 'logits/chosen': -2.2159793376922607, 'logits/rejected': -4.487706184387207, 'epoch': 0.41} |
|
41%|ββββ | 631/1545 [05:37<08:32, 1.78it/s]
41%|ββββ | 632/1545 [05:38<07:54, 1.92it/s]
41%|ββββ | 633/1545 [05:38<08:11, 1.86it/s]
41%|ββββ | 634/1545 [05:39<08:16, 1.83it/s]
41%|ββββ | 635/1545 [05:39<08:01, 1.89it/s]
41%|ββββ | 636/1545 [05:40<07:07, 2.13it/s]
41%|ββββ | 637/1545 [05:40<07:35, 1.99it/s]
41%|βββββ | 638/1545 [05:41<07:53, 1.92it/s]
41%|βββββ | 639/1545 [05:41<07:36, 1.98it/s]
41%|βββββ | 640/1545 [05:42<07:54, 1.91it/s]
{'loss': 0.0756, 'grad_norm': 2.8731357570865868e-18, 'learning_rate': 5.857605177993528e-06, 'rewards/chosen': -37.544307708740234, 'rewards/rejected': -75.83995056152344, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 38.295654296875, 'logps/chosen': -509.95721435546875, 'logps/rejected': -876.0435791015625, 'logits/chosen': -3.4378981590270996, 'logits/rejected': -4.655713081359863, 'epoch': 0.41} |
|
41%|βββββ | 641/1545 [05:43<11:29, 1.31it/s]
42%|βββββ | 642/1545 [05:44<10:22, 1.45it/s]
42%|βββββ | 643/1545 [05:44<09:02, 1.66it/s]
42%|βββββ | 644/1545 [05:45<08:56, 1.68it/s]
42%|βββββ | 645/1545 [05:45<08:41, 1.73it/s]
42%|βββββ | 646/1545 [05:46<08:27, 1.77it/s]
42%|βββββ | 647/1545 [05:46<08:11, 1.83it/s]
42%|βββββ | 648/1545 [05:47<08:16, 1.81it/s]
42%|βββββ | 649/1545 [05:47<08:16, 1.81it/s]
42%|βββββ | 650/1545 [05:48<07:40, 1.94it/s]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 5.792880258899677e-06, 'rewards/chosen': -17.858400344848633, 'rewards/rejected': -68.02870178222656, 'rewards/accuracies': 1.0, 'rewards/margins': 50.1702995300293, 'logps/chosen': -305.210205078125, 'logps/rejected': -778.7840576171875, 'logits/chosen': -2.0918753147125244, 'logits/rejected': -5.265947341918945, 'epoch': 0.42} |
|
42%|βββββ | 651/1545 [05:48<07:57, 1.87it/s]
42%|βββββ | 652/1545 [05:49<08:05, 1.84it/s]
42%|βββββ | 653/1545 [05:49<07:19, 2.03it/s]
42%|βββββ | 654/1545 [05:50<07:01, 2.11it/s]
42%|βββββ | 655/1545 [05:50<07:26, 1.99it/s]
42%|βββββ | 656/1545 [05:51<07:46, 1.91it/s]
43%|βββββ | 657/1545 [05:51<08:01, 1.84it/s]
43%|βββββ | 658/1545 [05:52<07:45, 1.91it/s]
43%|βββββ | 659/1545 [05:52<07:58, 1.85it/s]
43%|βββββ | 660/1545 [05:53<07:58, 1.85it/s]
{'loss': 2.3511, 'grad_norm': 0.0, 'learning_rate': 5.728155339805825e-06, 'rewards/chosen': -37.305686950683594, 'rewards/rejected': -80.12673950195312, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 42.821044921875, 'logps/chosen': -520.0347290039062, 'logps/rejected': -906.5696411132812, 'logits/chosen': -3.2962348461151123, 'logits/rejected': -6.164005279541016, 'epoch': 0.43} |
|
43%|βββββ | 661/1545 [05:53<07:34, 1.95it/s]
43%|βββββ | 662/1545 [05:54<07:50, 1.88it/s]
43%|βββββ | 663/1545 [05:55<08:01, 1.83it/s]
43%|βββββ | 664/1545 [05:55<08:02, 1.83it/s]
43%|βββββ | 665/1545 [05:55<07:23, 1.98it/s]
43%|βββββ | 666/1545 [05:56<07:41, 1.90it/s]
43%|βββββ | 667/1545 [05:56<06:58, 2.10it/s]
43%|βββββ | 668/1545 [05:57<07:18, 2.00it/s]
43%|βββββ | 669/1545 [05:57<06:56, 2.10it/s]
43%|βββββ | 670/1545 [05:58<07:20, 1.99it/s]
{'loss': 0.0, 'grad_norm': 0.00046539306640625, 'learning_rate': 5.663430420711975e-06, 'rewards/chosen': -21.083389282226562, 'rewards/rejected': -60.96876907348633, 'rewards/accuracies': 1.0, 'rewards/margins': 39.885379791259766, 'logps/chosen': -365.4623107910156, 'logps/rejected': -721.8714599609375, 'logits/chosen': -2.13993501663208, 'logits/rejected': -5.14273738861084, 'epoch': 0.43} |
|
43%|βββββ | 671/1545 [05:59<07:44, 1.88it/s]
43%|βββββ | 672/1545 [05:59<07:41, 1.89it/s]
44%|βββββ | 673/1545 [06:00<07:37, 1.91it/s]
44%|βββββ | 674/1545 [06:00<07:50, 1.85it/s]
44%|βββββ | 675/1545 [06:01<07:52, 1.84it/s]
44%|βββββ | 676/1545 [06:01<07:21, 1.97it/s]
44%|βββββ | 677/1545 [06:02<07:38, 1.89it/s]
44%|βββββ | 678/1545 [06:02<07:46, 1.86it/s]
44%|βββββ | 679/1545 [06:03<07:39, 1.88it/s]
44%|βββββ | 680/1545 [06:03<07:04, 2.04it/s]
{'loss': 0.001, 'grad_norm': 9.492850949754938e-12, 'learning_rate': 5.598705501618124e-06, 'rewards/chosen': -25.24799346923828, 'rewards/rejected': -60.609230041503906, 'rewards/accuracies': 1.0, 'rewards/margins': 35.361228942871094, 'logps/chosen': -402.2320251464844, 'logps/rejected': -727.8626708984375, 'logits/chosen': -2.94899845123291, 'logits/rejected': -5.123431205749512, 'epoch': 0.44} |
|
44%|βββββ | 681/1545 [06:04<07:28, 1.93it/s]
44%|βββββ | 682/1545 [06:04<07:33, 1.90it/s]
44%|βββββ | 683/1545 [06:05<07:32, 1.91it/s]
44%|βββββ | 684/1545 [06:05<06:27, 2.22it/s]
44%|βββββ | 685/1545 [06:06<07:03, 2.03it/s]
44%|βββββ | 686/1545 [06:06<06:33, 2.19it/s]
44%|βββββ | 687/1545 [06:07<06:58, 2.05it/s]
45%|βββββ | 688/1545 [06:07<06:43, 2.12it/s]
45%|βββββ | 689/1545 [06:08<07:11, 1.98it/s]
45%|βββββ | 690/1545 [06:08<07:25, 1.92it/s]
{'loss': 0.2533, 'grad_norm': 4.76837158203125e-07, 'learning_rate': 5.533980582524272e-06, 'rewards/chosen': -26.505752563476562, 'rewards/rejected': -50.40888595581055, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 23.90313720703125, 'logps/chosen': -408.7076721191406, 'logps/rejected': -599.2103881835938, 'logits/chosen': -2.7856857776641846, 'logits/rejected': -5.5106892585754395, 'epoch': 0.45} |
|
45%|βββββ | 691/1545 [06:09<07:36, 1.87it/s]
45%|βββββ | 692/1545 [06:09<06:37, 2.15it/s]
45%|βββββ | 693/1545 [06:09<05:52, 2.42it/s]
45%|βββββ | 694/1545 [06:10<06:28, 2.19it/s]
45%|βββββ | 695/1545 [06:10<06:50, 2.07it/s]
45%|βββββ | 696/1545 [06:11<06:01, 2.35it/s]
45%|βββββ | 697/1545 [06:11<05:24, 2.61it/s]
45%|βββββ | 698/1545 [06:11<05:33, 2.54it/s]
45%|βββββ | 699/1545 [06:12<05:48, 2.43it/s]
45%|βββββ | 700/1545 [06:12<05:51, 2.40it/s]
{'loss': 0.3027, 'grad_norm': 0.0, 'learning_rate': 5.4692556634304216e-06, 'rewards/chosen': -22.45337677001953, 'rewards/rejected': -68.20580291748047, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 45.752418518066406, 'logps/chosen': -372.1856384277344, 'logps/rejected': -798.4400634765625, 'logits/chosen': -2.2510809898376465, 'logits/rejected': -5.148565292358398, 'epoch': 0.45} |
|
45%|βββββ | 701/1545 [06:13<06:09, 2.28it/s]
45%|βββββ | 702/1545 [06:13<06:19, 2.22it/s]
46%|βββββ | 703/1545 [06:14<06:16, 2.24it/s]
46%|βββββ | 704/1545 [06:14<06:43, 2.08it/s]
46%|βββββ | 705/1545 [06:15<07:11, 1.95it/s]
46%|βββββ | 706/1545 [06:15<06:52, 2.03it/s]
46%|βββββ | 707/1545 [06:16<07:00, 1.99it/s]
46%|βββββ | 708/1545 [06:16<06:29, 2.15it/s]
46%|βββββ | 709/1545 [06:17<06:29, 2.15it/s]
46%|βββββ | 710/1545 [06:17<06:47, 2.05it/s]
{'loss': 0.5746, 'grad_norm': 0.0, 'learning_rate': 5.4045307443365705e-06, 'rewards/chosen': -27.447168350219727, 'rewards/rejected': -84.84578704833984, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 57.39862060546875, 'logps/chosen': -426.19427490234375, 'logps/rejected': -969.3411865234375, 'logits/chosen': -2.232661724090576, 'logits/rejected': -4.795763969421387, 'epoch': 0.46} |
|
46%|βββββ | 711/1545 [06:18<07:15, 1.92it/s]
46%|βββββ | 712/1545 [06:18<07:04, 1.96it/s]
46%|βββββ | 713/1545 [06:19<07:04, 1.96it/s]
46%|βββββ | 714/1545 [06:19<07:20, 1.89it/s]
46%|βββββ | 715/1545 [06:20<07:16, 1.90it/s]
46%|βββββ | 716/1545 [06:20<07:07, 1.94it/s]
46%|βββββ | 717/1545 [06:21<07:12, 1.91it/s]
46%|βββββ | 718/1545 [06:22<07:29, 1.84it/s]
47%|βββββ | 719/1545 [06:22<07:15, 1.90it/s]
47%|βββββ | 720/1545 [06:23<07:19, 1.88it/s]
{'loss': 0.003, 'grad_norm': 0.0, 'learning_rate': 5.3398058252427185e-06, 'rewards/chosen': -38.11956787109375, 'rewards/rejected': -97.89371490478516, 'rewards/accuracies': 1.0, 'rewards/margins': 59.774139404296875, 'logps/chosen': -570.1392822265625, 'logps/rejected': -1108.6182861328125, 'logits/chosen': -2.4496023654937744, 'logits/rejected': -5.000131607055664, 'epoch': 0.47} |
|
47%|βββββ | 721/1545 [06:23<07:25, 1.85it/s]
47%|βββββ | 722/1545 [06:24<07:37, 1.80it/s]
47%|βββββ | 723/1545 [06:24<07:19, 1.87it/s]
47%|βββββ | 724/1545 [06:25<07:21, 1.86it/s]
47%|βββββ | 725/1545 [06:25<07:28, 1.83it/s]
47%|βββββ | 726/1545 [06:26<07:26, 1.83it/s]
47%|βββββ | 727/1545 [06:26<07:04, 1.93it/s]
47%|βββββ | 728/1545 [06:27<07:10, 1.90it/s]
47%|βββββ | 729/1545 [06:27<07:22, 1.84it/s]
47%|βββββ | 730/1545 [06:28<07:00, 1.94it/s]
{'loss': 1.7073, 'grad_norm': 2.3418766925686896e-16, 'learning_rate': 5.275080906148867e-06, 'rewards/chosen': -44.976646423339844, 'rewards/rejected': -86.87692260742188, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 41.9002799987793, 'logps/chosen': -594.9097290039062, 'logps/rejected': -975.1007690429688, 'logits/chosen': -3.272984027862549, 'logits/rejected': -5.4031596183776855, 'epoch': 0.47} |
|
47%|βββββ | 731/1545 [06:28<07:05, 1.91it/s]
47%|βββββ | 732/1545 [06:29<07:16, 1.86it/s]
47%|βββββ | 733/1545 [06:29<06:59, 1.93it/s]
48%|βββββ | 734/1545 [06:30<07:06, 1.90it/s]
48%|βββββ | 735/1545 [06:31<07:10, 1.88it/s]
48%|βββββ | 736/1545 [06:31<07:15, 1.86it/s]
48%|βββββ | 737/1545 [06:32<06:55, 1.94it/s]
48%|βββββ | 738/1545 [06:32<07:00, 1.92it/s]
48%|βββββ | 739/1545 [06:33<07:10, 1.87it/s]
48%|βββββ | 740/1545 [06:33<06:53, 1.95it/s]
{'loss': 0.0581, 'grad_norm': 30.5, 'learning_rate': 5.210355987055017e-06, 'rewards/chosen': -43.055625915527344, 'rewards/rejected': -78.43653106689453, 'rewards/accuracies': 1.0, 'rewards/margins': 35.38090133666992, 'logps/chosen': -582.3439331054688, 'logps/rejected': -901.49658203125, 'logits/chosen': -3.8236382007598877, 'logits/rejected': -6.230503082275391, 'epoch': 0.48} |
|
48%|βββββ | 741/1545 [06:34<06:56, 1.93it/s]
48%|βββββ | 742/1545 [06:34<07:07, 1.88it/s]
48%|βββββ | 743/1545 [06:35<06:55, 1.93it/s]
48%|βββββ | 744/1545 [06:35<06:46, 1.97it/s]
48%|βββββ | 745/1545 [06:36<06:56, 1.92it/s]
48%|βββββ | 746/1545 [06:36<07:07, 1.87it/s]
48%|βββββ | 747/1545 [06:37<06:50, 1.94it/s]
48%|βββββ | 748/1545 [06:37<06:56, 1.91it/s]
48%|βββββ | 749/1545 [06:38<07:07, 1.86it/s]
49%|βββββ | 750/1545 [06:38<06:46, 1.95it/s]
{'loss': 3.8903, 'grad_norm': 0.0, 'learning_rate': 5.145631067961165e-06, 'rewards/chosen': -22.969589233398438, 'rewards/rejected': -51.09120559692383, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 28.121618270874023, 'logps/chosen': -399.1053466796875, 'logps/rejected': -617.9463500976562, 'logits/chosen': -1.9127031564712524, 'logits/rejected': -4.489147186279297, 'epoch': 0.49} |
|
49%|βββββ | 751/1545 [06:39<06:49, 1.94it/s]
49%|βββββ | 752/1545 [06:39<07:01, 1.88it/s]
49%|βββββ | 753/1545 [06:40<06:05, 2.17it/s]
49%|βββββ | 754/1545 [06:40<05:17, 2.49it/s]
49%|βββββ | 755/1545 [06:40<04:51, 2.71it/s]
49%|βββββ | 756/1545 [06:41<04:25, 2.97it/s]
49%|βββββ | 757/1545 [06:41<04:07, 3.19it/s]
49%|βββββ | 758/1545 [06:41<03:55, 3.35it/s]
49%|βββββ | 759/1545 [06:41<03:33, 3.67it/s]
49%|βββββ | 760/1545 [06:42<03:30, 3.72it/s]
{'loss': 0.0001, 'grad_norm': 0.0, 'learning_rate': 5.080906148867314e-06, 'rewards/chosen': -20.00014877319336, 'rewards/rejected': -70.9677505493164, 'rewards/accuracies': 1.0, 'rewards/margins': 50.96759796142578, 'logps/chosen': -339.2450256347656, 'logps/rejected': -807.701416015625, 'logits/chosen': -2.3724873065948486, 'logits/rejected': -5.234989166259766, 'epoch': 0.49} |
|
49%|βββββ | 761/1545 [06:42<03:30, 3.73it/s]
49%|βββββ | 762/1545 [06:42<03:29, 3.73it/s]
49%|βββββ | 763/1545 [06:42<03:29, 3.73it/s]
49%|βββββ | 764/1545 [06:43<03:31, 3.69it/s]
50%|βββββ | 765/1545 [06:43<03:20, 3.90it/s]
50%|βββββ | 766/1545 [06:43<03:25, 3.78it/s]
50%|βββββ | 767/1545 [06:44<06:59, 1.86it/s]
50%|βββββ | 768/1545 [06:45<05:56, 2.18it/s]
50%|βββββ | 769/1545 [06:45<05:12, 2.48it/s]
50%|βββββ | 770/1545 [06:45<04:41, 2.75it/s]
{'loss': 1.0551, 'grad_norm': 6.261267546103788e-18, 'learning_rate': 5.016181229773464e-06, 'rewards/chosen': -21.470638275146484, 'rewards/rejected': -69.12281036376953, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 47.65216827392578, 'logps/chosen': -400.18658447265625, 'logps/rejected': -822.2199096679688, 'logits/chosen': -2.1039681434631348, 'logits/rejected': -4.775751113891602, 'epoch': 0.5} |
|
50%|βββββ | 771/1545 [06:45<04:21, 2.96it/s]
50%|βββββ | 772/1545 [06:46<04:09, 3.10it/s]
50%|βββββ | 773/1545 [06:46<03:46, 3.41it/s]
50%|βββββ | 774/1545 [06:46<03:41, 3.48it/s]
50%|βββββ | 775/1545 [06:46<03:37, 3.55it/s]
50%|βββββ | 776/1545 [06:47<03:40, 3.48it/s]
50%|βββββ | 777/1545 [06:47<03:46, 3.40it/s]
50%|βββββ | 778/1545 [06:47<03:30, 3.64it/s]
50%|βββββ | 779/1545 [06:48<03:31, 3.61it/s]
50%|βββββ | 780/1545 [06:48<03:33, 3.59it/s]
{'loss': 0.0001, 'grad_norm': 0.0, 'learning_rate': 4.951456310679612e-06, 'rewards/chosen': -25.87270736694336, 'rewards/rejected': -69.1749267578125, 'rewards/accuracies': 1.0, 'rewards/margins': 43.302223205566406, 'logps/chosen': -421.61248779296875, 'logps/rejected': -803.4362182617188, 'logits/chosen': -2.607177972793579, 'logits/rejected': -4.537522315979004, 'epoch': 0.5} |
|
51%|βββββ | 781/1545 [06:48<03:56, 3.23it/s]
51%|βββββ | 782/1545 [06:49<04:22, 2.90it/s]
51%|βββββ | 783/1545 [06:49<04:42, 2.70it/s]
51%|βββββ | 784/1545 [06:49<04:20, 2.92it/s]
51%|βββββ | 785/1545 [06:50<04:43, 2.68it/s]
51%|βββββ | 786/1545 [06:50<04:58, 2.54it/s]
51%|βββββ | 787/1545 [06:51<05:15, 2.40it/s]
51%|βββββ | 788/1545 [06:51<06:04, 2.08it/s]
51%|βββββ | 789/1545 [06:52<06:22, 1.97it/s]
51%|βββββ | 790/1545 [06:52<06:27, 1.95it/s]
{'loss': 7.3323, 'grad_norm': 0.2470703125, 'learning_rate': 4.886731391585761e-06, 'rewards/chosen': -44.284454345703125, 'rewards/rejected': -73.53287506103516, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 29.2484188079834, 'logps/chosen': -594.3431396484375, 'logps/rejected': -851.0657348632812, 'logits/chosen': -2.5097155570983887, 'logits/rejected': -3.9307990074157715, 'epoch': 0.51} |
|
51%|βββββ | 790/1545 [06:53<06:27, 1.95it/s]
51%|βββββ | 791/1545 [06:53<06:25, 1.96it/s]
51%|ββββββ | 792/1545 [06:54<06:44, 1.86it/s]
51%|ββββββ | 793/1545 [06:54<06:49, 1.83it/s]
51%|ββββββ | 794/1545 [06:55<06:30, 1.92it/s]
51%|ββββββ | 795/1545 [06:55<06:44, 1.86it/s]
52%|ββββββ | 796/1545 [06:56<06:45, 1.85it/s]
52%|ββββββ | 797/1545 [06:56<06:45, 1.85it/s]
52%|ββββββ | 798/1545 [06:57<06:24, 1.94it/s]
52%|ββββββ | 799/1545 [06:57<06:37, 1.88it/s]
52%|ββββββ | 800/1545 [06:58<06:41, 1.86it/s]
{'loss': 0.0002, 'grad_norm': 1.4375, 'learning_rate': 4.82200647249191e-06, 'rewards/chosen': -13.080400466918945, 'rewards/rejected': -36.216304779052734, 'rewards/accuracies': 1.0, 'rewards/margins': 23.135906219482422, 'logps/chosen': -296.13519287109375, 'logps/rejected': -475.113037109375, 'logits/chosen': -1.1140010356903076, 'logits/rejected': -2.2951102256774902, 'epoch': 0.52} |
|
52%|ββββββ | 801/1545 [06:58<06:30, 1.91it/s]
52%|ββββββ | 802/1545 [06:59<06:42, 1.84it/s]
52%|ββββββ | 803/1545 [07:00<06:47, 1.82it/s]
52%|ββββββ | 804/1545 [07:00<06:51, 1.80it/s]
52%|ββββββ | 805/1545 [07:01<06:23, 1.93it/s]
52%|ββββββ | 806/1545 [07:01<06:40, 1.85it/s]
52%|ββββββ | 807/1545 [07:02<06:49, 1.80it/s]
52%|ββββββ | 808/1545 [07:02<06:32, 1.88it/s]
52%|ββββββ | 809/1545 [07:03<05:57, 2.06it/s]
52%|ββββββ | 810/1545 [07:03<06:18, 1.94it/s]
{'loss': 0.0703, 'grad_norm': 9.38598532229662e-10, 'learning_rate': 4.7572815533980585e-06, 'rewards/chosen': -24.983036041259766, 'rewards/rejected': -45.387813568115234, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 20.404781341552734, 'logps/chosen': -387.3971252441406, 'logps/rejected': -552.9581909179688, 'logits/chosen': -2.1105751991271973, 'logits/rejected': -3.3740882873535156, 'epoch': 0.52} |
|
52%|ββββββ | 811/1545 [07:04<06:21, 1.92it/s]
53%|ββββββ | 812/1545 [07:04<06:06, 2.00it/s]
53%|ββββββ | 813/1545 [07:05<06:30, 1.88it/s]
53%|ββββββ | 814/1545 [07:05<06:40, 1.83it/s]
53%|ββββββ | 815/1545 [07:06<06:39, 1.83it/s]
53%|ββββββ | 816/1545 [07:06<06:08, 1.98it/s]
53%|ββββββ | 817/1545 [07:07<06:27, 1.88it/s]
53%|ββββββ | 818/1545 [07:07<06:34, 1.85it/s]
53%|ββββββ | 819/1545 [07:08<06:28, 1.87it/s]
53%|ββββββ | 820/1545 [07:09<06:32, 1.85it/s]
{'loss': 1.553, 'grad_norm': 0.0, 'learning_rate': 4.6925566343042074e-06, 'rewards/chosen': -23.155715942382812, 'rewards/rejected': -62.099571228027344, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 38.94385528564453, 'logps/chosen': -375.70166015625, 'logps/rejected': -735.560791015625, 'logits/chosen': -1.7601606845855713, 'logits/rejected': -3.79761004447937, 'epoch': 0.53} |
|
53%|ββββββ | 821/1545 [07:09<06:43, 1.79it/s]
53%|ββββββ | 822/1545 [07:10<06:47, 1.77it/s]
53%|ββββββ | 823/1545 [07:10<06:20, 1.90it/s]
53%|ββββββ | 824/1545 [07:11<06:30, 1.85it/s]
53%|ββββββ | 825/1545 [07:11<05:53, 2.04it/s]
53%|ββββββ | 826/1545 [07:12<06:06, 1.96it/s]
54%|ββββββ | 827/1545 [07:12<05:45, 2.08it/s]
54%|ββββββ | 828/1545 [07:13<06:05, 1.96it/s]
54%|ββββββ | 829/1545 [07:13<06:12, 1.92it/s]
54%|ββββββ | 830/1545 [07:14<06:14, 1.91it/s]
{'loss': 0.0695, 'grad_norm': 2.703125, 'learning_rate': 4.627831715210356e-06, 'rewards/chosen': -38.55735397338867, 'rewards/rejected': -69.79898834228516, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 31.241634368896484, 'logps/chosen': -564.8526611328125, 'logps/rejected': -828.9482421875, 'logits/chosen': -2.5433785915374756, 'logits/rejected': -3.9132227897644043, 'epoch': 0.54} |
|
54%|ββββββ | 831/1545 [07:14<06:17, 1.89it/s]
54%|ββββββ | 832/1545 [07:15<06:26, 1.85it/s]
54%|ββββββ | 833/1545 [07:15<06:29, 1.83it/s]
54%|ββββββ | 834/1545 [07:16<06:03, 1.95it/s]
54%|ββββββ | 835/1545 [07:16<06:19, 1.87it/s]
54%|ββββββ | 836/1545 [07:17<06:26, 1.84it/s]
54%|ββββββ | 837/1545 [07:18<06:31, 1.81it/s]
54%|ββββββ | 838/1545 [07:18<06:25, 1.84it/s]
54%|ββββββ | 839/1545 [07:19<06:31, 1.80it/s]
54%|ββββββ | 840/1545 [07:19<06:33, 1.79it/s]
{'loss': 0.0001, 'grad_norm': 3.91155481338501e-07, 'learning_rate': 4.563106796116505e-06, 'rewards/chosen': -48.7236328125, 'rewards/rejected': -81.7518539428711, 'rewards/accuracies': 1.0, 'rewards/margins': 33.028221130371094, 'logps/chosen': -654.4888916015625, 'logps/rejected': -931.0814208984375, 'logits/chosen': -2.453652858734131, 'logits/rejected': -4.23397970199585, 'epoch': 0.54} |
|
54%|ββββββ | 841/1545 [07:20<06:12, 1.89it/s]
54%|ββββββ | 842/1545 [07:20<06:21, 1.84it/s]
55%|ββββββ | 843/1545 [07:21<06:25, 1.82it/s]
55%|ββββββ | 844/1545 [07:21<06:21, 1.84it/s]
55%|ββββββ | 845/1545 [07:22<06:10, 1.89it/s]
55%|ββββββ | 846/1545 [07:22<06:18, 1.85it/s]
55%|ββββββ | 847/1545 [07:23<06:19, 1.84it/s]
55%|ββββββ | 848/1545 [07:23<05:54, 1.97it/s]
55%|ββββββ | 849/1545 [07:24<06:07, 1.89it/s]
55%|ββββββ | 850/1545 [07:25<06:14, 1.86it/s]
{'loss': 0.0, 'grad_norm': 2.286988957586611e-19, 'learning_rate': 4.498381877022654e-06, 'rewards/chosen': -44.718666076660156, 'rewards/rejected': -92.24812316894531, 'rewards/accuracies': 1.0, 'rewards/margins': 47.529449462890625, 'logps/chosen': -589.9989013671875, 'logps/rejected': -1017.69140625, 'logits/chosen': -3.3844847679138184, 'logits/rejected': -4.966015338897705, 'epoch': 0.55} |
|
55%|ββββββ | 851/1545 [07:25<06:21, 1.82it/s]
55%|ββββββ | 852/1545 [07:26<05:57, 1.94it/s]
55%|ββββββ | 853/1545 [07:26<06:09, 1.87it/s]
55%|ββββββ | 854/1545 [07:27<06:16, 1.83it/s]
55%|ββββββ | 855/1545 [07:27<06:05, 1.89it/s]
55%|ββββββ | 856/1545 [07:28<06:06, 1.88it/s]
55%|ββββββ | 857/1545 [07:28<06:19, 1.81it/s]
56%|ββββββ | 858/1545 [07:29<06:22, 1.80it/s]
56%|ββββββ | 859/1545 [07:29<05:49, 1.96it/s]
56%|ββββββ | 860/1545 [07:30<06:02, 1.89it/s]
{'loss': 0.8693, 'grad_norm': 0.0, 'learning_rate': 4.433656957928803e-06, 'rewards/chosen': -40.95757293701172, 'rewards/rejected': -79.82716369628906, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 38.86958694458008, 'logps/chosen': -559.9370727539062, 'logps/rejected': -917.9359130859375, 'logits/chosen': -3.1516611576080322, 'logits/rejected': -4.5850043296813965, 'epoch': 0.56} |
|
56%|ββββββ | 861/1545 [07:30<06:11, 1.84it/s]
56%|ββββββ | 862/1545 [07:31<06:10, 1.85it/s]
56%|ββββββ | 863/1545 [07:31<06:05, 1.87it/s]
56%|ββββββ | 864/1545 [07:32<06:12, 1.83it/s]
56%|ββββββ | 865/1545 [07:33<06:18, 1.80it/s]
56%|ββββββ | 866/1545 [07:33<05:57, 1.90it/s]
56%|ββββββ | 867/1545 [07:34<06:09, 1.84it/s]
56%|ββββββ | 868/1545 [07:34<05:36, 2.01it/s]
56%|ββββββ | 869/1545 [07:35<05:47, 1.95it/s]
56%|ββββββ | 870/1545 [07:35<05:25, 2.07it/s]
{'loss': 1.2715, 'grad_norm': 0.0, 'learning_rate': 4.368932038834952e-06, 'rewards/chosen': -32.4947509765625, 'rewards/rejected': -78.07811737060547, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 45.58336639404297, 'logps/chosen': -480.0074157714844, 'logps/rejected': -896.2662963867188, 'logits/chosen': -2.7316079139709473, 'logits/rejected': -4.247876167297363, 'epoch': 0.56} |
|
56%|ββββββ | 871/1545 [07:36<05:46, 1.95it/s]
56%|ββββββ | 872/1545 [07:36<05:55, 1.89it/s]
57%|ββββββ | 873/1545 [07:37<05:58, 1.88it/s]
57%|ββββββ | 874/1545 [07:37<05:49, 1.92it/s]
57%|ββββββ | 875/1545 [07:38<05:59, 1.86it/s]
57%|ββββββ | 876/1545 [07:38<06:03, 1.84it/s]
57%|ββββββ | 877/1545 [07:39<05:44, 1.94it/s]
57%|ββββββ | 878/1545 [07:39<05:56, 1.87it/s]
57%|ββββββ | 879/1545 [07:40<05:24, 2.05it/s]
57%|ββββββ | 880/1545 [07:40<05:45, 1.93it/s]
{'loss': 2.2344, 'grad_norm': 3.552436828613281e-05, 'learning_rate': 4.304207119741101e-06, 'rewards/chosen': -30.8316707611084, 'rewards/rejected': -63.74101638793945, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 32.909339904785156, 'logps/chosen': -454.8379821777344, 'logps/rejected': -745.82373046875, 'logits/chosen': -2.527254581451416, 'logits/rejected': -3.76324725151062, 'epoch': 0.57} |
|
57%|ββββββ | 881/1545 [07:41<05:34, 1.98it/s]
57%|ββββββ | 882/1545 [07:41<05:45, 1.92it/s]
57%|ββββββ | 883/1545 [07:42<05:49, 1.89it/s]
57%|ββββββ | 884/1545 [07:42<05:48, 1.90it/s]
57%|ββββββ | 885/1545 [07:43<05:36, 1.96it/s]
57%|ββββββ | 886/1545 [07:44<05:55, 1.85it/s]
57%|ββββββ | 887/1545 [07:44<05:52, 1.87it/s]
57%|ββββββ | 888/1545 [07:45<05:39, 1.94it/s]
58%|ββββββ | 889/1545 [07:46<08:42, 1.25it/s]
58%|ββββββ | 890/1545 [07:47<07:52, 1.39it/s]
{'loss': 0.0, 'grad_norm': 3.0547380447387695e-07, 'learning_rate': 4.23948220064725e-06, 'rewards/chosen': -15.884663581848145, 'rewards/rejected': -51.30836868286133, 'rewards/accuracies': 1.0, 'rewards/margins': 35.423702239990234, 'logps/chosen': -304.71734619140625, 'logps/rejected': -617.8062744140625, 'logits/chosen': -1.6831843852996826, 'logits/rejected': -3.7176902294158936, 'epoch': 0.58} |
|
58%|ββββββ | 891/1545 [07:47<07:17, 1.50it/s]
58%|ββββββ | 892/1545 [07:48<06:43, 1.62it/s]
58%|ββββββ | 893/1545 [07:48<06:37, 1.64it/s]
58%|ββββββ | 894/1545 [07:49<06:27, 1.68it/s]
58%|ββββββ | 895/1545 [07:49<05:58, 1.81it/s]
58%|ββββββ | 896/1545 [07:50<06:07, 1.77it/s]
58%|ββββββ | 897/1545 [07:50<06:11, 1.74it/s]
58%|ββββββ | 898/1545 [07:51<06:01, 1.79it/s]
58%|ββββββ | 899/1545 [07:51<05:54, 1.82it/s]
58%|ββββββ | 900/1545 [07:52<05:58, 1.80it/s]
{'loss': 0.0021, 'grad_norm': 2.453125, 'learning_rate': 4.1747572815533986e-06, 'rewards/chosen': -22.741947174072266, 'rewards/rejected': -60.978431701660156, 'rewards/accuracies': 1.0, 'rewards/margins': 38.236488342285156, 'logps/chosen': -365.04083251953125, 'logps/rejected': -720.0584716796875, 'logits/chosen': -2.4995644092559814, 'logits/rejected': -3.6995277404785156, 'epoch': 0.58} |
|
58%|ββββββ | 901/1545 [07:53<06:02, 1.78it/s]
58%|ββββββ | 902/1545 [07:53<05:34, 1.92it/s]
58%|ββββββ | 903/1545 [07:54<05:43, 1.87it/s]
59%|ββββββ | 904/1545 [07:54<05:45, 1.86it/s]
59%|ββββββ | 905/1545 [07:55<05:46, 1.85it/s]
59%|ββββββ | 906/1545 [07:55<05:21, 1.99it/s]
59%|ββββββ | 907/1545 [07:56<06:46, 1.57it/s]
59%|ββββββ | 908/1545 [07:56<06:03, 1.75it/s]
59%|ββββββ | 909/1545 [07:57<05:44, 1.84it/s]
59%|ββββββ | 910/1545 [07:57<05:56, 1.78it/s]
{'loss': 0.0, 'grad_norm': 0.001312255859375, 'learning_rate': 4.1100323624595475e-06, 'rewards/chosen': -24.250585556030273, 'rewards/rejected': -54.26154708862305, 'rewards/accuracies': 1.0, 'rewards/margins': 30.010961532592773, 'logps/chosen': -370.4151611328125, 'logps/rejected': -644.9264526367188, 'logits/chosen': -2.4921982288360596, 'logits/rejected': -3.8125457763671875, 'epoch': 0.59} |
|
59%|ββββββ | 910/1545 [07:58<05:56, 1.78it/s]
59%|ββββββ | 911/1545 [07:58<05:59, 1.76it/s]
59%|ββββββ | 912/1545 [07:58<05:20, 1.97it/s]
59%|ββββββ | 913/1545 [07:59<05:02, 2.09it/s]
59%|ββββββ | 914/1545 [07:59<04:43, 2.22it/s]
59%|ββββββ | 915/1545 [08:00<05:10, 2.03it/s]
59%|ββββββ | 916/1545 [08:00<05:22, 1.95it/s]
59%|ββββββ | 917/1545 [08:01<04:39, 2.25it/s]
59%|ββββββ | 918/1545 [08:01<05:09, 2.02it/s]
59%|ββββββ | 919/1545 [08:02<05:21, 1.95it/s]
60%|ββββββ | 920/1545 [08:02<05:27, 1.91it/s]
{'loss': 0.0141, 'grad_norm': 3.790855407714844e-05, 'learning_rate': 4.045307443365696e-06, 'rewards/chosen': -25.498334884643555, 'rewards/rejected': -52.45038986206055, 'rewards/accuracies': 1.0, 'rewards/margins': 26.952056884765625, 'logps/chosen': -437.73626708984375, 'logps/rejected': -665.8396606445312, 'logits/chosen': -2.0662381649017334, 'logits/rejected': -3.4163360595703125, 'epoch': 0.6} |
|
60%|ββββββ | 921/1545 [08:03<05:28, 1.90it/s]
60%|ββββββ | 922/1545 [08:04<05:37, 1.85it/s]
60%|ββββββ | 923/1545 [08:04<05:40, 1.82it/s]
60%|ββββββ | 924/1545 [08:04<05:16, 1.96it/s]
60%|ββββββ | 925/1545 [08:05<05:30, 1.87it/s]
60%|ββββββ | 926/1545 [08:06<05:42, 1.81it/s]
60%|ββββββ | 927/1545 [08:06<05:38, 1.82it/s]
60%|ββββββ | 928/1545 [08:07<05:29, 1.88it/s]
60%|ββββββ | 929/1545 [08:07<05:41, 1.80it/s]
60%|ββββββ | 930/1545 [08:08<05:35, 1.84it/s]
{'loss': 0.0, 'grad_norm': 9.441375732421875e-05, 'learning_rate': 3.980582524271845e-06, 'rewards/chosen': -17.19916343688965, 'rewards/rejected': -57.50432586669922, 'rewards/accuracies': 1.0, 'rewards/margins': 40.305152893066406, 'logps/chosen': -328.04193115234375, 'logps/rejected': -684.6046752929688, 'logits/chosen': -2.187042236328125, 'logits/rejected': -4.470019340515137, 'epoch': 0.6} |
|
60%|ββββββ | 931/1545 [08:08<05:01, 2.03it/s]
60%|ββββββ | 932/1545 [08:09<04:25, 2.31it/s]
60%|ββββββ | 933/1545 [08:09<04:15, 2.39it/s]
60%|ββββββ | 934/1545 [08:09<04:46, 2.13it/s]
61%|ββββββ | 935/1545 [08:10<05:12, 1.95it/s]
61%|ββββββ | 936/1545 [08:11<05:03, 2.01it/s]
61%|ββββββ | 937/1545 [08:11<05:19, 1.91it/s]
61%|ββββββ | 938/1545 [08:12<05:27, 1.85it/s]
61%|ββββββ | 939/1545 [08:12<05:29, 1.84it/s]
61%|ββββββ | 940/1545 [08:13<05:12, 1.94it/s]
{'loss': 0.0009, 'grad_norm': 5.857145879417658e-10, 'learning_rate': 3.915857605177994e-06, 'rewards/chosen': -24.126405715942383, 'rewards/rejected': -60.42162322998047, 'rewards/accuracies': 1.0, 'rewards/margins': 36.29521942138672, 'logps/chosen': -361.6978454589844, 'logps/rejected': -705.1375732421875, 'logits/chosen': -2.889936923980713, 'logits/rejected': -4.388113975524902, 'epoch': 0.61} |
|
61%|ββββββ | 941/1545 [08:13<05:27, 1.84it/s]
61%|ββββββ | 942/1545 [08:14<05:29, 1.83it/s]
61%|ββββββ | 943/1545 [08:14<05:16, 1.90it/s]
61%|ββββββ | 944/1545 [08:15<05:22, 1.86it/s]
61%|ββββββ | 945/1545 [08:15<05:26, 1.84it/s]
61%|ββββββ | 946/1545 [08:16<05:34, 1.79it/s]
61%|βββββββ | 947/1545 [08:16<05:08, 1.94it/s]
61%|βββββββ | 948/1545 [08:17<05:22, 1.85it/s]
61%|βββββββ | 949/1545 [08:18<05:25, 1.83it/s]
61%|βββββββ | 950/1545 [08:18<05:18, 1.87it/s]
{'loss': 0.1108, 'grad_norm': 3.202843331805323e-21, 'learning_rate': 3.851132686084142e-06, 'rewards/chosen': -30.862689971923828, 'rewards/rejected': -68.66695404052734, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 37.804264068603516, 'logps/chosen': -447.46221923828125, 'logps/rejected': -795.8951416015625, 'logits/chosen': -2.5695688724517822, 'logits/rejected': -4.074126243591309, 'epoch': 0.61} |
|
62%|βββββββ | 951/1545 [08:19<05:24, 1.83it/s]
62%|βββββββ | 952/1545 [08:19<05:27, 1.81it/s]
62%|βββββββ | 953/1545 [08:20<05:30, 1.79it/s]
62%|βββββββ | 954/1545 [08:20<05:07, 1.92it/s]
62%|βββββββ | 955/1545 [08:21<05:16, 1.87it/s]
62%|βββββββ | 956/1545 [08:21<05:17, 1.85it/s]
62%|βββββββ | 957/1545 [08:22<05:15, 1.86it/s]
62%|βββββββ | 958/1545 [08:22<05:07, 1.91it/s]
62%|βββββββ | 959/1545 [08:23<05:18, 1.84it/s]
62%|βββββββ | 960/1545 [08:24<05:23, 1.81it/s]
{'loss': 0.0938, 'grad_norm': 1.7848833522293717e-11, 'learning_rate': 3.7864077669902915e-06, 'rewards/chosen': -31.656015396118164, 'rewards/rejected': -76.21475982666016, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 44.558738708496094, 'logps/chosen': -445.54949951171875, 'logps/rejected': -868.03564453125, 'logits/chosen': -2.807201862335205, 'logits/rejected': -4.515857696533203, 'epoch': 0.62} |
|
62%|βββββββ | 961/1545 [08:24<05:05, 1.91it/s]
62%|βββββββ | 962/1545 [08:25<05:18, 1.83it/s]
62%|βββββββ | 963/1545 [08:25<05:20, 1.81it/s]
62%|βββββββ | 964/1545 [08:26<05:18, 1.83it/s]
62%|βββββββ | 965/1545 [08:26<05:09, 1.88it/s]
63%|βββββββ | 966/1545 [08:27<05:17, 1.82it/s]
63%|βββββββ | 967/1545 [08:27<05:22, 1.79it/s]
63%|βββββββ | 968/1545 [08:28<05:00, 1.92it/s]
63%|βββββββ | 969/1545 [08:28<05:17, 1.81it/s]
63%|βββββββ | 970/1545 [08:29<05:20, 1.80it/s]
{'loss': 0.0, 'grad_norm': 1.8596649169921875e-05, 'learning_rate': 3.721682847896441e-06, 'rewards/chosen': -31.059711456298828, 'rewards/rejected': -74.49284362792969, 'rewards/accuracies': 1.0, 'rewards/margins': 43.43313217163086, 'logps/chosen': -468.68341064453125, 'logps/rejected': -852.8173828125, 'logits/chosen': -2.2017874717712402, 'logits/rejected': -4.380518913269043, 'epoch': 0.63} |
|
63%|βββββββ | 971/1545 [08:29<04:51, 1.97it/s]
63%|βββββββ | 972/1545 [08:30<04:11, 2.28it/s]
63%|βββββββ | 973/1545 [08:30<04:36, 2.07it/s]
63%|βββββββ | 974/1545 [08:31<04:48, 1.98it/s]
63%|βββββββ | 975/1545 [08:31<04:57, 1.91it/s]
63%|βββββββ | 976/1545 [08:32<04:36, 2.06it/s]
63%|βββββββ | 977/1545 [08:32<04:51, 1.95it/s]
63%|βββββββ | 978/1545 [08:33<04:56, 1.91it/s]
63%|βββββββ | 979/1545 [08:33<04:58, 1.90it/s]
63%|βββββββ | 980/1545 [08:34<04:52, 1.93it/s]
{'loss': 0.0, 'grad_norm': 2.656295322589486e-17, 'learning_rate': 3.6569579288025893e-06, 'rewards/chosen': -23.480316162109375, 'rewards/rejected': -78.44859313964844, 'rewards/accuracies': 1.0, 'rewards/margins': 54.9682731628418, 'logps/chosen': -396.80621337890625, 'logps/rejected': -915.10791015625, 'logits/chosen': -2.1880345344543457, 'logits/rejected': -4.083585739135742, 'epoch': 0.63} |
|
63%|βββββββ | 981/1545 [08:35<05:05, 1.84it/s]
64%|βββββββ | 982/1545 [08:35<05:08, 1.83it/s]
64%|βββββββ | 983/1545 [08:36<04:48, 1.94it/s]
64%|βββββββ | 984/1545 [08:36<05:01, 1.86it/s]
64%|βββββββ | 985/1545 [08:37<05:05, 1.84it/s]
64%|βββββββ | 986/1545 [08:37<05:01, 1.85it/s]
64%|βββββββ | 987/1545 [08:38<04:48, 1.93it/s]
64%|βββββββ | 988/1545 [08:38<04:56, 1.88it/s]
64%|βββββββ | 989/1545 [08:39<05:02, 1.84it/s]
64%|βββββββ | 990/1545 [08:39<04:52, 1.90it/s]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 3.592233009708738e-06, 'rewards/chosen': -31.23908042907715, 'rewards/rejected': -84.39486694335938, 'rewards/accuracies': 1.0, 'rewards/margins': 53.155784606933594, 'logps/chosen': -469.037109375, 'logps/rejected': -947.4615478515625, 'logits/chosen': -2.7216758728027344, 'logits/rejected': -4.766693115234375, 'epoch': 0.64} |
|
64%|βββββββ | 991/1545 [08:40<05:05, 1.81it/s]
64%|βββββββ | 992/1545 [08:41<05:06, 1.80it/s]
64%|βββββββ | 993/1545 [08:41<04:36, 2.00it/s]
64%|βββββββ | 994/1545 [08:41<04:24, 2.08it/s]
64%|βββββββ | 995/1545 [08:42<04:40, 1.96it/s]
64%|βββββββ | 996/1545 [08:42<04:50, 1.89it/s]
65%|βββββββ | 997/1545 [08:43<04:52, 1.88it/s]
65%|βββββββ | 998/1545 [08:43<04:31, 2.02it/s]
65%|βββββββ | 999/1545 [08:44<04:45, 1.91it/s]
65%|βββββββ | 1000/1545 [08:45<04:48, 1.89it/s]
{'loss': 0.0, 'grad_norm': 2.8189256484623115e-18, 'learning_rate': 3.5275080906148866e-06, 'rewards/chosen': -29.645822525024414, 'rewards/rejected': -76.21923065185547, 'rewards/accuracies': 1.0, 'rewards/margins': 46.57341384887695, 'logps/chosen': -474.364990234375, 'logps/rejected': -886.2589721679688, 'logits/chosen': -2.4084129333496094, 'logits/rejected': -3.961566209793091, 'epoch': 0.65} |
|
65%|βββββββ | 1001/1545 [08:45<04:48, 1.88it/s]
65%|βββββββ | 1002/1545 [08:46<04:52, 1.86it/s]
65%|βββββββ | 1003/1545 [08:47<07:29, 1.20it/s]
65%|βββββββ | 1004/1545 [08:48<06:42, 1.34it/s]
65%|βββββββ | 1005/1545 [08:48<05:51, 1.54it/s]
65%|βββββββ | 1006/1545 [08:49<05:40, 1.58it/s]
65%|βββββββ | 1007/1545 [08:49<05:29, 1.63it/s]
65%|βββββββ | 1008/1545 [08:50<05:10, 1.73it/s]
65%|βββββββ | 1009/1545 [08:50<05:14, 1.70it/s]
65%|βββββββ | 1010/1545 [08:51<05:14, 1.70it/s]
{'loss': 0.0, 'grad_norm': 6.352747104407253e-22, 'learning_rate': 3.462783171521036e-06, 'rewards/chosen': -32.23039245605469, 'rewards/rejected': -103.32981872558594, 'rewards/accuracies': 1.0, 'rewards/margins': 71.09942626953125, 'logps/chosen': -480.616455078125, 'logps/rejected': -1155.624267578125, 'logits/chosen': -2.7920470237731934, 'logits/rejected': -4.624792575836182, 'epoch': 0.65} |
|
65%|βββββββ | 1011/1545 [08:52<05:09, 1.73it/s]
66%|βββββββ | 1012/1545 [08:52<04:51, 1.83it/s]
66%|βββββββ | 1013/1545 [08:53<04:57, 1.79it/s]
66%|βββββββ | 1014/1545 [08:53<04:57, 1.79it/s]
66%|βββββββ | 1015/1545 [08:54<04:42, 1.88it/s]
66%|βββββββ | 1016/1545 [08:54<04:18, 2.05it/s]
66%|βββββββ | 1017/1545 [08:54<04:00, 2.20it/s]
66%|βββββββ | 1018/1545 [08:55<04:18, 2.04it/s]
66%|βββββββ | 1019/1545 [08:56<04:24, 1.99it/s]
66%|βββββββ | 1020/1545 [08:56<04:24, 1.99it/s]
{'loss': 0.0, 'grad_norm': 1.895427703857422e-05, 'learning_rate': 3.398058252427185e-06, 'rewards/chosen': -47.92738342285156, 'rewards/rejected': -90.48072052001953, 'rewards/accuracies': 1.0, 'rewards/margins': 42.553340911865234, 'logps/chosen': -608.0487060546875, 'logps/rejected': -1020.53955078125, 'logits/chosen': -3.5880520343780518, 'logits/rejected': -4.89176607131958, 'epoch': 0.66} |
|
66%|βββββββ | 1021/1545 [08:57<04:38, 1.88it/s]
66%|βββββββ | 1022/1545 [08:57<04:45, 1.83it/s]
66%|βββββββ | 1023/1545 [08:58<04:24, 1.97it/s]
66%|βββββββ | 1024/1545 [08:58<04:35, 1.89it/s]
66%|βββββββ | 1025/1545 [08:59<04:41, 1.85it/s]
66%|βββββββ | 1026/1545 [08:59<04:38, 1.86it/s]
66%|βββββββ | 1027/1545 [09:00<04:26, 1.94it/s]
67%|βββββββ | 1028/1545 [09:00<04:37, 1.86it/s]
67%|βββββββ | 1029/1545 [09:01<04:41, 1.84it/s]
67%|βββββββ | 1030/1545 [09:01<04:12, 2.04it/s]
{'loss': 0.0067, 'grad_norm': 0.000110626220703125, 'learning_rate': 3.3333333333333333e-06, 'rewards/chosen': -31.464359283447266, 'rewards/rejected': -79.84764862060547, 'rewards/accuracies': 1.0, 'rewards/margins': 48.38329315185547, 'logps/chosen': -460.6871032714844, 'logps/rejected': -919.3955078125, 'logits/chosen': -2.863114833831787, 'logits/rejected': -4.340217113494873, 'epoch': 0.67} |
|
67%|βββββββ | 1031/1545 [09:02<04:15, 2.01it/s]
67%|βββββββ | 1032/1545 [09:02<04:26, 1.92it/s]
67%|βββββββ | 1033/1545 [09:03<04:29, 1.90it/s]
67%|βββββββ | 1034/1545 [09:03<04:16, 1.99it/s]
67%|βββββββ | 1035/1545 [09:04<04:28, 1.90it/s]
67%|βββββββ | 1036/1545 [09:04<04:33, 1.86it/s]
67%|βββββββ | 1037/1545 [09:05<04:35, 1.84it/s]
67%|βββββββ | 1038/1545 [09:05<04:20, 1.95it/s]
67%|βββββββ | 1039/1545 [09:06<04:29, 1.88it/s]
67%|βββββββ | 1040/1545 [09:07<04:34, 1.84it/s]
{'loss': 0.0878, 'grad_norm': 2.0161650127192843e-13, 'learning_rate': 3.2686084142394826e-06, 'rewards/chosen': -22.702207565307617, 'rewards/rejected': -70.41169738769531, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 47.7094841003418, 'logps/chosen': -373.5330810546875, 'logps/rejected': -809.3206787109375, 'logits/chosen': -2.131121873855591, 'logits/rejected': -4.659956932067871, 'epoch': 0.67} |
|
67%|βββββββ | 1041/1545 [09:07<04:37, 1.81it/s]
67%|βββββββ | 1042/1545 [09:08<05:11, 1.61it/s]
68%|βββββββ | 1043/1545 [09:09<05:29, 1.52it/s]
68%|βββββββ | 1044/1545 [09:09<05:04, 1.65it/s]
68%|βββββββ | 1045/1545 [09:10<04:30, 1.85it/s]
68%|βββββββ | 1046/1545 [09:10<04:49, 1.73it/s]
68%|βββββββ | 1047/1545 [09:11<04:48, 1.73it/s]
68%|βββββββ | 1048/1545 [09:11<04:22, 1.89it/s]
68%|βββββββ | 1049/1545 [09:12<04:33, 1.81it/s]
68%|βββββββ | 1050/1545 [09:12<04:05, 2.01it/s]
{'loss': 0.0693, 'grad_norm': 1.5802470443304628e-11, 'learning_rate': 3.2038834951456315e-06, 'rewards/chosen': -28.963123321533203, 'rewards/rejected': -69.37215423583984, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 40.409034729003906, 'logps/chosen': -429.06085205078125, 'logps/rejected': -803.1228637695312, 'logits/chosen': -2.533982038497925, 'logits/rejected': -4.221534252166748, 'epoch': 0.68} |
|
68%|βββββββ | 1051/1545 [09:13<04:13, 1.95it/s]
68%|βββββββ | 1052/1545 [09:13<04:01, 2.04it/s]
68%|βββββββ | 1053/1545 [09:14<04:16, 1.92it/s]
68%|βββββββ | 1054/1545 [09:14<04:23, 1.86it/s]
68%|βββββββ | 1055/1545 [09:15<04:14, 1.93it/s]
68%|βββββββ | 1056/1545 [09:15<04:21, 1.87it/s]
68%|βββββββ | 1057/1545 [09:16<04:26, 1.83it/s]
68%|βββββββ | 1058/1545 [09:17<04:28, 1.81it/s]
69%|βββββββ | 1059/1545 [09:17<04:07, 1.96it/s]
69%|βββββββ | 1060/1545 [09:18<04:14, 1.90it/s]
{'loss': 0.0598, 'grad_norm': 0.0, 'learning_rate': 3.13915857605178e-06, 'rewards/chosen': -34.7342643737793, 'rewards/rejected': -71.29218292236328, 'rewards/accuracies': 1.0, 'rewards/margins': 36.55791091918945, 'logps/chosen': -516.7052612304688, 'logps/rejected': -824.2127685546875, 'logits/chosen': -3.0466580390930176, 'logits/rejected': -4.670151710510254, 'epoch': 0.69} |
|
69%|βββββββ | 1061/1545 [09:18<04:19, 1.86it/s]
69%|βββββββ | 1062/1545 [09:19<04:21, 1.85it/s]
69%|βββββββ | 1063/1545 [09:19<04:43, 1.70it/s]
69%|βββββββ | 1064/1545 [09:20<05:05, 1.57it/s]
69%|βββββββ | 1065/1545 [09:21<04:41, 1.70it/s]
69%|βββββββ | 1066/1545 [09:21<04:41, 1.70it/s]
69%|βββββββ | 1067/1545 [09:22<04:38, 1.72it/s]
69%|βββββββ | 1068/1545 [09:22<04:35, 1.73it/s]
69%|βββββββ | 1069/1545 [09:23<04:17, 1.85it/s]
69%|βββββββ | 1070/1545 [09:23<04:22, 1.81it/s]
{'loss': 0.0, 'grad_norm': 6.733911930671688e-20, 'learning_rate': 3.0744336569579293e-06, 'rewards/chosen': -25.6965274810791, 'rewards/rejected': -69.01310729980469, 'rewards/accuracies': 1.0, 'rewards/margins': 43.31658172607422, 'logps/chosen': -386.03173828125, 'logps/rejected': -804.2999877929688, 'logits/chosen': -2.817981004714966, 'logits/rejected': -4.601845741271973, 'epoch': 0.69} |
|
69%|βββββββ | 1071/1545 [09:24<04:26, 1.78it/s]
69%|βββββββ | 1072/1545 [09:24<04:21, 1.81it/s]
69%|βββββββ | 1073/1545 [09:25<04:17, 1.83it/s]
70%|βββββββ | 1074/1545 [09:26<04:21, 1.80it/s]
70%|βββββββ | 1075/1545 [09:26<04:22, 1.79it/s]
70%|βββββββ | 1076/1545 [09:27<04:06, 1.90it/s]
70%|βββββββ | 1077/1545 [09:27<04:15, 1.83it/s]
70%|βββββββ | 1078/1545 [09:28<04:16, 1.82it/s]
70%|βββββββ | 1079/1545 [09:28<04:14, 1.83it/s]
70%|βββββββ | 1080/1545 [09:29<04:08, 1.87it/s]
{'loss': 0.0, 'grad_norm': 4.3298697960381105e-15, 'learning_rate': 3.0097087378640778e-06, 'rewards/chosen': -30.43539047241211, 'rewards/rejected': -81.25576782226562, 'rewards/accuracies': 1.0, 'rewards/margins': 50.82037353515625, 'logps/chosen': -435.7405700683594, 'logps/rejected': -922.3958740234375, 'logits/chosen': -3.113784074783325, 'logits/rejected': -4.786489963531494, 'epoch': 0.7} |
|
70%|βββββββ | 1081/1545 [09:29<04:16, 1.81it/s]
70%|βββββββ | 1082/1545 [09:30<04:14, 1.82it/s]
70%|βββββββ | 1083/1545 [09:30<03:58, 1.93it/s]
70%|βββββββ | 1084/1545 [09:31<04:07, 1.86it/s]
70%|βββββββ | 1085/1545 [09:32<04:12, 1.82it/s]
70%|βββββββ | 1086/1545 [09:32<04:11, 1.82it/s]
70%|βββββββ | 1087/1545 [09:33<04:09, 1.84it/s]
70%|βββββββ | 1088/1545 [09:33<04:12, 1.81it/s]
70%|βββββββ | 1089/1545 [09:34<04:12, 1.80it/s]
71%|βββββββ | 1090/1545 [09:34<03:54, 1.94it/s]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.9449838187702267e-06, 'rewards/chosen': -32.56324005126953, 'rewards/rejected': -83.65999603271484, 'rewards/accuracies': 1.0, 'rewards/margins': 51.09675979614258, 'logps/chosen': -480.71697998046875, 'logps/rejected': -947.0802612304688, 'logits/chosen': -2.62692928314209, 'logits/rejected': -4.6637372970581055, 'epoch': 0.71} |
|
71%|βββββββ | 1091/1545 [09:35<04:03, 1.87it/s]
71%|βββββββ | 1092/1545 [09:35<04:06, 1.84it/s]
71%|βββββββ | 1093/1545 [09:36<04:03, 1.86it/s]
71%|βββββββ | 1094/1545 [09:36<03:48, 1.97it/s]
71%|βββββββ | 1095/1545 [09:37<03:56, 1.90it/s]
71%|βββββββ | 1096/1545 [09:37<04:01, 1.86it/s]
71%|βββββββ | 1097/1545 [09:38<03:55, 1.91it/s]
71%|βββββββ | 1098/1545 [09:38<03:56, 1.89it/s]
71%|βββββββ | 1099/1545 [09:39<03:35, 2.07it/s]
71%|βββββββ | 1100/1545 [09:39<03:47, 1.96it/s]
{'loss': 0.0, 'grad_norm': 1.8925892415213273e-21, 'learning_rate': 2.880258899676376e-06, 'rewards/chosen': -30.453670501708984, 'rewards/rejected': -84.54370880126953, 'rewards/accuracies': 1.0, 'rewards/margins': 54.09003448486328, 'logps/chosen': -462.65472412109375, 'logps/rejected': -947.7019653320312, 'logits/chosen': -2.7360405921936035, 'logits/rejected': -4.802727699279785, 'epoch': 0.71} |
|
71%|ββββββββ | 1101/1545 [09:40<03:39, 2.02it/s]
71%|ββββββββ | 1102/1545 [09:40<03:50, 1.92it/s]
71%|ββββββββ | 1103/1545 [09:41<04:02, 1.82it/s]
71%|ββββββββ | 1104/1545 [09:41<03:38, 2.02it/s]
72%|ββββββββ | 1105/1545 [09:42<03:30, 2.09it/s]
72%|ββββββββ | 1106/1545 [09:42<03:46, 1.94it/s]
72%|ββββββββ | 1107/1545 [09:43<03:51, 1.89it/s]
72%|ββββββββ | 1108/1545 [09:44<03:48, 1.91it/s]
72%|ββββββββ | 1109/1545 [09:44<03:18, 2.20it/s]
72%|ββββββββ | 1110/1545 [09:44<03:36, 2.01it/s]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.8155339805825245e-06, 'rewards/chosen': -17.798450469970703, 'rewards/rejected': -69.56123352050781, 'rewards/accuracies': 1.0, 'rewards/margins': 51.76277542114258, 'logps/chosen': -317.5895690917969, 'logps/rejected': -803.85595703125, 'logits/chosen': -2.070510149002075, 'logits/rejected': -4.086310863494873, 'epoch': 0.72} |
|
72%|ββββββββ | 1111/1545 [09:45<03:47, 1.91it/s]
72%|ββββββββ | 1112/1545 [09:46<03:50, 1.88it/s]
72%|ββββββββ | 1113/1545 [09:46<03:42, 1.94it/s]
72%|ββββββββ | 1114/1545 [09:47<03:50, 1.87it/s]
72%|ββββββββ | 1115/1545 [09:47<03:51, 1.85it/s]
72%|ββββββββ | 1116/1545 [09:48<05:18, 1.35it/s]
72%|ββββββββ | 1117/1545 [09:49<05:06, 1.40it/s]
72%|ββββββββ | 1118/1545 [09:50<04:49, 1.48it/s]
72%|ββββββββ | 1119/1545 [09:50<04:30, 1.58it/s]
72%|ββββββββ | 1120/1545 [09:51<04:27, 1.59it/s]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.7508090614886734e-06, 'rewards/chosen': -21.955896377563477, 'rewards/rejected': -73.12271881103516, 'rewards/accuracies': 1.0, 'rewards/margins': 51.166812896728516, 'logps/chosen': -381.0746154785156, 'logps/rejected': -851.5888671875, 'logits/chosen': -2.0508859157562256, 'logits/rejected': -4.157201290130615, 'epoch': 0.72} |
|
73%|ββββββββ | 1121/1545 [09:51<04:25, 1.59it/s]
73%|ββββββββ | 1122/1545 [09:52<04:16, 1.65it/s]
73%|ββββββββ | 1123/1545 [09:52<04:07, 1.70it/s]
73%|ββββββββ | 1124/1545 [09:53<04:10, 1.68it/s]
73%|ββββββββ | 1125/1545 [09:54<04:03, 1.73it/s]
73%|ββββββββ | 1126/1545 [09:54<03:45, 1.86it/s]
73%|ββββββββ | 1127/1545 [09:55<03:53, 1.79it/s]
73%|ββββββββ | 1128/1545 [09:55<03:53, 1.79it/s]
73%|ββββββββ | 1129/1545 [09:56<03:49, 1.81it/s]
73%|ββββββββ | 1130/1545 [09:56<03:49, 1.81it/s]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.686084142394822e-06, 'rewards/chosen': -28.051372528076172, 'rewards/rejected': -90.2176284790039, 'rewards/accuracies': 1.0, 'rewards/margins': 62.1662483215332, 'logps/chosen': -435.447509765625, 'logps/rejected': -1026.18798828125, 'logits/chosen': -2.331815242767334, 'logits/rejected': -4.296026706695557, 'epoch': 0.73} |
|
73%|ββββββββ | 1131/1545 [09:57<03:51, 1.78it/s]
73%|ββββββββ | 1132/1545 [09:57<03:54, 1.76it/s]
73%|ββββββββ | 1133/1545 [09:58<03:37, 1.89it/s]
73%|ββββββββ | 1134/1545 [09:59<03:44, 1.83it/s]
73%|ββββββββ | 1135/1545 [09:59<03:46, 1.81it/s]
74%|ββββββββ | 1136/1545 [10:00<03:44, 1.82it/s]
74%|ββββββββ | 1137/1545 [10:00<03:39, 1.86it/s]
74%|ββββββββ | 1138/1545 [10:01<03:42, 1.83it/s]
74%|ββββββββ | 1139/1545 [10:01<03:44, 1.81it/s]
74%|ββββββββ | 1140/1545 [10:02<03:30, 1.93it/s]
{'loss': 0.0, 'grad_norm': 8.348877145181177e-14, 'learning_rate': 2.621359223300971e-06, 'rewards/chosen': -28.582677841186523, 'rewards/rejected': -89.46198272705078, 'rewards/accuracies': 1.0, 'rewards/margins': 60.879295349121094, 'logps/chosen': -428.8086853027344, 'logps/rejected': -1000.7501831054688, 'logits/chosen': -2.752683162689209, 'logits/rejected': -4.274931907653809, 'epoch': 0.74} |
|
74%|ββββββββ | 1141/1545 [10:02<03:39, 1.84it/s]
74%|ββββββββ | 1142/1545 [10:03<03:40, 1.83it/s]
74%|ββββββββ | 1143/1545 [10:03<03:38, 1.84it/s]
74%|ββββββββ | 1144/1545 [10:04<03:29, 1.92it/s]
74%|ββββββββ | 1145/1545 [10:04<03:36, 1.85it/s]
74%|ββββββββ | 1146/1545 [10:05<03:34, 1.86it/s]
74%|ββββββββ | 1147/1545 [10:05<03:24, 1.94it/s]
74%|ββββββββ | 1148/1545 [10:06<03:35, 1.85it/s]
74%|ββββββββ | 1149/1545 [10:07<03:37, 1.82it/s]
74%|ββββββββ | 1150/1545 [10:07<03:37, 1.81it/s]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.55663430420712e-06, 'rewards/chosen': -30.311452865600586, 'rewards/rejected': -88.15357971191406, 'rewards/accuracies': 1.0, 'rewards/margins': 57.84212112426758, 'logps/chosen': -434.56097412109375, 'logps/rejected': -994.1755981445312, 'logits/chosen': -2.7379162311553955, 'logits/rejected': -4.308984279632568, 'epoch': 0.74} |
|
74%|ββββββββ | 1151/1545 [10:08<03:25, 1.91it/s]
75%|ββββββββ | 1152/1545 [10:08<03:31, 1.86it/s]
75%|ββββββββ | 1153/1545 [10:09<03:33, 1.84it/s]
75%|ββββββββ | 1154/1545 [10:09<03:26, 1.89it/s]
75%|ββββββββ | 1155/1545 [10:10<03:32, 1.84it/s]
75%|ββββββββ | 1156/1545 [10:10<03:35, 1.81it/s]
75%|ββββββββ | 1157/1545 [10:11<03:34, 1.81it/s]
75%|ββββββββ | 1158/1545 [10:11<03:16, 1.97it/s]
75%|ββββββββ | 1159/1545 [10:12<03:27, 1.86it/s]
75%|ββββββββ | 1160/1545 [10:13<03:27, 1.86it/s]
{'loss': 5.2559, 'grad_norm': 1.4543533325195312e-05, 'learning_rate': 2.491909385113269e-06, 'rewards/chosen': -36.78795623779297, 'rewards/rejected': -50.73297119140625, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 13.945019721984863, 'logps/chosen': -489.45684814453125, 'logps/rejected': -604.6040649414062, 'logits/chosen': -2.9106650352478027, 'logits/rejected': -3.734529972076416, 'epoch': 0.75} |
|
75%|ββββββββ | 1161/1545 [10:13<03:25, 1.86it/s]
75%|ββββββββ | 1162/1545 [10:14<03:53, 1.64it/s]
75%|ββββββββ | 1163/1545 [10:14<03:47, 1.68it/s]
75%|ββββββββ | 1164/1545 [10:15<03:38, 1.75it/s]
75%|ββββββββ | 1165/1545 [10:15<03:30, 1.81it/s]
75%|ββββββββ | 1166/1545 [10:16<03:31, 1.79it/s]
76%|ββββββββ | 1167/1545 [10:17<03:31, 1.79it/s]
76%|ββββββββ | 1168/1545 [10:17<03:16, 1.92it/s]
76%|ββββββββ | 1169/1545 [10:18<03:23, 1.85it/s]
76%|ββββββββ | 1170/1545 [10:18<03:23, 1.84it/s]
{'loss': 2.6137, 'grad_norm': 4.929390229335695e-14, 'learning_rate': 2.427184466019418e-06, 'rewards/chosen': -25.188079833984375, 'rewards/rejected': -58.6072998046875, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 33.419219970703125, 'logps/chosen': -412.2906188964844, 'logps/rejected': -685.932861328125, 'logits/chosen': -1.7462126016616821, 'logits/rejected': -3.5213539600372314, 'epoch': 0.76} |
|
76%|ββββββββ | 1171/1545 [10:19<03:22, 1.85it/s]
76%|ββββββββ | 1172/1545 [10:19<03:15, 1.91it/s]
76%|ββββββββ | 1173/1545 [10:20<03:21, 1.84it/s]
76%|ββββββββ | 1174/1545 [10:20<03:18, 1.87it/s]
76%|ββββββββ | 1175/1545 [10:21<03:11, 1.93it/s]
76%|ββββββββ | 1176/1545 [10:21<03:18, 1.86it/s]
76%|ββββββββ | 1177/1545 [10:22<03:21, 1.83it/s]
76%|ββββββββ | 1178/1545 [10:22<03:22, 1.81it/s]
76%|ββββββββ | 1179/1545 [10:23<03:08, 1.95it/s]
76%|ββββββββ | 1180/1545 [10:23<03:17, 1.85it/s]
{'loss': 0.6655, 'grad_norm': 0.0002689361572265625, 'learning_rate': 2.3624595469255667e-06, 'rewards/chosen': -22.50977325439453, 'rewards/rejected': -50.339866638183594, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 27.830097198486328, 'logps/chosen': -364.30804443359375, 'logps/rejected': -620.8726806640625, 'logits/chosen': -2.1491000652313232, 'logits/rejected': -3.5545706748962402, 'epoch': 0.76} |
|
76%|ββββββββ | 1181/1545 [10:24<03:19, 1.82it/s]
77%|ββββββββ | 1182/1545 [10:25<03:12, 1.88it/s]
77%|ββββββββ | 1183/1545 [10:25<03:15, 1.85it/s]
77%|ββββββββ | 1184/1545 [10:25<02:57, 2.04it/s]
77%|ββββββββ | 1185/1545 [10:26<03:04, 1.95it/s]
77%|ββββββββ | 1186/1545 [10:26<02:57, 2.02it/s]
77%|ββββββββ | 1187/1545 [10:27<03:06, 1.92it/s]
77%|ββββββββ | 1188/1545 [10:28<03:13, 1.85it/s]
77%|ββββββββ | 1189/1545 [10:28<02:54, 2.04it/s]
77%|ββββββββ | 1190/1545 [10:28<02:49, 2.09it/s]
{'loss': 0.6483, 'grad_norm': 6.4375, 'learning_rate': 2.297734627831715e-06, 'rewards/chosen': -29.0391788482666, 'rewards/rejected': -50.39924240112305, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 21.360071182250977, 'logps/chosen': -452.758056640625, 'logps/rejected': -620.355712890625, 'logits/chosen': -2.2422232627868652, 'logits/rejected': -2.9050989151000977, 'epoch': 0.77} |
|
77%|ββββββββ | 1191/1545 [10:29<03:02, 1.94it/s]
77%|ββββββββ | 1192/1545 [10:29<02:47, 2.11it/s]
77%|ββββββββ | 1193/1545 [10:30<02:56, 1.99it/s]
77%|ββββββββ | 1194/1545 [10:30<02:49, 2.07it/s]
77%|ββββββββ | 1195/1545 [10:31<02:58, 1.96it/s]
77%|ββββββββ | 1196/1545 [10:32<03:02, 1.91it/s]
77%|ββββββββ | 1197/1545 [10:32<03:03, 1.89it/s]
78%|ββββββββ | 1198/1545 [10:33<02:56, 1.96it/s]
78%|ββββββββ | 1199/1545 [10:33<03:03, 1.89it/s]
78%|ββββββββ | 1200/1545 [10:34<03:06, 1.85it/s]
{'loss': 0.0049, 'grad_norm': 0.119140625, 'learning_rate': 2.2330097087378645e-06, 'rewards/chosen': -24.393327713012695, 'rewards/rejected': -47.943050384521484, 'rewards/accuracies': 1.0, 'rewards/margins': 23.549720764160156, 'logps/chosen': -420.3701171875, 'logps/rejected': -596.0790405273438, 'logits/chosen': -2.2110846042633057, 'logits/rejected': -3.3869075775146484, 'epoch': 0.78} |
|
78%|ββββββββ | 1201/1545 [10:34<02:58, 1.93it/s]
78%|ββββββββ | 1202/1545 [10:35<03:03, 1.87it/s]
78%|ββββββββ | 1203/1545 [10:35<03:08, 1.81it/s]
78%|ββββββββ | 1204/1545 [10:36<03:07, 1.82it/s]
78%|ββββββββ | 1205/1545 [10:36<02:55, 1.93it/s]
78%|ββββββββ | 1206/1545 [10:37<03:01, 1.87it/s]
78%|ββββββββ | 1207/1545 [10:37<03:03, 1.84it/s]
78%|ββββββββ | 1208/1545 [10:38<02:58, 1.89it/s]
78%|ββββββββ | 1209/1545 [10:39<03:00, 1.86it/s]
78%|ββββββββ | 1210/1545 [10:39<03:06, 1.80it/s]
{'loss': 0.0022, 'grad_norm': 2.473825588822365e-10, 'learning_rate': 2.1682847896440134e-06, 'rewards/chosen': -17.865224838256836, 'rewards/rejected': -44.9857063293457, 'rewards/accuracies': 1.0, 'rewards/margins': 27.120479583740234, 'logps/chosen': -318.7569274902344, 'logps/rejected': -567.396240234375, 'logits/chosen': -2.13226580619812, 'logits/rejected': -3.3080012798309326, 'epoch': 0.78} |
|
78%|ββββββββ | 1211/1545 [10:40<03:06, 1.79it/s]
78%|ββββββββ | 1212/1545 [10:40<02:54, 1.91it/s]
79%|ββββββββ | 1213/1545 [10:41<03:02, 1.82it/s]
79%|ββββββββ | 1214/1545 [10:41<02:42, 2.04it/s]
79%|ββββββββ | 1215/1545 [10:41<02:30, 2.20it/s]
79%|ββββββββ | 1216/1545 [10:42<02:28, 2.21it/s]
79%|ββββββββ | 1217/1545 [10:43<02:42, 2.02it/s]
79%|ββββββββ | 1218/1545 [10:43<02:49, 1.93it/s]
79%|ββββββββ | 1219/1545 [10:44<02:53, 1.88it/s]
79%|ββββββββ | 1220/1545 [10:44<02:51, 1.90it/s]
{'loss': 1.6899, 'grad_norm': 1152.0, 'learning_rate': 2.103559870550162e-06, 'rewards/chosen': -28.565698623657227, 'rewards/rejected': -47.352169036865234, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 18.78647232055664, 'logps/chosen': -429.1759338378906, 'logps/rejected': -576.1998901367188, 'logits/chosen': -2.689363718032837, 'logits/rejected': -3.7332401275634766, 'epoch': 0.79} |
|
79%|ββββββββ | 1221/1545 [10:45<02:57, 1.83it/s]
79%|ββββββββ | 1222/1545 [10:45<03:00, 1.79it/s]
79%|ββββββββ | 1223/1545 [10:46<02:47, 1.92it/s]
79%|ββββββββ | 1224/1545 [10:46<02:55, 1.83it/s]
79%|ββββββββ | 1225/1545 [10:47<02:55, 1.82it/s]
79%|ββββββββ | 1226/1545 [10:47<02:54, 1.83it/s]
79%|ββββββββ | 1227/1545 [10:48<02:53, 1.83it/s]
79%|ββββββββ | 1228/1545 [10:49<02:54, 1.82it/s]
80%|ββββββββ | 1229/1545 [10:50<04:09, 1.26it/s]
80%|ββββββββ | 1230/1545 [10:50<03:27, 1.52it/s]
{'loss': 0.8983, 'grad_norm': 2.4400651454925537e-07, 'learning_rate': 2.0388349514563107e-06, 'rewards/chosen': -17.619848251342773, 'rewards/rejected': -43.817138671875, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 26.19728660583496, 'logps/chosen': -310.8443298339844, 'logps/rejected': -550.3328857421875, 'logits/chosen': -2.216456651687622, 'logits/rejected': -3.418893814086914, 'epoch': 0.8} |
|
80%|ββββββββ | 1231/1545 [10:51<03:16, 1.60it/s]
80%|ββββββββ | 1232/1545 [10:52<03:20, 1.56it/s]
80%|ββββββββ | 1233/1545 [10:52<03:16, 1.59it/s]
80%|ββββββββ | 1234/1545 [10:53<02:59, 1.73it/s]
80%|ββββββββ | 1235/1545 [10:53<03:00, 1.72it/s]
80%|ββββββββ | 1236/1545 [10:54<02:57, 1.74it/s]
80%|ββββββββ | 1237/1545 [10:54<02:53, 1.77it/s]
80%|ββββββββ | 1238/1545 [10:55<02:52, 1.78it/s]
80%|ββββββββ | 1239/1545 [10:55<02:55, 1.74it/s]
80%|ββββββββ | 1240/1545 [10:56<02:54, 1.75it/s]
{'loss': 1.4954, 'grad_norm': 1.7394086171407253e-11, 'learning_rate': 1.9741100323624596e-06, 'rewards/chosen': -20.375286102294922, 'rewards/rejected': -44.56410217285156, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 24.18881607055664, 'logps/chosen': -332.5040283203125, 'logps/rejected': -539.218505859375, 'logits/chosen': -2.468999147415161, 'logits/rejected': -3.7816715240478516, 'epoch': 0.8} |
|
80%|ββββββββ | 1241/1545 [10:56<02:43, 1.86it/s]
80%|ββββββββ | 1242/1545 [10:57<02:28, 2.04it/s]
80%|ββββββββ | 1243/1545 [10:57<02:37, 1.91it/s]
81%|ββββββββ | 1244/1545 [10:58<02:40, 1.88it/s]
81%|ββββββββ | 1245/1545 [10:58<02:31, 1.98it/s]
81%|ββββββββ | 1246/1545 [10:59<02:37, 1.90it/s]
81%|ββββββββ | 1247/1545 [11:00<02:42, 1.83it/s]
81%|ββββββββ | 1248/1545 [11:00<02:38, 1.88it/s]
81%|ββββββββ | 1249/1545 [11:01<02:42, 1.82it/s]
81%|ββββββββ | 1250/1545 [11:01<02:45, 1.79it/s]
{'loss': 0.0, 'grad_norm': 0.0177001953125, 'learning_rate': 1.9093851132686085e-06, 'rewards/chosen': -10.88851261138916, 'rewards/rejected': -38.96172332763672, 'rewards/accuracies': 1.0, 'rewards/margins': 28.07320785522461, 'logps/chosen': -251.8507537841797, 'logps/rejected': -504.0205993652344, 'logits/chosen': -1.7540420293807983, 'logits/rejected': -3.0294809341430664, 'epoch': 0.81} |
|
81%|ββββββββ | 1251/1545 [11:02<02:46, 1.76it/s]
81%|ββββββββ | 1252/1545 [11:02<02:42, 1.80it/s]
81%|ββββββββ | 1253/1545 [11:03<02:45, 1.76it/s]
81%|ββββββββ | 1254/1545 [11:04<02:46, 1.75it/s]
81%|ββββββββ | 1255/1545 [11:04<02:33, 1.89it/s]
81%|βββββββββ | 1256/1545 [11:05<02:39, 1.82it/s]
81%|βββββββββ | 1257/1545 [11:05<02:39, 1.81it/s]
81%|βββββββββ | 1258/1545 [11:06<02:38, 1.81it/s]
81%|βββββββββ | 1259/1545 [11:06<02:35, 1.84it/s]
82%|βββββββββ | 1260/1545 [11:07<02:38, 1.80it/s]
{'loss': 0.0, 'grad_norm': 8.96453857421875e-05, 'learning_rate': 1.8446601941747574e-06, 'rewards/chosen': -16.425739288330078, 'rewards/rejected': -49.47471237182617, 'rewards/accuracies': 1.0, 'rewards/margins': 33.04896926879883, 'logps/chosen': -337.70025634765625, 'logps/rejected': -628.4820556640625, 'logits/chosen': -1.5739221572875977, 'logits/rejected': -3.0157792568206787, 'epoch': 0.82} |
|
82%|βββββββββ | 1261/1545 [11:07<02:41, 1.76it/s]
82%|βββββββββ | 1262/1545 [11:08<02:20, 2.02it/s]
82%|βββββββββ | 1263/1545 [11:08<02:26, 1.93it/s]
82%|βββββββββ | 1264/1545 [11:09<02:30, 1.87it/s]
82%|βββββββββ | 1265/1545 [11:09<02:31, 1.84it/s]
82%|βββββββββ | 1266/1545 [11:10<02:22, 1.95it/s]
82%|βββββββββ | 1267/1545 [11:10<02:29, 1.86it/s]
82%|βββββββββ | 1268/1545 [11:11<02:30, 1.84it/s]
82%|βββββββββ | 1269/1545 [11:12<02:27, 1.87it/s]
82%|βββββββββ | 1270/1545 [11:12<02:29, 1.84it/s]
{'loss': 0.2319, 'grad_norm': 1.3096723705530167e-10, 'learning_rate': 1.7799352750809063e-06, 'rewards/chosen': -19.763113021850586, 'rewards/rejected': -38.63679504394531, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 18.873680114746094, 'logps/chosen': -342.18231201171875, 'logps/rejected': -496.32232666015625, 'logits/chosen': -2.013960361480713, 'logits/rejected': -2.9975972175598145, 'epoch': 0.82} |
|
82%|βββββββββ | 1271/1545 [11:13<02:32, 1.80it/s]
82%|βββββββββ | 1272/1545 [11:13<02:33, 1.78it/s]
82%|βββββββββ | 1273/1545 [11:14<02:21, 1.92it/s]
82%|βββββββββ | 1274/1545 [11:14<02:25, 1.86it/s]
83%|βββββββββ | 1275/1545 [11:15<02:27, 1.84it/s]
83%|βββββββββ | 1276/1545 [11:15<02:24, 1.86it/s]
83%|βββββββββ | 1277/1545 [11:16<02:25, 1.84it/s]
83%|βββββββββ | 1278/1545 [11:17<02:29, 1.78it/s]
83%|βββββββββ | 1279/1545 [11:17<02:29, 1.77it/s]
83%|βββββββββ | 1280/1545 [11:17<02:17, 1.93it/s]
{'loss': 0.444, 'grad_norm': 7.771561172376096e-16, 'learning_rate': 1.715210355987055e-06, 'rewards/chosen': -16.24677085876465, 'rewards/rejected': -37.7825927734375, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 21.535823822021484, 'logps/chosen': -301.96905517578125, 'logps/rejected': -496.69610595703125, 'logits/chosen': -1.7921947240829468, 'logits/rejected': -2.7485709190368652, 'epoch': 0.83} |
|
83%|βββββββββ | 1280/1545 [11:18<02:17, 1.93it/s]
83%|βββββββββ | 1281/1545 [11:18<02:24, 1.83it/s]
83%|βββββββββ | 1282/1545 [11:19<02:23, 1.83it/s]
83%|βββββββββ | 1283/1545 [11:19<02:20, 1.87it/s]
83%|βββββββββ | 1284/1545 [11:20<02:17, 1.89it/s]
83%|βββββββββ | 1285/1545 [11:20<02:21, 1.84it/s]
83%|βββββββββ | 1286/1545 [11:21<02:21, 1.83it/s]
83%|βββββββββ | 1287/1545 [11:21<02:12, 1.94it/s]
83%|βββββββββ | 1288/1545 [11:22<02:01, 2.11it/s]
83%|βββββββββ | 1289/1545 [11:22<02:08, 2.00it/s]
83%|βββββββββ | 1290/1545 [11:23<02:11, 1.93it/s]
{'loss': 0.066, 'grad_norm': 6.668269634246826e-07, 'learning_rate': 1.650485436893204e-06, 'rewards/chosen': -12.632109642028809, 'rewards/rejected': -42.14348602294922, 'rewards/accuracies': 1.0, 'rewards/margins': 29.511377334594727, 'logps/chosen': -311.40838623046875, 'logps/rejected': -553.1618041992188, 'logits/chosen': -1.4785627126693726, 'logits/rejected': -2.7886316776275635, 'epoch': 0.83} |
|
84%|βββββββββ | 1291/1545 [11:23<02:08, 1.97it/s]
84%|βββββββββ | 1292/1545 [11:24<02:13, 1.89it/s]
84%|βββββββββ | 1293/1545 [11:24<02:17, 1.83it/s]
84%|βββββββββ | 1294/1545 [11:25<02:16, 1.84it/s]
84%|βββββββββ | 1295/1545 [11:25<02:16, 1.83it/s]
84%|βββββββββ | 1296/1545 [11:26<02:18, 1.79it/s]
84%|βββββββββ | 1297/1545 [11:27<02:19, 1.78it/s]
84%|βββββββββ | 1298/1545 [11:27<02:07, 1.94it/s]
84%|βββββββββ | 1299/1545 [11:28<02:13, 1.84it/s]
84%|βββββββββ | 1300/1545 [11:28<02:13, 1.84it/s]
{'loss': 0.0, 'grad_norm': 0.01336669921875, 'learning_rate': 1.585760517799353e-06, 'rewards/chosen': -15.508363723754883, 'rewards/rejected': -44.09370803833008, 'rewards/accuracies': 1.0, 'rewards/margins': 28.585346221923828, 'logps/chosen': -305.378662109375, 'logps/rejected': -551.26708984375, 'logits/chosen': -1.9857141971588135, 'logits/rejected': -2.942575454711914, 'epoch': 0.84} |
|
84%|βββββββββ | 1301/1545 [11:29<02:12, 1.84it/s]
84%|βββββββββ | 1302/1545 [11:29<02:10, 1.86it/s]
84%|βββββββββ | 1303/1545 [11:30<02:12, 1.82it/s]
84%|βββββββββ | 1304/1545 [11:30<02:14, 1.79it/s]
84%|βββββββββ | 1305/1545 [11:31<02:06, 1.90it/s]
85%|βββββββββ | 1306/1545 [11:31<02:11, 1.82it/s]
85%|βββββββββ | 1307/1545 [11:32<02:10, 1.82it/s]
85%|βββββββββ | 1308/1545 [11:33<02:09, 1.82it/s]
85%|βββββββββ | 1309/1545 [11:33<02:07, 1.86it/s]
85%|βββββββββ | 1310/1545 [11:34<02:09, 1.82it/s]
{'loss': 0.2831, 'grad_norm': 5.438923835754395e-07, 'learning_rate': 1.5210355987055017e-06, 'rewards/chosen': -18.139490127563477, 'rewards/rejected': -49.077415466308594, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 30.93792724609375, 'logps/chosen': -349.8685607910156, 'logps/rejected': -612.5897216796875, 'logits/chosen': -1.8611408472061157, 'logits/rejected': -3.1098215579986572, 'epoch': 0.85} |
|
85%|βββββββββ | 1311/1545 [11:34<02:10, 1.80it/s]
85%|βββββββββ | 1312/1545 [11:35<02:01, 1.92it/s]
85%|βββββββββ | 1313/1545 [11:35<02:04, 1.86it/s]
85%|βββββββββ | 1314/1545 [11:36<01:52, 2.05it/s]
85%|βββββββββ | 1315/1545 [11:36<01:58, 1.95it/s]
85%|βββββββββ | 1316/1545 [11:37<01:51, 2.05it/s]
85%|βββββββββ | 1317/1545 [11:37<01:56, 1.95it/s]
85%|βββββββββ | 1318/1545 [11:38<01:46, 2.12it/s]
85%|βββββββββ | 1319/1545 [11:38<01:54, 1.98it/s]
85%|βββββββββ | 1320/1545 [11:39<01:49, 2.05it/s]
{'loss': 0.0, 'grad_norm': 9.75781955236954e-17, 'learning_rate': 1.4563106796116506e-06, 'rewards/chosen': -10.908350944519043, 'rewards/rejected': -47.042503356933594, 'rewards/accuracies': 1.0, 'rewards/margins': 36.134151458740234, 'logps/chosen': -251.7775421142578, 'logps/rejected': -576.4340209960938, 'logits/chosen': -1.4616104364395142, 'logits/rejected': -3.171027898788452, 'epoch': 0.85} |
|
86%|βββββββββ | 1321/1545 [11:39<01:57, 1.91it/s]
86%|βββββββββ | 1322/1545 [11:40<01:58, 1.87it/s]
86%|βββββββββ | 1323/1545 [11:40<01:59, 1.86it/s]
86%|βββββββββ | 1324/1545 [11:41<01:57, 1.88it/s]
86%|βββββββββ | 1325/1545 [11:41<01:59, 1.84it/s]
86%|βββββββββ | 1326/1545 [11:42<02:00, 1.81it/s]
86%|βββββββββ | 1327/1545 [11:42<01:44, 2.08it/s]
86%|βββββββββ | 1328/1545 [11:43<01:49, 1.97it/s]
86%|βββββββββ | 1329/1545 [11:43<01:53, 1.90it/s]
86%|βββββββββ | 1330/1545 [11:44<01:54, 1.88it/s]
{'loss': 0.0001, 'grad_norm': 2.34375, 'learning_rate': 1.3915857605177997e-06, 'rewards/chosen': -12.354753494262695, 'rewards/rejected': -41.89970016479492, 'rewards/accuracies': 1.0, 'rewards/margins': 29.54494857788086, 'logps/chosen': -297.7950134277344, 'logps/rejected': -539.6881103515625, 'logits/chosen': -1.1807644367218018, 'logits/rejected': -2.6145424842834473, 'epoch': 0.86} |
|
86%|βββββββββ | 1331/1545 [11:44<01:47, 2.00it/s]
86%|βββββββββ | 1332/1545 [11:45<01:38, 2.15it/s]
86%|βββββββββ | 1333/1545 [11:45<01:45, 2.02it/s]
86%|βββββββββ | 1334/1545 [11:46<01:49, 1.93it/s]
86%|βββββββββ | 1335/1545 [11:46<01:42, 2.05it/s]
86%|βββββββββ | 1336/1545 [11:47<01:48, 1.93it/s]
87%|βββββββββ | 1337/1545 [11:47<01:50, 1.88it/s]
87%|βββββββββ | 1338/1545 [11:48<01:49, 1.89it/s]
87%|βββββββββ | 1339/1545 [11:49<01:47, 1.91it/s]
87%|βββββββββ | 1340/1545 [11:49<01:50, 1.85it/s]
{'loss': 0.0001, 'grad_norm': 1.2747477740049362e-08, 'learning_rate': 1.3268608414239483e-06, 'rewards/chosen': -7.87436580657959, 'rewards/rejected': -39.97648239135742, 'rewards/accuracies': 1.0, 'rewards/margins': 32.10211944580078, 'logps/chosen': -211.8810577392578, 'logps/rejected': -505.80975341796875, 'logits/chosen': -1.697493314743042, 'logits/rejected': -3.0915396213531494, 'epoch': 0.87} |
|
87%|βββββββββ | 1341/1545 [11:50<01:52, 1.82it/s]
87%|βββββββββ | 1342/1545 [11:50<01:46, 1.90it/s]
87%|βββββββββ | 1343/1545 [11:52<02:47, 1.20it/s]
87%|βββββββββ | 1344/1545 [11:52<02:30, 1.33it/s]
87%|βββββββββ | 1345/1545 [11:53<02:16, 1.47it/s]
87%|βββββββββ | 1346/1545 [11:53<02:09, 1.53it/s]
87%|βββββββββ | 1347/1545 [11:54<02:05, 1.57it/s]
87%|βββββββββ | 1348/1545 [11:54<02:00, 1.64it/s]
87%|βββββββββ | 1349/1545 [11:55<01:48, 1.80it/s]
87%|βββββββββ | 1350/1545 [11:56<01:49, 1.78it/s]
{'loss': 0.0, 'grad_norm': 5.327165126800537e-07, 'learning_rate': 1.2621359223300972e-06, 'rewards/chosen': -13.375717163085938, 'rewards/rejected': -49.55775833129883, 'rewards/accuracies': 1.0, 'rewards/margins': 36.18204116821289, 'logps/chosen': -304.5226135253906, 'logps/rejected': -614.50341796875, 'logits/chosen': -1.5832102298736572, 'logits/rejected': -3.0017504692077637, 'epoch': 0.87} |
|
87%|βββββββββ | 1351/1545 [11:56<01:48, 1.78it/s]
88%|βββββββββ | 1352/1545 [11:57<01:44, 1.85it/s]
88%|βββββββββ | 1353/1545 [11:57<01:43, 1.86it/s]
88%|βββββββββ | 1354/1545 [11:58<01:45, 1.80it/s]
88%|βββββββββ | 1355/1545 [11:58<01:46, 1.78it/s]
88%|βββββββββ | 1356/1545 [11:59<01:39, 1.90it/s]
88%|βββββββββ | 1357/1545 [11:59<01:30, 2.07it/s]
88%|βββββββββ | 1358/1545 [12:00<01:36, 1.94it/s]
88%|βββββββββ | 1359/1545 [12:00<01:38, 1.88it/s]
88%|βββββββββ | 1360/1545 [12:01<01:30, 2.04it/s]
{'loss': 0.0, 'grad_norm': 0.0023956298828125, 'learning_rate': 1.197411003236246e-06, 'rewards/chosen': -8.20715045928955, 'rewards/rejected': -40.68087387084961, 'rewards/accuracies': 1.0, 'rewards/margins': 32.47372817993164, 'logps/chosen': -225.0033416748047, 'logps/rejected': -529.2271118164062, 'logits/chosen': -1.879974603652954, 'logits/rejected': -2.8071045875549316, 'epoch': 0.88} |
|
88%|βββββββββ | 1361/1545 [12:01<01:35, 1.93it/s]
88%|βββββββββ | 1362/1545 [12:02<01:26, 2.12it/s]
88%|βββββββββ | 1363/1545 [12:02<01:31, 1.99it/s]
88%|βββββββββ | 1364/1545 [12:03<01:26, 2.10it/s]
88%|βββββββββ | 1365/1545 [12:03<01:32, 1.94it/s]
88%|βββββββββ | 1366/1545 [12:04<01:34, 1.90it/s]
88%|βββββββββ | 1367/1545 [12:04<01:33, 1.91it/s]
89%|βββββββββ | 1368/1545 [12:05<01:33, 1.89it/s]
89%|βββββββββ | 1369/1545 [12:05<01:35, 1.85it/s]
89%|βββββββββ | 1370/1545 [12:06<01:36, 1.81it/s]
{'loss': 0.0, 'grad_norm': 3.245077095925808e-09, 'learning_rate': 1.132686084142395e-06, 'rewards/chosen': -13.201225280761719, 'rewards/rejected': -47.207088470458984, 'rewards/accuracies': 1.0, 'rewards/margins': 34.005863189697266, 'logps/chosen': -289.7091369628906, 'logps/rejected': -593.4539794921875, 'logits/chosen': -1.5826839208602905, 'logits/rejected': -2.911431074142456, 'epoch': 0.89} |
|
89%|βββββββββ | 1371/1545 [12:06<01:30, 1.92it/s]
89%|βββββββββ | 1372/1545 [12:07<01:33, 1.85it/s]
89%|βββββββββ | 1373/1545 [12:08<01:33, 1.84it/s]
89%|βββββββββ | 1374/1545 [12:08<01:32, 1.85it/s]
89%|βββββββββ | 1375/1545 [12:09<01:30, 1.88it/s]
89%|βββββββββ | 1376/1545 [12:09<01:32, 1.83it/s]
89%|βββββββββ | 1377/1545 [12:10<01:31, 1.83it/s]
89%|βββββββββ | 1378/1545 [12:10<01:25, 1.96it/s]
89%|βββββββββ | 1379/1545 [12:11<01:28, 1.87it/s]
89%|βββββββββ | 1380/1545 [12:11<01:29, 1.84it/s]
{'loss': 0.0046, 'grad_norm': 2.625, 'learning_rate': 1.0679611650485437e-06, 'rewards/chosen': -19.794082641601562, 'rewards/rejected': -42.44209671020508, 'rewards/accuracies': 1.0, 'rewards/margins': 22.648014068603516, 'logps/chosen': -388.0662841796875, 'logps/rejected': -544.12451171875, 'logits/chosen': -1.7488387823104858, 'logits/rejected': -2.9612879753112793, 'epoch': 0.89} |
|
89%|βββββββββ | 1381/1545 [12:12<01:28, 1.86it/s]
89%|βββββββββ | 1382/1545 [12:12<01:21, 2.00it/s]
90%|βββββββββ | 1383/1545 [12:13<01:25, 1.89it/s]
90%|βββββββββ | 1384/1545 [12:13<01:27, 1.84it/s]
90%|βββββββββ | 1385/1545 [12:14<01:24, 1.90it/s]
90%|βββββββββ | 1386/1545 [12:14<01:16, 2.09it/s]
90%|βββββββββ | 1387/1545 [12:15<01:20, 1.95it/s]
90%|βββββββββ | 1388/1545 [12:15<01:22, 1.90it/s]
90%|βββββββββ | 1389/1545 [12:16<01:19, 1.96it/s]
90%|βββββββββ | 1390/1545 [12:16<01:22, 1.88it/s]
{'loss': 0.0058, 'grad_norm': 75.5, 'learning_rate': 1.0032362459546926e-06, 'rewards/chosen': -14.03381633758545, 'rewards/rejected': -39.99800109863281, 'rewards/accuracies': 1.0, 'rewards/margins': 25.964187622070312, 'logps/chosen': -306.3893127441406, 'logps/rejected': -523.348388671875, 'logits/chosen': -1.803492784500122, 'logits/rejected': -2.9958126544952393, 'epoch': 0.9} |
|
90%|βββββββββ | 1391/1545 [12:17<01:23, 1.84it/s]
90%|βββββββββ | 1392/1545 [12:18<01:22, 1.85it/s]
90%|βββββββββ | 1393/1545 [12:18<01:16, 1.98it/s]
90%|βββββββββ | 1394/1545 [12:19<01:21, 1.86it/s]
90%|βββββββββ | 1395/1545 [12:19<01:21, 1.84it/s]
90%|βββββββββ | 1396/1545 [12:20<01:17, 1.91it/s]
90%|βββββββββ | 1397/1545 [12:20<01:18, 1.89it/s]
90%|βββββββββ | 1398/1545 [12:21<01:19, 1.85it/s]
91%|βββββββββ | 1399/1545 [12:21<01:19, 1.85it/s]
91%|βββββββββ | 1400/1545 [12:22<01:13, 1.98it/s]
{'loss': 0.0, 'grad_norm': 4.298783551348606e-13, 'learning_rate': 9.385113268608415e-07, 'rewards/chosen': -17.3962345123291, 'rewards/rejected': -49.274452209472656, 'rewards/accuracies': 1.0, 'rewards/margins': 31.878215789794922, 'logps/chosen': -298.85546875, 'logps/rejected': -580.9243774414062, 'logits/chosen': -1.9781240224838257, 'logits/rejected': -3.3585548400878906, 'epoch': 0.91} |
|
91%|βββββββββ | 1401/1545 [12:22<01:15, 1.90it/s]
91%|βββββββββ | 1402/1545 [12:23<01:16, 1.88it/s]
91%|βββββββββ | 1403/1545 [12:23<01:15, 1.88it/s]
91%|βββββββββ | 1404/1545 [12:24<01:12, 1.95it/s]
91%|βββββββββ | 1405/1545 [12:24<01:06, 2.11it/s]
91%|βββββββββ | 1406/1545 [12:25<01:09, 1.99it/s]
91%|βββββββββ | 1407/1545 [12:25<01:10, 1.95it/s]
91%|βββββββββ | 1408/1545 [12:26<01:10, 1.93it/s]
91%|βββββββββ | 1409/1545 [12:26<01:13, 1.86it/s]
91%|ββββββββββ| 1410/1545 [12:27<01:13, 1.83it/s]
{'loss': 0.0, 'grad_norm': 1.4137996329210978e-16, 'learning_rate': 8.737864077669904e-07, 'rewards/chosen': -10.19818115234375, 'rewards/rejected': -47.04527282714844, 'rewards/accuracies': 1.0, 'rewards/margins': 36.847084045410156, 'logps/chosen': -266.72540283203125, 'logps/rejected': -583.9825439453125, 'logits/chosen': -1.446902871131897, 'logits/rejected': -3.2011420726776123, 'epoch': 0.91} |
|
91%|ββββββββββ| 1411/1545 [12:27<01:10, 1.89it/s]
91%|ββββββββββ| 1412/1545 [12:28<01:12, 1.84it/s]
91%|ββββββββββ| 1413/1545 [12:29<01:12, 1.82it/s]
92%|ββββββββββ| 1414/1545 [12:29<01:11, 1.83it/s]
92%|ββββββββββ| 1415/1545 [12:30<01:09, 1.86it/s]
92%|ββββββββββ| 1416/1545 [12:30<01:12, 1.78it/s]
92%|ββββββββββ| 1417/1545 [12:31<01:11, 1.79it/s]
92%|ββββββββββ| 1418/1545 [12:31<01:06, 1.90it/s]
92%|ββββββββββ| 1419/1545 [12:32<01:08, 1.85it/s]
92%|ββββββββββ| 1420/1545 [12:32<01:08, 1.82it/s]
{'loss': 0.0013, 'grad_norm': 1.0244548320770264e-07, 'learning_rate': 8.090614886731392e-07, 'rewards/chosen': -20.795812606811523, 'rewards/rejected': -50.17866897583008, 'rewards/accuracies': 1.0, 'rewards/margins': 29.38285255432129, 'logps/chosen': -362.2292785644531, 'logps/rejected': -605.4534301757812, 'logits/chosen': -2.1136467456817627, 'logits/rejected': -3.451559543609619, 'epoch': 0.92} |
|
92%|ββββββββββ| 1421/1545 [12:33<01:08, 1.82it/s]
92%|ββββββββββ| 1422/1545 [12:34<01:06, 1.85it/s]
92%|ββββββββββ| 1423/1545 [12:34<01:07, 1.81it/s]
92%|ββββββββββ| 1424/1545 [12:35<01:07, 1.79it/s]
92%|ββββββββββ| 1425/1545 [12:35<01:03, 1.90it/s]
92%|ββββββββββ| 1426/1545 [12:36<01:05, 1.82it/s]
92%|ββββββββββ| 1427/1545 [12:36<01:04, 1.82it/s]
92%|ββββββββββ| 1428/1545 [12:37<01:04, 1.81it/s]
92%|ββββββββββ| 1429/1545 [12:37<01:02, 1.85it/s]
93%|ββββββββββ| 1430/1545 [12:38<01:04, 1.80it/s]
{'loss': 0.0001, 'grad_norm': 7.486343383789062e-05, 'learning_rate': 7.443365695792882e-07, 'rewards/chosen': -17.025291442871094, 'rewards/rejected': -57.489990234375, 'rewards/accuracies': 1.0, 'rewards/margins': 40.46469497680664, 'logps/chosen': -336.61431884765625, 'logps/rejected': -702.4564208984375, 'logits/chosen': -1.5203006267547607, 'logits/rejected': -3.2374939918518066, 'epoch': 0.93} |
|
93%|ββββββββββ| 1431/1545 [12:39<01:04, 1.78it/s]
93%|ββββββββββ| 1432/1545 [12:39<00:59, 1.89it/s]
93%|ββββββββββ| 1433/1545 [12:40<01:01, 1.82it/s]
93%|ββββββββββ| 1434/1545 [12:40<01:01, 1.80it/s]
93%|ββββββββββ| 1435/1545 [12:41<01:01, 1.79it/s]
93%|ββββββββββ| 1436/1545 [12:41<00:58, 1.86it/s]
93%|ββββββββββ| 1437/1545 [12:42<00:59, 1.82it/s]
93%|ββββββββββ| 1438/1545 [12:42<00:59, 1.80it/s]
93%|ββββββββββ| 1439/1545 [12:43<00:54, 1.93it/s]
93%|ββββββββββ| 1440/1545 [12:43<00:57, 1.84it/s]
{'loss': 0.0775, 'grad_norm': 3.4421682357788086e-06, 'learning_rate': 6.79611650485437e-07, 'rewards/chosen': -16.205768585205078, 'rewards/rejected': -38.98480224609375, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 22.779033660888672, 'logps/chosen': -307.8979797363281, 'logps/rejected': -502.49041748046875, 'logits/chosen': -2.063363552093506, 'logits/rejected': -3.2290992736816406, 'epoch': 0.93} |
|
93%|ββββββββββ| 1441/1545 [12:44<00:58, 1.79it/s]
93%|ββββββββββ| 1442/1545 [12:44<00:56, 1.81it/s]
93%|ββββββββββ| 1443/1545 [12:45<00:55, 1.85it/s]
93%|ββββββββββ| 1444/1545 [12:46<00:55, 1.83it/s]
94%|ββββββββββ| 1445/1545 [12:46<00:55, 1.81it/s]
94%|ββββββββββ| 1446/1545 [12:47<00:51, 1.93it/s]
94%|ββββββββββ| 1447/1545 [12:47<00:52, 1.87it/s]
94%|ββββββββββ| 1448/1545 [12:48<00:52, 1.83it/s]
94%|ββββββββββ| 1449/1545 [12:48<00:53, 1.78it/s]
94%|ββββββββββ| 1450/1545 [12:49<00:51, 1.84it/s]
{'loss': 0.0001, 'grad_norm': 1.2168044349891716e-13, 'learning_rate': 6.148867313915858e-07, 'rewards/chosen': -18.26565933227539, 'rewards/rejected': -48.65673828125, 'rewards/accuracies': 1.0, 'rewards/margins': 30.391077041625977, 'logps/chosen': -316.24761962890625, 'logps/rejected': -603.7527465820312, 'logits/chosen': -2.136662006378174, 'logits/rejected': -3.263352870941162, 'epoch': 0.94} |
|
94%|ββββββββββ| 1451/1545 [12:49<00:53, 1.75it/s]
94%|ββββββββββ| 1452/1545 [12:50<00:53, 1.74it/s]
94%|ββββββββββ| 1453/1545 [12:50<00:49, 1.85it/s]
94%|ββββββββββ| 1454/1545 [12:51<00:51, 1.77it/s]
94%|ββββββββββ| 1455/1545 [12:52<00:50, 1.78it/s]
94%|ββββββββββ| 1456/1545 [12:53<01:13, 1.21it/s]
94%|ββββββββββ| 1457/1545 [12:54<01:06, 1.32it/s]
94%|ββββββββββ| 1458/1545 [12:54<00:56, 1.55it/s]
94%|ββββββββββ| 1459/1545 [12:55<00:53, 1.60it/s]
94%|ββββββββββ| 1460/1545 [12:55<00:48, 1.75it/s]
{'loss': 0.0, 'grad_norm': 2.0469737016526324e-16, 'learning_rate': 5.501618122977346e-07, 'rewards/chosen': -17.234251022338867, 'rewards/rejected': -46.855690002441406, 'rewards/accuracies': 1.0, 'rewards/margins': 29.621444702148438, 'logps/chosen': -348.11370849609375, 'logps/rejected': -579.3179321289062, 'logits/chosen': -1.4622868299484253, 'logits/rejected': -3.198202133178711, 'epoch': 0.94} |
|
95%|ββββββββββ| 1461/1545 [12:56<00:49, 1.69it/s]
95%|ββββββββββ| 1462/1545 [12:56<00:49, 1.69it/s]
95%|ββββββββββ| 1463/1545 [12:57<00:47, 1.72it/s]
95%|ββββββββββ| 1464/1545 [12:57<00:45, 1.77it/s]
95%|ββββββββββ| 1465/1545 [12:58<00:45, 1.75it/s]
95%|ββββββββββ| 1466/1545 [12:59<00:44, 1.76it/s]
95%|ββββββββββ| 1467/1545 [12:59<00:41, 1.89it/s]
95%|ββββββββββ| 1468/1545 [13:00<00:42, 1.81it/s]
95%|ββββββββββ| 1469/1545 [13:00<00:41, 1.81it/s]
95%|ββββββββββ| 1470/1545 [13:01<00:40, 1.83it/s]
{'loss': 0.0, 'grad_norm': 1.4921397450962104e-13, 'learning_rate': 4.854368932038835e-07, 'rewards/chosen': -14.240861892700195, 'rewards/rejected': -46.95703887939453, 'rewards/accuracies': 1.0, 'rewards/margins': 32.7161750793457, 'logps/chosen': -306.51873779296875, 'logps/rejected': -581.2691040039062, 'logits/chosen': -1.777173638343811, 'logits/rejected': -3.262760877609253, 'epoch': 0.95} |
|
95%|ββββββββββ| 1471/1545 [13:01<00:35, 2.08it/s]
95%|ββββββββββ| 1472/1545 [13:02<00:37, 1.96it/s]
95%|ββββββββββ| 1473/1545 [13:02<00:37, 1.91it/s]
95%|ββββββββββ| 1474/1545 [13:03<00:36, 1.94it/s]
95%|ββββββββββ| 1475/1545 [13:03<00:36, 1.91it/s]
96%|ββββββββββ| 1476/1545 [13:04<00:37, 1.86it/s]
96%|ββββββββββ| 1477/1545 [13:04<00:37, 1.81it/s]
96%|ββββββββββ| 1478/1545 [13:05<00:34, 1.93it/s]
96%|ββββββββββ| 1479/1545 [13:05<00:35, 1.86it/s]
96%|ββββββββββ| 1480/1545 [13:06<00:35, 1.84it/s]
{'loss': 0.001, 'grad_norm': 5.186961971048731e-13, 'learning_rate': 4.207119741100324e-07, 'rewards/chosen': -15.928776741027832, 'rewards/rejected': -49.93461608886719, 'rewards/accuracies': 1.0, 'rewards/margins': 34.00584030151367, 'logps/chosen': -290.5811767578125, 'logps/rejected': -614.055419921875, 'logits/chosen': -2.113398551940918, 'logits/rejected': -3.1592159271240234, 'epoch': 0.96} |
|
96%|ββββββββββ| 1481/1545 [13:06<00:34, 1.83it/s]
96%|ββββββββββ| 1482/1545 [13:07<00:34, 1.85it/s]
96%|ββββββββββ| 1483/1545 [13:08<00:34, 1.82it/s]
96%|ββββββββββ| 1484/1545 [13:08<00:33, 1.83it/s]
96%|ββββββββββ| 1485/1545 [13:09<00:31, 1.93it/s]
96%|ββββββββββ| 1486/1545 [13:09<00:31, 1.88it/s]
96%|ββββββββββ| 1487/1545 [13:10<00:31, 1.84it/s]
96%|ββββββββββ| 1488/1545 [13:10<00:31, 1.83it/s]
96%|ββββββββββ| 1489/1545 [13:11<00:29, 1.87it/s]
96%|ββββββββββ| 1490/1545 [13:11<00:30, 1.80it/s]
{'loss': 0.3793, 'grad_norm': 8.836247705756861e-18, 'learning_rate': 3.5598705501618125e-07, 'rewards/chosen': -12.482555389404297, 'rewards/rejected': -39.87023162841797, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 27.387676239013672, 'logps/chosen': -264.73980712890625, 'logps/rejected': -496.36474609375, 'logits/chosen': -1.77366042137146, 'logits/rejected': -3.205986499786377, 'epoch': 0.96} |
|
97%|ββββββββββ| 1491/1545 [13:12<00:30, 1.77it/s]
97%|ββββββββββ| 1492/1545 [13:12<00:28, 1.89it/s]
97%|ββββββββββ| 1493/1545 [13:13<00:28, 1.83it/s]
97%|ββββββββββ| 1494/1545 [13:14<00:28, 1.82it/s]
97%|ββββββββββ| 1495/1545 [13:14<00:27, 1.82it/s]
97%|ββββββββββ| 1496/1545 [13:15<00:26, 1.84it/s]
97%|ββββββββββ| 1497/1545 [13:15<00:26, 1.81it/s]
97%|ββββββββββ| 1498/1545 [13:16<00:25, 1.81it/s]
97%|ββββββββββ| 1499/1545 [13:16<00:23, 1.95it/s]
97%|ββββββββββ| 1500/1545 [13:17<00:23, 1.89it/s]
{'loss': 0.0, 'grad_norm': 4.729550084903167e-14, 'learning_rate': 2.9126213592233014e-07, 'rewards/chosen': -16.625436782836914, 'rewards/rejected': -45.96239471435547, 'rewards/accuracies': 1.0, 'rewards/margins': 29.336956024169922, 'logps/chosen': -287.1039733886719, 'logps/rejected': -579.9110107421875, 'logits/chosen': -2.09769868850708, 'logits/rejected': -3.411703109741211, 'epoch': 0.97} |
|
97%|ββββββββββ| 1501/1545 [13:17<00:23, 1.85it/s]
97%|ββββββββββ| 1502/1545 [13:18<00:21, 2.04it/s]
97%|ββββββββββ| 1503/1545 [13:18<00:19, 2.12it/s]
97%|ββββββββββ| 1504/1545 [13:19<00:20, 2.00it/s]
97%|ββββββββββ| 1505/1545 [13:19<00:20, 1.94it/s]
97%|ββββββββββ| 1506/1545 [13:20<00:20, 1.91it/s]
98%|ββββββββββ| 1507/1545 [13:20<00:18, 2.04it/s]
98%|ββββββββββ| 1508/1545 [13:21<00:19, 1.93it/s]
98%|ββββββββββ| 1509/1545 [13:21<00:17, 2.11it/s]
98%|ββββββββββ| 1510/1545 [13:22<00:17, 2.02it/s]
{'loss': 0.0693, 'grad_norm': 4.041939973831177e-07, 'learning_rate': 2.26537216828479e-07, 'rewards/chosen': -13.022150039672852, 'rewards/rejected': -43.39824676513672, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 30.3760929107666, 'logps/chosen': -266.9024963378906, 'logps/rejected': -551.7422485351562, 'logits/chosen': -1.764814019203186, 'logits/rejected': -3.2696170806884766, 'epoch': 0.98} |
|
98%|ββββββββββ| 1511/1545 [13:22<00:16, 2.11it/s]
98%|ββββββββββ| 1512/1545 [13:23<00:16, 1.94it/s]
98%|ββββββββββ| 1513/1545 [13:23<00:16, 1.91it/s]
98%|ββββββββββ| 1514/1545 [13:24<00:16, 1.92it/s]
98%|ββββββββββ| 1515/1545 [13:24<00:15, 1.90it/s]
98%|ββββββββββ| 1516/1545 [13:25<00:15, 1.86it/s]
98%|ββββββββββ| 1517/1545 [13:25<00:15, 1.84it/s]
98%|ββββββββββ| 1518/1545 [13:26<00:13, 2.02it/s]
98%|ββββββββββ| 1519/1545 [13:26<00:11, 2.20it/s]
98%|ββββββββββ| 1520/1545 [13:27<00:12, 2.02it/s]
{'loss': 0.0007, 'grad_norm': 6.28125, 'learning_rate': 1.6181229773462782e-07, 'rewards/chosen': -14.050992965698242, 'rewards/rejected': -44.175750732421875, 'rewards/accuracies': 1.0, 'rewards/margins': 30.124755859375, 'logps/chosen': -316.6773376464844, 'logps/rejected': -552.1802978515625, 'logits/chosen': -1.1631816625595093, 'logits/rejected': -3.36537504196167, 'epoch': 0.98} |
|
98%|ββββββββββ| 1521/1545 [13:27<00:12, 1.94it/s]
99%|ββββββββββ| 1522/1545 [13:28<00:11, 1.93it/s]
99%|ββββββββββ| 1523/1545 [13:28<00:11, 1.91it/s]
99%|ββββββββββ| 1524/1545 [13:29<00:11, 1.85it/s]
99%|ββββββββββ| 1525/1545 [13:29<00:09, 2.05it/s]
99%|ββββββββββ| 1526/1545 [13:30<00:09, 2.09it/s]
99%|ββββββββββ| 1527/1545 [13:30<00:09, 1.93it/s]
99%|ββββββββββ| 1528/1545 [13:31<00:09, 1.87it/s]
99%|ββββββββββ| 1529/1545 [13:32<00:08, 1.85it/s]
99%|ββββββββββ| 1530/1545 [13:32<00:07, 1.97it/s]
{'loss': 0.0, 'grad_norm': 6.714628852932947e-13, 'learning_rate': 9.70873786407767e-08, 'rewards/chosen': -18.606670379638672, 'rewards/rejected': -48.73549270629883, 'rewards/accuracies': 1.0, 'rewards/margins': 30.12882423400879, 'logps/chosen': -354.6220703125, 'logps/rejected': -609.2911376953125, 'logits/chosen': -1.5294870138168335, 'logits/rejected': -3.3688862323760986, 'epoch': 0.99} |
|
99%|ββββββββββ| 1531/1545 [13:33<00:07, 1.88it/s]
99%|ββββββββββ| 1532/1545 [13:33<00:06, 1.86it/s]
99%|ββββββββββ| 1533/1545 [13:33<00:05, 2.06it/s]
99%|ββββββββββ| 1534/1545 [13:34<00:05, 2.16it/s]
99%|ββββββββββ| 1535/1545 [13:34<00:05, 1.99it/s]
99%|ββββββββββ| 1536/1545 [13:35<00:04, 1.90it/s]
99%|ββββββββββ| 1537/1545 [13:36<00:04, 1.93it/s]
100%|ββββββββββ| 1538/1545 [13:36<00:03, 1.81it/s]
100%|ββββββββββ| 1539/1545 [13:37<00:03, 1.73it/s]
100%|ββββββββββ| 1540/1545 [13:37<00:02, 1.77it/s]
{'loss': 1.0373, 'grad_norm': 9.86623976961809e-18, 'learning_rate': 3.2362459546925574e-08, 'rewards/chosen': -16.71319007873535, 'rewards/rejected': -44.19614028930664, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 27.482952117919922, 'logps/chosen': -343.7029724121094, 'logps/rejected': -564.3067016601562, 'logits/chosen': -1.7413575649261475, 'logits/rejected': -2.7366671562194824, 'epoch': 1.0} |
|
100%|ββββββββββ| 1541/1545 [13:38<00:02, 1.77it/s]
100%|ββββββββββ| 1542/1545 [13:39<00:01, 1.73it/s]
100%|ββββββββββ| 1543/1545 [13:39<00:01, 1.72it/s]
100%|ββββββββββ| 1544/1545 [13:40<00:00, 1.82it/s]
100%|ββββββββββ| 1545/1545 [13:40<00:00, 1.79it/s]
{'train_runtime': 832.5695, 'train_samples_per_second': 1.856, 'train_steps_per_second': 1.856, 'train_loss': 0.6777341416610262, 'epoch': 1.0} |
|
100%|ββββββββββ| 1545/1545 [13:52<00:00, 1.79it/s]
100%|ββββββββββ| 1545/1545 [13:52<00:00, 1.86it/s] |
|
|