[2025-01-09 15:54:48,830] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
git root error: Cmd('git') failed due to: exit code(128)
  cmdline: git rev-parse --show-toplevel
  stderr: 'fatal: detected dubious ownership in repository at '/workspace'
To add an exception for this directory, call:

	git config --global --add safe.directory /workspace'
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: nguyenducphu201101. Use `wandb login --relogin` to force relogin
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/pydantic/main.py:314: UserWarning: Pydantic serializer warnings:
  Expected `list[str]` but got `tuple` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/pydantic/main.py:314: UserWarning: Pydantic serializer warnings:
  Expected `list[str]` but got `tuple` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /workspace/wandb/run-20250109_155451-37f2bdb2-2552-4958-b0be-7186fa7cfbe6
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run test-dpo
wandb: ⭐️ View project at https://wandb.ai/nguyenducphu201101/llm-training-platform
wandb: 🚀 View run at https://wandb.ai/nguyenducphu201101/llm-training-platform/runs/37f2bdb2-2552-4958-b0be-7186fa7cfbe6

Generating train split:   0%|          | 0/1545 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 1545/1545 [00:00<00:00, 119090.67 examples/s]

Generating test split:   0%|          | 0/89 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 89/89 [00:00<00:00, 37679.73 examples/s]
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': model_init_kwargs, ref_model_init_kwargs. Will not be supported from version '0.13.0'.

Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:262: UserWarning: You passed `model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:287: UserWarning: You passed `ref_model_init_kwargs` to the DPOTrainer, the value you passed will override the one in the `DPOConfig`.
  warnings.warn(
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:312: UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
  warnings.warn(
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py:319: UserWarning: You passed a ref model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM`
  warnings.warn(

Extracting prompt from train dataset:   0%|          | 0/1545 [00:00<?, ? examples/s]
Extracting prompt from train dataset:  65%|██████▍   | 1000/1545 [00:00<00:00, 9015.34 examples/s]
Extracting prompt from train dataset: 100%|██████████| 1545/1545 [00:00<00:00, 9401.17 examples/s]

Applying chat template to train dataset:   0%|          | 0/1545 [00:00<?, ? examples/s]
Applying chat template to train dataset:  25%|██▍       | 379/1545 [00:00<00:00, 3744.19 examples/s]
Applying chat template to train dataset:  53%|█████▎    | 812/1545 [00:00<00:00, 4082.95 examples/s]
Applying chat template to train dataset:  92%|█████████▏| 1427/1545 [00:00<00:00, 4085.34 examples/s]
Applying chat template to train dataset: 100%|██████████| 1545/1545 [00:00<00:00, 4014.06 examples/s]

Tokenizing train dataset:   0%|          | 0/1545 [00:00<?, ? examples/s]
Tokenizing train dataset:   7%|▋         | 110/1545 [00:00<00:01, 1087.62 examples/s]
Tokenizing train dataset:  15%|█▌        | 232/1545 [00:00<00:01, 1153.16 examples/s]
Tokenizing train dataset:  23%|██▎       | 353/1545 [00:00<00:01, 1170.38 examples/s]
Tokenizing train dataset:  31%|███       | 472/1545 [00:00<00:00, 1172.74 examples/s]
Tokenizing train dataset:  41%|████▏     | 639/1545 [00:00<00:00, 1140.62 examples/s]
Tokenizing train dataset:  52%|█████▏    | 810/1545 [00:00<00:00, 1133.80 examples/s]
Tokenizing train dataset:  60%|██████    | 929/1545 [00:00<00:00, 1143.19 examples/s]
Tokenizing train dataset:  68%|██████▊   | 1051/1545 [00:00<00:00, 1161.64 examples/s]
Tokenizing train dataset:  76%|███████▌  | 1176/1545 [00:01<00:00, 1183.67 examples/s]
Tokenizing train dataset:  88%|████████▊ | 1354/1545 [00:01<00:00, 1182.12 examples/s]
Tokenizing train dataset:  99%|█████████▉| 1530/1545 [00:01<00:00, 1173.45 examples/s]
Tokenizing train dataset: 100%|██████████| 1545/1545 [00:01<00:00, 1158.17 examples/s]
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.

  0%|          | 0/1545 [00:00<?, ?it/s]Could not estimate the number of tokens of the input, floating-point operations will not be computed

  0%|          | 1/1545 [00:01<30:08,  1.17s/it]
  0%|          | 2/1545 [00:02<25:15,  1.02it/s]
  0%|          | 3/1545 [00:02<17:59,  1.43it/s]
  0%|          | 4/1545 [00:02<16:30,  1.56it/s]
  0%|          | 5/1545 [00:03<14:43,  1.74it/s]
  0%|          | 6/1545 [00:03<14:45,  1.74it/s]
  0%|          | 7/1545 [00:04<14:57,  1.71it/s]
  1%|          | 8/1545 [00:05<15:12,  1.68it/s]
  1%|          | 9/1545 [00:05<15:30,  1.65it/s]
  1%|          | 10/1545 [00:06<15:11,  1.68it/s]
                                                 
{'loss': 2.7692, 'grad_norm': 18.5, 'learning_rate': 9.935275080906149e-06, 'rewards/chosen': -13.267723083496094, 'rewards/rejected': -12.376993179321289, 'rewards/accuracies': 0.5, 'rewards/margins': -0.8907286524772644, 'logps/chosen': -272.36419677734375, 'logps/rejected': -224.4181365966797, 'logits/chosen': -0.7734757661819458, 'logits/rejected': -0.8499571084976196, 'epoch': 0.01}

  1%|          | 10/1545 [00:06<15:11,  1.68it/s]
  1%|          | 11/1545 [00:06<14:49,  1.72it/s]
  1%|          | 12/1545 [00:07<14:17,  1.79it/s]
  1%|          | 13/1545 [00:08<14:34,  1.75it/s]
  1%|          | 14/1545 [00:08<14:24,  1.77it/s]
  1%|          | 15/1545 [00:09<13:25,  1.90it/s]
  1%|          | 16/1545 [00:09<13:46,  1.85it/s]
  1%|          | 17/1545 [00:10<13:56,  1.83it/s]
  1%|          | 18/1545 [00:10<14:00,  1.82it/s]
  1%|          | 19/1545 [00:11<13:08,  1.94it/s]
  1%|▏         | 20/1545 [00:11<13:36,  1.87it/s]
                                                 
{'loss': 0.2762, 'grad_norm': 0.8828125, 'learning_rate': 9.870550161812299e-06, 'rewards/chosen': -16.860511779785156, 'rewards/rejected': -22.893171310424805, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 6.03265905380249, 'logps/chosen': -311.78521728515625, 'logps/rejected': -344.44610595703125, 'logits/chosen': -0.5504701137542725, 'logits/rejected': -0.5987938642501831, 'epoch': 0.01}

  1%|▏         | 20/1545 [00:11<13:36,  1.87it/s]
  1%|▏         | 21/1545 [00:12<14:05,  1.80it/s]
  1%|▏         | 22/1545 [00:12<13:28,  1.88it/s]
  1%|▏         | 23/1545 [00:13<13:57,  1.82it/s]
  2%|▏         | 24/1545 [00:14<14:16,  1.78it/s]
  2%|▏         | 25/1545 [00:14<14:12,  1.78it/s]
  2%|▏         | 26/1545 [00:15<13:29,  1.88it/s]
  2%|▏         | 27/1545 [00:15<13:49,  1.83it/s]
  2%|▏         | 28/1545 [00:16<13:54,  1.82it/s]
  2%|▏         | 29/1545 [00:16<13:13,  1.91it/s]
  2%|▏         | 30/1545 [00:17<13:19,  1.89it/s]
                                                 
{'loss': 4.4466, 'grad_norm': 7.82310962677002e-08, 'learning_rate': 9.805825242718447e-06, 'rewards/chosen': -35.80329132080078, 'rewards/rejected': -39.130760192871094, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 3.3274664878845215, 'logps/chosen': -513.8514404296875, 'logps/rejected': -501.32061767578125, 'logits/chosen': -2.7659506797790527, 'logits/rejected': -3.2069015502929688, 'epoch': 0.02}

  2%|▏         | 30/1545 [00:17<13:19,  1.89it/s]
  2%|▏         | 31/1545 [00:17<13:40,  1.84it/s]
  2%|▏         | 32/1545 [00:18<13:52,  1.82it/s]
  2%|▏         | 33/1545 [00:18<12:56,  1.95it/s]
  2%|▏         | 34/1545 [00:19<11:55,  2.11it/s]
  2%|▏         | 35/1545 [00:19<12:51,  1.96it/s]
  2%|▏         | 36/1545 [00:20<13:24,  1.88it/s]
  2%|▏         | 37/1545 [00:20<12:32,  2.00it/s]
  2%|▏         | 38/1545 [00:21<13:11,  1.90it/s]
  3%|▎         | 39/1545 [00:21<13:24,  1.87it/s]
  3%|▎         | 40/1545 [00:22<13:16,  1.89it/s]
                                                 
{'loss': 1.966, 'grad_norm': 1616.0, 'learning_rate': 9.741100323624596e-06, 'rewards/chosen': -40.694602966308594, 'rewards/rejected': -51.3997802734375, 'rewards/accuracies': 0.6000000238418579, 'rewards/margins': 10.705179214477539, 'logps/chosen': -561.7604370117188, 'logps/rejected': -636.379638671875, 'logits/chosen': -2.4068899154663086, 'logits/rejected': -2.463327407836914, 'epoch': 0.03}

  3%|▎         | 40/1545 [00:22<13:16,  1.89it/s]
  3%|▎         | 41/1545 [00:22<13:18,  1.88it/s]
  3%|▎         | 42/1545 [00:23<13:33,  1.85it/s]
  3%|▎         | 43/1545 [00:23<12:27,  2.01it/s]
  3%|▎         | 44/1545 [00:24<12:09,  2.06it/s]
  3%|▎         | 45/1545 [00:24<12:41,  1.97it/s]
  3%|▎         | 46/1545 [00:25<13:09,  1.90it/s]
  3%|▎         | 47/1545 [00:26<13:23,  1.86it/s]
  3%|▎         | 48/1545 [00:26<12:13,  2.04it/s]
  3%|▎         | 49/1545 [00:27<12:59,  1.92it/s]
  3%|▎         | 50/1545 [00:27<13:16,  1.88it/s]
                                                 
{'loss': 2.376, 'grad_norm': 26.75, 'learning_rate': 9.676375404530746e-06, 'rewards/chosen': -23.04972267150879, 'rewards/rejected': -31.14634132385254, 'rewards/accuracies': 0.6000000238418579, 'rewards/margins': 8.09661865234375, 'logps/chosen': -366.2521057128906, 'logps/rejected': -414.253173828125, 'logits/chosen': -1.4296232461929321, 'logits/rejected': -1.88992440700531, 'epoch': 0.03}

  3%|▎         | 50/1545 [00:27<13:16,  1.88it/s]
  3%|▎         | 51/1545 [00:28<13:31,  1.84it/s]
  3%|▎         | 52/1545 [00:28<13:15,  1.88it/s]
  3%|▎         | 53/1545 [00:29<13:37,  1.82it/s]
  3%|▎         | 54/1545 [00:29<13:40,  1.82it/s]
  4%|▎         | 55/1545 [00:30<12:53,  1.93it/s]
  4%|▎         | 56/1545 [00:30<13:25,  1.85it/s]
  4%|▎         | 57/1545 [00:31<13:30,  1.84it/s]
  4%|▍         | 58/1545 [00:31<13:25,  1.85it/s]
  4%|▍         | 59/1545 [00:32<13:04,  1.89it/s]
  4%|▍         | 60/1545 [00:32<13:26,  1.84it/s]
                                                 
{'loss': 3.5255, 'grad_norm': 1.2620790914089075e-19, 'learning_rate': 9.611650485436894e-06, 'rewards/chosen': -41.635231018066406, 'rewards/rejected': -57.55849075317383, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 15.923251152038574, 'logps/chosen': -561.2437744140625, 'logps/rejected': -689.3187255859375, 'logits/chosen': -3.666018009185791, 'logits/rejected': -3.4739983081817627, 'epoch': 0.04}

  4%|▍         | 60/1545 [00:33<13:26,  1.84it/s]
  4%|▍         | 61/1545 [00:33<13:25,  1.84it/s]
  4%|▍         | 62/1545 [00:33<12:48,  1.93it/s]
  4%|▍         | 63/1545 [00:34<13:15,  1.86it/s]
  4%|▍         | 64/1545 [00:36<21:48,  1.13it/s]
  4%|▍         | 65/1545 [00:36<19:19,  1.28it/s]
  4%|▍         | 66/1545 [00:37<17:18,  1.42it/s]
  4%|▍         | 67/1545 [00:37<16:28,  1.50it/s]
  4%|▍         | 68/1545 [00:38<15:40,  1.57it/s]
  4%|▍         | 69/1545 [00:38<14:16,  1.72it/s]
  5%|▍         | 70/1545 [00:39<14:19,  1.72it/s]
                                                 
{'loss': 0.4333, 'grad_norm': 4000.0, 'learning_rate': 9.546925566343042e-06, 'rewards/chosen': -40.17411422729492, 'rewards/rejected': -62.64879608154297, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 22.47468376159668, 'logps/chosen': -535.5774536132812, 'logps/rejected': -740.8128662109375, 'logits/chosen': -3.367959976196289, 'logits/rejected': -3.1139683723449707, 'epoch': 0.05}

  5%|▍         | 70/1545 [00:39<14:19,  1.72it/s]
  5%|▍         | 71/1545 [00:40<14:10,  1.73it/s]
  5%|▍         | 72/1545 [00:40<13:52,  1.77it/s]
  5%|▍         | 73/1545 [00:40<12:05,  2.03it/s]
  5%|▍         | 74/1545 [00:41<12:40,  1.93it/s]
  5%|▍         | 75/1545 [00:42<13:09,  1.86it/s]
  5%|▍         | 76/1545 [00:42<13:07,  1.87it/s]
  5%|▍         | 77/1545 [00:43<13:09,  1.86it/s]
  5%|▌         | 78/1545 [00:43<13:23,  1.83it/s]
  5%|▌         | 79/1545 [00:44<13:44,  1.78it/s]
  5%|▌         | 80/1545 [00:44<12:42,  1.92it/s]
                                                 
{'loss': 0.4326, 'grad_norm': 13824.0, 'learning_rate': 9.482200647249192e-06, 'rewards/chosen': -48.43840789794922, 'rewards/rejected': -76.73748779296875, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 28.299081802368164, 'logps/chosen': -626.9752807617188, 'logps/rejected': -873.7962646484375, 'logits/chosen': -2.483962297439575, 'logits/rejected': -2.444535493850708, 'epoch': 0.05}

  5%|▌         | 80/1545 [00:44<12:42,  1.92it/s]
  5%|▌         | 81/1545 [00:45<13:20,  1.83it/s]
  5%|▌         | 82/1545 [00:45<13:23,  1.82it/s]
  5%|▌         | 83/1545 [00:46<13:11,  1.85it/s]
  5%|▌         | 84/1545 [00:46<12:54,  1.89it/s]
  6%|▌         | 85/1545 [00:47<13:07,  1.85it/s]
  6%|▌         | 86/1545 [00:48<13:24,  1.81it/s]
  6%|▌         | 87/1545 [00:48<12:57,  1.87it/s]
  6%|▌         | 88/1545 [00:49<13:13,  1.84it/s]
  6%|▌         | 89/1545 [00:49<13:20,  1.82it/s]
  6%|▌         | 90/1545 [00:50<13:17,  1.82it/s]
                                                 
{'loss': 2.8932, 'grad_norm': 0.001190185546875, 'learning_rate': 9.41747572815534e-06, 'rewards/chosen': -35.804256439208984, 'rewards/rejected': -56.7745361328125, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 20.97027015686035, 'logps/chosen': -504.11846923828125, 'logps/rejected': -688.4282836914062, 'logits/chosen': -1.8350107669830322, 'logits/rejected': -1.7690149545669556, 'epoch': 0.06}

  6%|▌         | 90/1545 [00:50<13:17,  1.82it/s]
  6%|▌         | 91/1545 [00:50<13:03,  1.86it/s]
  6%|▌         | 92/1545 [00:51<13:32,  1.79it/s]
  6%|▌         | 93/1545 [00:51<13:47,  1.76it/s]
  6%|▌         | 94/1545 [00:52<12:27,  1.94it/s]
  6%|▌         | 95/1545 [00:52<11:45,  2.06it/s]
  6%|▌         | 96/1545 [00:53<12:24,  1.95it/s]
  6%|▋         | 97/1545 [00:53<12:46,  1.89it/s]
  6%|▋         | 98/1545 [00:54<12:21,  1.95it/s]
  6%|▋         | 99/1545 [00:54<12:47,  1.88it/s]
  6%|▋         | 100/1545 [00:55<13:04,  1.84it/s]
                                                  
{'loss': 1.493, 'grad_norm': 0.055908203125, 'learning_rate': 9.35275080906149e-06, 'rewards/chosen': -16.02591896057129, 'rewards/rejected': -22.183359146118164, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 6.157437801361084, 'logps/chosen': -340.04986572265625, 'logps/rejected': -332.9122314453125, 'logits/chosen': -1.376056432723999, 'logits/rejected': -1.7216821908950806, 'epoch': 0.06}

  6%|▋         | 100/1545 [00:55<13:04,  1.84it/s]
  7%|▋         | 101/1545 [00:56<13:28,  1.79it/s]
  7%|▋         | 102/1545 [00:56<12:21,  1.95it/s]
  7%|▋         | 103/1545 [00:57<12:58,  1.85it/s]
  7%|▋         | 104/1545 [00:57<13:06,  1.83it/s]
  7%|▋         | 105/1545 [00:58<11:49,  2.03it/s]
  7%|▋         | 106/1545 [00:58<11:20,  2.11it/s]
  7%|▋         | 107/1545 [00:59<12:10,  1.97it/s]
  7%|▋         | 108/1545 [00:59<11:12,  2.14it/s]
  7%|▋         | 109/1545 [01:00<11:50,  2.02it/s]
  7%|▋         | 110/1545 [01:00<11:33,  2.07it/s]
                                                  
{'loss': 1.7504, 'grad_norm': 68.5, 'learning_rate': 9.288025889967638e-06, 'rewards/chosen': -17.503616333007812, 'rewards/rejected': -20.149761199951172, 'rewards/accuracies': 0.5, 'rewards/margins': 2.6461453437805176, 'logps/chosen': -324.0485534667969, 'logps/rejected': -315.428955078125, 'logits/chosen': -1.0650968551635742, 'logits/rejected': -1.4754924774169922, 'epoch': 0.07}

  7%|▋         | 110/1545 [01:00<11:33,  2.07it/s]
  7%|▋         | 111/1545 [01:01<12:15,  1.95it/s]
  7%|▋         | 112/1545 [01:01<12:51,  1.86it/s]
  7%|▋         | 113/1545 [01:02<12:17,  1.94it/s]
  7%|▋         | 114/1545 [01:02<11:17,  2.11it/s]
  7%|▋         | 115/1545 [01:03<12:05,  1.97it/s]
  8%|▊         | 116/1545 [01:03<12:24,  1.92it/s]
  8%|▊         | 117/1545 [01:03<11:06,  2.14it/s]
  8%|▊         | 118/1545 [01:04<11:36,  2.05it/s]
  8%|▊         | 119/1545 [01:05<12:12,  1.95it/s]
  8%|▊         | 120/1545 [01:05<11:08,  2.13it/s]
                                                  
{'loss': 0.0812, 'grad_norm': 322.0, 'learning_rate': 9.223300970873788e-06, 'rewards/chosen': -22.525089263916016, 'rewards/rejected': -34.67645263671875, 'rewards/accuracies': 1.0, 'rewards/margins': 12.151366233825684, 'logps/chosen': -371.64007568359375, 'logps/rejected': -470.7439880371094, 'logits/chosen': -3.351106643676758, 'logits/rejected': -3.953688144683838, 'epoch': 0.08}

  8%|▊         | 120/1545 [01:05<11:08,  2.13it/s]
  8%|▊         | 121/1545 [01:05<10:27,  2.27it/s]
  8%|▊         | 122/1545 [01:06<10:18,  2.30it/s]
  8%|▊         | 123/1545 [01:06<11:14,  2.11it/s]
  8%|▊         | 124/1545 [01:07<11:40,  2.03it/s]
  8%|▊         | 125/1545 [01:07<12:05,  1.96it/s]
  8%|▊         | 126/1545 [01:08<11:53,  1.99it/s]
  8%|▊         | 127/1545 [01:08<12:14,  1.93it/s]
  8%|▊         | 128/1545 [01:09<11:14,  2.10it/s]
  8%|▊         | 129/1545 [01:09<11:30,  2.05it/s]
  8%|▊         | 130/1545 [01:10<11:49,  1.99it/s]
                                                  
{'loss': 8.072, 'grad_norm': 7.534027099609375e-05, 'learning_rate': 9.158576051779936e-06, 'rewards/chosen': -28.951553344726562, 'rewards/rejected': -31.89861488342285, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 2.9470643997192383, 'logps/chosen': -441.2911682128906, 'logps/rejected': -428.54656982421875, 'logits/chosen': -3.3049476146698, 'logits/rejected': -3.9077095985412598, 'epoch': 0.08}

  8%|▊         | 130/1545 [01:10<11:49,  1.99it/s]
  8%|▊         | 131/1545 [01:10<12:16,  1.92it/s]
  9%|▊         | 132/1545 [01:11<12:20,  1.91it/s]
  9%|▊         | 133/1545 [01:11<11:28,  2.05it/s]
  9%|▊         | 134/1545 [01:12<12:02,  1.95it/s]
  9%|▊         | 135/1545 [01:13<12:24,  1.89it/s]
  9%|▉         | 136/1545 [01:13<12:28,  1.88it/s]
  9%|▉         | 137/1545 [01:13<11:41,  2.01it/s]
  9%|▉         | 138/1545 [01:14<12:11,  1.92it/s]
  9%|▉         | 139/1545 [01:15<12:24,  1.89it/s]
  9%|▉         | 140/1545 [01:15<12:20,  1.90it/s]
                                                  
{'loss': 0.9608, 'grad_norm': 2.086162567138672e-07, 'learning_rate': 9.093851132686085e-06, 'rewards/chosen': -19.53597640991211, 'rewards/rejected': -32.082984924316406, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 12.547004699707031, 'logps/chosen': -317.4342956542969, 'logps/rejected': -419.8863220214844, 'logits/chosen': -2.860546112060547, 'logits/rejected': -3.551792621612549, 'epoch': 0.09}

  9%|▉         | 140/1545 [01:15<12:20,  1.90it/s]
  9%|▉         | 141/1545 [01:15<11:08,  2.10it/s]
  9%|▉         | 142/1545 [01:16<11:58,  1.95it/s]
  9%|▉         | 143/1545 [01:17<12:16,  1.90it/s]
  9%|▉         | 144/1545 [01:17<12:07,  1.93it/s]
  9%|▉         | 145/1545 [01:18<12:27,  1.87it/s]
  9%|▉         | 146/1545 [01:18<12:40,  1.84it/s]
 10%|▉         | 147/1545 [01:19<12:40,  1.84it/s]
 10%|▉         | 148/1545 [01:19<11:53,  1.96it/s]
 10%|▉         | 149/1545 [01:20<12:18,  1.89it/s]
 10%|▉         | 150/1545 [01:20<12:33,  1.85it/s]
                                                  
{'loss': 1.0182, 'grad_norm': 18.5, 'learning_rate': 9.029126213592233e-06, 'rewards/chosen': -17.807960510253906, 'rewards/rejected': -36.81556701660156, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 19.00760269165039, 'logps/chosen': -303.423095703125, 'logps/rejected': -467.78778076171875, 'logits/chosen': -3.5883376598358154, 'logits/rejected': -4.601079940795898, 'epoch': 0.1}

 10%|▉         | 150/1545 [01:20<12:33,  1.85it/s]
 10%|▉         | 151/1545 [01:21<12:15,  1.89it/s]
 10%|▉         | 152/1545 [01:21<12:18,  1.89it/s]
 10%|▉         | 153/1545 [01:22<12:30,  1.85it/s]
 10%|▉         | 154/1545 [01:23<12:41,  1.83it/s]
 10%|█         | 155/1545 [01:23<12:00,  1.93it/s]
 10%|█         | 156/1545 [01:24<12:31,  1.85it/s]
 10%|█         | 157/1545 [01:24<11:22,  2.03it/s]
 10%|█         | 158/1545 [01:25<11:47,  1.96it/s]
 10%|█         | 159/1545 [01:25<11:12,  2.06it/s]
 10%|█         | 160/1545 [01:26<11:57,  1.93it/s]
                                                  
{'loss': 0.0122, 'grad_norm': 6.184563972055912e-10, 'learning_rate': 8.964401294498383e-06, 'rewards/chosen': -19.044435501098633, 'rewards/rejected': -41.322547912597656, 'rewards/accuracies': 1.0, 'rewards/margins': 22.278114318847656, 'logps/chosen': -330.6968994140625, 'logps/rejected': -523.6119384765625, 'logits/chosen': -3.189589023590088, 'logits/rejected': -4.832894325256348, 'epoch': 0.1}

 10%|█         | 160/1545 [01:26<11:57,  1.93it/s]
 10%|█         | 161/1545 [01:26<12:13,  1.89it/s]
 10%|█         | 162/1545 [01:27<12:12,  1.89it/s]
 11%|█         | 163/1545 [01:27<12:04,  1.91it/s]
 11%|█         | 164/1545 [01:28<12:25,  1.85it/s]
 11%|█         | 165/1545 [01:28<12:29,  1.84it/s]
 11%|█         | 166/1545 [01:29<11:56,  1.92it/s]
 11%|█         | 167/1545 [01:29<12:14,  1.88it/s]
 11%|█         | 168/1545 [01:30<12:20,  1.86it/s]
 11%|█         | 169/1545 [01:30<12:19,  1.86it/s]
 11%|█         | 170/1545 [01:31<12:00,  1.91it/s]
                                                  
{'loss': 1.4695, 'grad_norm': 8.216219588366713e-20, 'learning_rate': 8.899676375404531e-06, 'rewards/chosen': -46.257057189941406, 'rewards/rejected': -82.32180786132812, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 36.06475067138672, 'logps/chosen': -610.6324462890625, 'logps/rejected': -940.2429809570312, 'logits/chosen': -4.580934047698975, 'logits/rejected': -5.433409214019775, 'epoch': 0.11}

 11%|█         | 170/1545 [01:31<12:00,  1.91it/s]
 11%|█         | 171/1545 [01:31<12:26,  1.84it/s]
 11%|█         | 172/1545 [01:32<12:13,  1.87it/s]
 11%|█         | 173/1545 [01:32<11:43,  1.95it/s]
 11%|█▏        | 174/1545 [01:33<12:17,  1.86it/s]
 11%|█▏        | 175/1545 [01:34<12:30,  1.82it/s]
 11%|█▏        | 176/1545 [01:34<12:26,  1.83it/s]
 11%|█▏        | 177/1545 [01:34<10:43,  2.13it/s]
 12%|█▏        | 178/1545 [01:35<11:25,  1.99it/s]
 12%|█▏        | 179/1545 [01:36<11:43,  1.94it/s]
 12%|█▏        | 180/1545 [01:37<16:58,  1.34it/s]
                                                  
{'loss': 1.1198, 'grad_norm': 0.03271484375, 'learning_rate': 8.834951456310681e-06, 'rewards/chosen': -24.422029495239258, 'rewards/rejected': -46.21000289916992, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 21.787975311279297, 'logps/chosen': -383.09124755859375, 'logps/rejected': -577.4222412109375, 'logits/chosen': -4.082339286804199, 'logits/rejected': -5.3761396408081055, 'epoch': 0.12}

 12%|█▏        | 180/1545 [01:37<16:58,  1.34it/s]
 12%|█▏        | 181/1545 [01:37<15:05,  1.51it/s]
 12%|█▏        | 182/1545 [01:38<14:34,  1.56it/s]
 12%|█▏        | 183/1545 [01:38<12:40,  1.79it/s]
 12%|█▏        | 184/1545 [01:39<12:34,  1.80it/s]
 12%|█▏        | 185/1545 [01:39<12:23,  1.83it/s]
 12%|█▏        | 186/1545 [01:40<12:33,  1.80it/s]
 12%|█▏        | 187/1545 [01:40<12:35,  1.80it/s]
 12%|█▏        | 188/1545 [01:41<11:44,  1.93it/s]
 12%|█▏        | 189/1545 [01:41<12:01,  1.88it/s]
 12%|█▏        | 190/1545 [01:42<12:12,  1.85it/s]
                                                  
{'loss': 0.0704, 'grad_norm': 0.00084686279296875, 'learning_rate': 8.770226537216829e-06, 'rewards/chosen': -19.553829193115234, 'rewards/rejected': -37.149200439453125, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 17.59537124633789, 'logps/chosen': -342.45489501953125, 'logps/rejected': -478.92120361328125, 'logits/chosen': -3.0818886756896973, 'logits/rejected': -4.275086402893066, 'epoch': 0.12}

 12%|█▏        | 190/1545 [01:42<12:12,  1.85it/s]
 12%|█▏        | 191/1545 [01:43<12:16,  1.84it/s]
 12%|█▏        | 192/1545 [01:43<11:25,  1.97it/s]
 12%|█▏        | 193/1545 [01:44<11:49,  1.90it/s]
 13%|█▎        | 194/1545 [01:44<12:07,  1.86it/s]
 13%|█▎        | 195/1545 [01:44<10:58,  2.05it/s]
 13%|█▎        | 196/1545 [01:45<10:29,  2.14it/s]
 13%|█▎        | 197/1545 [01:45<11:07,  2.02it/s]
 13%|█▎        | 198/1545 [01:46<11:23,  1.97it/s]
 13%|█▎        | 199/1545 [01:47<11:34,  1.94it/s]
 13%|█▎        | 200/1545 [01:47<11:37,  1.93it/s]
                                                  
{'loss': 0.012, 'grad_norm': 6.0625, 'learning_rate': 8.705501618122979e-06, 'rewards/chosen': -33.0775146484375, 'rewards/rejected': -53.18238067626953, 'rewards/accuracies': 1.0, 'rewards/margins': 20.104867935180664, 'logps/chosen': -470.367919921875, 'logps/rejected': -644.5907592773438, 'logits/chosen': -3.6899101734161377, 'logits/rejected': -5.173361778259277, 'epoch': 0.13}

 13%|█▎        | 200/1545 [01:47<11:37,  1.93it/s]
 13%|█▎        | 201/1545 [01:48<12:10,  1.84it/s]
 13%|█▎        | 202/1545 [01:48<12:21,  1.81it/s]
 13%|█▎        | 203/1545 [01:49<11:27,  1.95it/s]
 13%|█▎        | 204/1545 [01:49<11:50,  1.89it/s]
 13%|█▎        | 205/1545 [01:50<12:17,  1.82it/s]
 13%|█▎        | 206/1545 [01:50<12:13,  1.83it/s]
 13%|█▎        | 207/1545 [01:51<12:05,  1.84it/s]
 13%|█▎        | 208/1545 [01:51<12:23,  1.80it/s]
 14%|█▎        | 209/1545 [01:52<12:19,  1.81it/s]
 14%|█▎        | 210/1545 [01:53<11:46,  1.89it/s]
                                                  
{'loss': 0.1499, 'grad_norm': 0.0, 'learning_rate': 8.640776699029127e-06, 'rewards/chosen': -35.89064025878906, 'rewards/rejected': -85.90962219238281, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 50.01898956298828, 'logps/chosen': -512.2362670898438, 'logps/rejected': -987.8312377929688, 'logits/chosen': -3.8464527130126953, 'logits/rejected': -5.528945446014404, 'epoch': 0.14}

 14%|█▎        | 210/1545 [01:53<11:46,  1.89it/s]
 14%|█▎        | 211/1545 [01:53<12:16,  1.81it/s]
 14%|█▎        | 212/1545 [01:54<12:23,  1.79it/s]
 14%|█▍        | 213/1545 [01:54<12:12,  1.82it/s]
 14%|█▍        | 214/1545 [01:55<11:54,  1.86it/s]
 14%|█▍        | 215/1545 [01:55<12:10,  1.82it/s]
 14%|█▍        | 216/1545 [01:56<12:09,  1.82it/s]
 14%|█▍        | 217/1545 [01:56<11:13,  1.97it/s]
 14%|█▍        | 218/1545 [01:57<11:42,  1.89it/s]
 14%|█▍        | 219/1545 [01:57<11:54,  1.86it/s]
 14%|█▍        | 220/1545 [01:58<11:52,  1.86it/s]
                                                  
{'loss': 0.6222, 'grad_norm': 720.0, 'learning_rate': 8.576051779935276e-06, 'rewards/chosen': -39.93156814575195, 'rewards/rejected': -58.58339309692383, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 18.65182876586914, 'logps/chosen': -554.5364990234375, 'logps/rejected': -703.7771606445312, 'logits/chosen': -3.3229453563690186, 'logits/rejected': -4.664990425109863, 'epoch': 0.14}

 14%|█▍        | 220/1545 [01:58<11:52,  1.86it/s]
 14%|█▍        | 221/1545 [01:58<11:11,  1.97it/s]
 14%|█▍        | 222/1545 [01:59<11:35,  1.90it/s]
 14%|█▍        | 223/1545 [02:00<11:46,  1.87it/s]
 14%|█▍        | 224/1545 [02:00<10:40,  2.06it/s]
 15%|█▍        | 225/1545 [02:00<10:18,  2.13it/s]
 15%|█▍        | 226/1545 [02:01<11:03,  1.99it/s]
 15%|█▍        | 227/1545 [02:01<10:09,  2.16it/s]
 15%|█▍        | 228/1545 [02:02<10:43,  2.05it/s]
 15%|█▍        | 229/1545 [02:02<09:23,  2.33it/s]
 15%|█▍        | 230/1545 [02:03<10:25,  2.10it/s]
                                                  
{'loss': 0.0693, 'grad_norm': 0.0, 'learning_rate': 8.511326860841424e-06, 'rewards/chosen': -45.17763900756836, 'rewards/rejected': -84.54122924804688, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 39.363590240478516, 'logps/chosen': -602.5833740234375, 'logps/rejected': -968.5574951171875, 'logits/chosen': -3.7494091987609863, 'logits/rejected': -5.287755966186523, 'epoch': 0.15}

 15%|█▍        | 230/1545 [02:03<10:25,  2.10it/s]
 15%|█▍        | 231/1545 [02:03<11:00,  1.99it/s]
 15%|█▌        | 232/1545 [02:04<11:10,  1.96it/s]
 15%|█▌        | 233/1545 [02:04<11:03,  1.98it/s]
 15%|█▌        | 234/1545 [02:05<11:36,  1.88it/s]
 15%|█▌        | 235/1545 [02:05<11:29,  1.90it/s]
 15%|█▌        | 236/1545 [02:06<11:04,  1.97it/s]
 15%|█▌        | 237/1545 [02:06<11:44,  1.86it/s]
 15%|█▌        | 238/1545 [02:07<11:56,  1.82it/s]
 15%|█▌        | 239/1545 [02:08<11:52,  1.83it/s]
 16%|█▌        | 240/1545 [02:08<10:52,  2.00it/s]
                                                  
{'loss': 0.542, 'grad_norm': 334.0, 'learning_rate': 8.446601941747573e-06, 'rewards/chosen': -34.770347595214844, 'rewards/rejected': -57.98395919799805, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 23.213611602783203, 'logps/chosen': -471.40185546875, 'logps/rejected': -686.8424682617188, 'logits/chosen': -3.1056551933288574, 'logits/rejected': -4.154791355133057, 'epoch': 0.16}

 16%|█▌        | 240/1545 [02:08<10:52,  2.00it/s]
 16%|█▌        | 241/1545 [02:09<11:23,  1.91it/s]
 16%|█▌        | 242/1545 [02:09<11:28,  1.89it/s]
 16%|█▌        | 243/1545 [02:10<11:26,  1.90it/s]
 16%|█▌        | 244/1545 [02:10<11:16,  1.92it/s]
 16%|█▌        | 245/1545 [02:11<11:35,  1.87it/s]
 16%|█▌        | 246/1545 [02:11<11:44,  1.84it/s]
 16%|█▌        | 247/1545 [02:12<11:09,  1.94it/s]
 16%|█▌        | 248/1545 [02:12<11:29,  1.88it/s]
 16%|█▌        | 249/1545 [02:13<11:42,  1.85it/s]
 16%|█▌        | 250/1545 [02:13<11:35,  1.86it/s]
                                                  
{'loss': 0.0827, 'grad_norm': 0.00689697265625, 'learning_rate': 8.381877022653722e-06, 'rewards/chosen': -13.427328109741211, 'rewards/rejected': -41.39923095703125, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 27.97190284729004, 'logps/chosen': -279.3583068847656, 'logps/rejected': -531.146484375, 'logits/chosen': -1.5291231870651245, 'logits/rejected': -3.7414169311523438, 'epoch': 0.16}

 16%|β–ˆβ–Œ        | 251/1545 [02:14<11:04,  1.95it/s]
 16%|β–ˆβ–‹        | 252/1545 [02:14<11:31,  1.87it/s]
 16%|β–ˆβ–‹        | 253/1545 [02:15<10:22,  2.08it/s]
 16%|β–ˆβ–‹        | 254/1545 [02:15<10:44,  2.00it/s]
 17%|β–ˆβ–‹        | 255/1545 [02:16<10:06,  2.13it/s]
 17%|β–ˆβ–‹        | 256/1545 [02:16<10:46,  2.00it/s]
 17%|β–ˆβ–‹        | 257/1545 [02:17<11:12,  1.92it/s]
 17%|β–ˆβ–‹        | 258/1545 [02:17<11:10,  1.92it/s]
 17%|β–ˆβ–‹        | 259/1545 [02:18<10:05,  2.12it/s]
 17%|β–ˆβ–‹        | 260/1545 [02:18<10:55,  1.96it/s]
                                                  
{'loss': 0.4419, 'grad_norm': 2.9558577807620168e-12, 'learning_rate': 8.317152103559872e-06, 'rewards/chosen': -17.75248146057129, 'rewards/rejected': -30.9813175201416, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 13.228837966918945, 'logps/chosen': -318.49920654296875, 'logps/rejected': -418.8551330566406, 'logits/chosen': -2.046048164367676, 'logits/rejected': -2.655272960662842, 'epoch': 0.17}

 17%|β–ˆβ–‹        | 261/1545 [02:19<11:08,  1.92it/s]
 17%|β–ˆβ–‹        | 262/1545 [02:19<10:45,  1.99it/s]
 17%|β–ˆβ–‹        | 263/1545 [02:20<11:14,  1.90it/s]
 17%|β–ˆβ–‹        | 264/1545 [02:20<11:31,  1.85it/s]
 17%|β–ˆβ–‹        | 265/1545 [02:21<11:34,  1.84it/s]
 17%|β–ˆβ–‹        | 266/1545 [02:21<10:52,  1.96it/s]
 17%|β–ˆβ–‹        | 267/1545 [02:22<11:28,  1.86it/s]
 17%|β–ˆβ–‹        | 268/1545 [02:23<11:40,  1.82it/s]
 17%|β–ˆβ–‹        | 269/1545 [02:23<11:10,  1.90it/s]
 17%|β–ˆβ–‹        | 270/1545 [02:24<11:41,  1.82it/s]
                                                  
{'loss': 0.0115, 'grad_norm': 4.363059997558594e-05, 'learning_rate': 8.25242718446602e-06, 'rewards/chosen': -12.999127388000488, 'rewards/rejected': -31.61448097229004, 'rewards/accuracies': 1.0, 'rewards/margins': 18.61534881591797, 'logps/chosen': -296.9534606933594, 'logps/rejected': -435.48883056640625, 'logits/chosen': -1.737764596939087, 'logits/rejected': -3.4163742065429688, 'epoch': 0.17}

 18%|β–ˆβ–Š        | 271/1545 [02:24<12:05,  1.76it/s]
 18%|β–ˆβ–Š        | 272/1545 [02:25<11:49,  1.79it/s]
 18%|β–ˆβ–Š        | 273/1545 [02:25<11:08,  1.90it/s]
 18%|β–ˆβ–Š        | 274/1545 [02:26<11:28,  1.85it/s]
 18%|β–ˆβ–Š        | 275/1545 [02:26<10:19,  2.05it/s]
 18%|β–ˆβ–Š        | 276/1545 [02:27<10:39,  1.99it/s]
 18%|β–ˆβ–Š        | 277/1545 [02:27<10:04,  2.10it/s]
 18%|β–ˆβ–Š        | 278/1545 [02:28<10:43,  1.97it/s]
 18%|β–ˆβ–Š        | 279/1545 [02:28<10:56,  1.93it/s]
 18%|β–ˆβ–Š        | 280/1545 [02:29<10:56,  1.93it/s]
                                                  
{'loss': 0.0759, 'grad_norm': 1.4028046280145645e-08, 'learning_rate': 8.18770226537217e-06, 'rewards/chosen': -28.01290512084961, 'rewards/rejected': -43.24602508544922, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 15.233118057250977, 'logps/chosen': -442.2579040527344, 'logps/rejected': -560.4046630859375, 'logits/chosen': -2.9247617721557617, 'logits/rejected': -3.853868007659912, 'epoch': 0.18}

 18%|β–ˆβ–Š        | 281/1545 [02:29<09:50,  2.14it/s]
 18%|β–ˆβ–Š        | 282/1545 [02:30<10:36,  1.98it/s]
 18%|β–ˆβ–Š        | 283/1545 [02:30<11:00,  1.91it/s]
 18%|β–ˆβ–Š        | 284/1545 [02:31<10:33,  1.99it/s]
 18%|β–ˆβ–Š        | 285/1545 [02:31<11:11,  1.88it/s]
 19%|β–ˆβ–Š        | 286/1545 [02:32<11:30,  1.82it/s]
 19%|β–ˆβ–Š        | 287/1545 [02:33<11:30,  1.82it/s]
 19%|β–ˆβ–Š        | 288/1545 [02:33<10:37,  1.97it/s]
 19%|β–ˆβ–Š        | 289/1545 [02:34<11:15,  1.86it/s]
 19%|β–ˆβ–‰        | 290/1545 [02:34<11:23,  1.84it/s]
                                                  
{'loss': 3.3436, 'grad_norm': 5.0090140368830305e-17, 'learning_rate': 8.122977346278318e-06, 'rewards/chosen': -25.31455421447754, 'rewards/rejected': -46.47494888305664, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 21.16039276123047, 'logps/chosen': -403.2884216308594, 'logps/rejected': -592.1914672851562, 'logits/chosen': -2.8424019813537598, 'logits/rejected': -3.816692352294922, 'epoch': 0.19}

 19%|β–ˆβ–‰        | 291/1545 [02:35<11:07,  1.88it/s]
 19%|β–ˆβ–‰        | 292/1545 [02:35<11:27,  1.82it/s]
 19%|β–ˆβ–‰        | 293/1545 [02:36<11:42,  1.78it/s]
 19%|β–ˆβ–‰        | 294/1545 [02:36<11:37,  1.79it/s]
 19%|β–ˆβ–‰        | 295/1545 [02:37<10:51,  1.92it/s]
 19%|β–ˆβ–‰        | 296/1545 [02:37<11:12,  1.86it/s]
 19%|β–ˆβ–‰        | 297/1545 [02:39<16:07,  1.29it/s]
 19%|β–ˆβ–‰        | 298/1545 [02:39<14:14,  1.46it/s]
 19%|β–ˆβ–‰        | 299/1545 [02:40<13:15,  1.57it/s]
 19%|β–ˆβ–‰        | 300/1545 [02:40<12:50,  1.62it/s]
                                                  
{'loss': 0.0003, 'grad_norm': 3.92901711165905e-09, 'learning_rate': 8.058252427184466e-06, 'rewards/chosen': -10.483701705932617, 'rewards/rejected': -42.29178237915039, 'rewards/accuracies': 1.0, 'rewards/margins': 31.808080673217773, 'logps/chosen': -243.2613983154297, 'logps/rejected': -532.6328735351562, 'logits/chosen': -1.451924443244934, 'logits/rejected': -4.144944190979004, 'epoch': 0.19}

 19%|β–ˆβ–‰        | 301/1545 [02:41<12:28,  1.66it/s]
 20%|β–ˆβ–‰        | 302/1545 [02:41<11:27,  1.81it/s]
 20%|β–ˆβ–‰        | 303/1545 [02:42<11:37,  1.78it/s]
 20%|β–ˆβ–‰        | 304/1545 [02:42<11:35,  1.78it/s]
 20%|β–ˆβ–‰        | 305/1545 [02:43<11:32,  1.79it/s]
 20%|β–ˆβ–‰        | 306/1545 [02:44<11:30,  1.80it/s]
 20%|β–ˆβ–‰        | 307/1545 [02:44<11:37,  1.78it/s]
 20%|β–ˆβ–‰        | 308/1545 [02:45<11:41,  1.76it/s]
 20%|β–ˆβ–ˆ        | 309/1545 [02:45<10:48,  1.91it/s]
 20%|β–ˆβ–ˆ        | 310/1545 [02:46<11:14,  1.83it/s]
                                                  
{'loss': 0.4085, 'grad_norm': 520.0, 'learning_rate': 7.993527508090616e-06, 'rewards/chosen': -10.010625839233398, 'rewards/rejected': -25.10512924194336, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 15.094502449035645, 'logps/chosen': -236.5189971923828, 'logps/rejected': -370.3948059082031, 'logits/chosen': -2.0673410892486572, 'logits/rejected': -3.2709903717041016, 'epoch': 0.2}

 20%|β–ˆβ–ˆ        | 311/1545 [02:46<11:20,  1.81it/s]
 20%|β–ˆβ–ˆ        | 312/1545 [02:47<11:16,  1.82it/s]
 20%|β–ˆβ–ˆ        | 313/1545 [02:47<10:59,  1.87it/s]
 20%|β–ˆβ–ˆ        | 314/1545 [02:48<11:09,  1.84it/s]
 20%|β–ˆβ–ˆ        | 315/1545 [02:48<11:15,  1.82it/s]
 20%|β–ˆβ–ˆ        | 316/1545 [02:49<10:30,  1.95it/s]
 21%|β–ˆβ–ˆ        | 317/1545 [02:49<11:07,  1.84it/s]
 21%|β–ˆβ–ˆ        | 318/1545 [02:50<11:17,  1.81it/s]
 21%|β–ˆβ–ˆ        | 319/1545 [02:51<11:14,  1.82it/s]
 21%|β–ˆβ–ˆ        | 320/1545 [02:51<11:03,  1.85it/s]
                                                  
{'loss': 0.0186, 'grad_norm': 348.0, 'learning_rate': 7.928802588996765e-06, 'rewards/chosen': -7.756754398345947, 'rewards/rejected': -28.2783260345459, 'rewards/accuracies': 1.0, 'rewards/margins': 20.52157211303711, 'logps/chosen': -252.49423217773438, 'logps/rejected': -395.2229919433594, 'logits/chosen': -1.2267696857452393, 'logits/rejected': -2.0097601413726807, 'epoch': 0.21}

 21%|β–ˆβ–ˆ        | 321/1545 [02:52<11:35,  1.76it/s]
 21%|β–ˆβ–ˆ        | 322/1545 [02:52<10:21,  1.97it/s]
 21%|β–ˆβ–ˆ        | 323/1545 [02:53<10:10,  2.00it/s]
 21%|β–ˆβ–ˆ        | 324/1545 [02:53<10:31,  1.93it/s]
 21%|β–ˆβ–ˆ        | 325/1545 [02:54<10:45,  1.89it/s]
 21%|β–ˆβ–ˆ        | 326/1545 [02:54<11:08,  1.82it/s]
 21%|β–ˆβ–ˆ        | 327/1545 [02:55<10:09,  2.00it/s]
 21%|β–ˆβ–ˆ        | 328/1545 [02:55<10:38,  1.91it/s]
 21%|β–ˆβ–ˆβ–       | 329/1545 [02:56<10:51,  1.87it/s]
 21%|β–ˆβ–ˆβ–       | 330/1545 [02:56<10:48,  1.87it/s]
                                                  
{'loss': 0.6079, 'grad_norm': 3.0184188481996443e-16, 'learning_rate': 7.864077669902913e-06, 'rewards/chosen': -10.482062339782715, 'rewards/rejected': -27.2606201171875, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 16.7785587310791, 'logps/chosen': -251.0252227783203, 'logps/rejected': -369.49072265625, 'logits/chosen': -1.2209112644195557, 'logits/rejected': -2.5535261631011963, 'epoch': 0.21}

 21%|β–ˆβ–ˆβ–       | 331/1545 [02:57<10:39,  1.90it/s]
 21%|β–ˆβ–ˆβ–       | 332/1545 [02:58<11:21,  1.78it/s]
 22%|β–ˆβ–ˆβ–       | 333/1545 [02:58<11:12,  1.80it/s]
 22%|β–ˆβ–ˆβ–       | 334/1545 [02:58<10:30,  1.92it/s]
 22%|β–ˆβ–ˆβ–       | 335/1545 [02:59<10:59,  1.83it/s]
 22%|β–ˆβ–ˆβ–       | 336/1545 [03:00<11:08,  1.81it/s]
 22%|β–ˆβ–ˆβ–       | 337/1545 [03:00<11:01,  1.82it/s]
 22%|β–ˆβ–ˆβ–       | 338/1545 [03:01<10:47,  1.86it/s]
 22%|β–ˆβ–ˆβ–       | 339/1545 [03:01<11:00,  1.83it/s]
 22%|β–ˆβ–ˆβ–       | 340/1545 [03:02<11:12,  1.79it/s]
                                                  
{'loss': 0.6174, 'grad_norm': 2.7755575615628914e-17, 'learning_rate': 7.799352750809061e-06, 'rewards/chosen': -15.824236869812012, 'rewards/rejected': -32.17787170410156, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 16.35363006591797, 'logps/chosen': -278.61468505859375, 'logps/rejected': -427.072021484375, 'logits/chosen': -2.014265537261963, 'logits/rejected': -2.9095723628997803, 'epoch': 0.22}

 22%|β–ˆβ–ˆβ–       | 341/1545 [03:02<10:38,  1.89it/s]
 22%|β–ˆβ–ˆβ–       | 342/1545 [03:03<11:00,  1.82it/s]
 22%|β–ˆβ–ˆβ–       | 343/1545 [03:03<09:57,  2.01it/s]
 22%|β–ˆβ–ˆβ–       | 344/1545 [03:04<10:23,  1.93it/s]
 22%|β–ˆβ–ˆβ–       | 345/1545 [03:04<09:48,  2.04it/s]
 22%|β–ˆβ–ˆβ–       | 346/1545 [03:05<10:28,  1.91it/s]
 22%|β–ˆβ–ˆβ–       | 347/1545 [03:05<10:40,  1.87it/s]
 23%|β–ˆβ–ˆβ–Ž       | 348/1545 [03:06<10:37,  1.88it/s]
 23%|β–ˆβ–ˆβ–Ž       | 349/1545 [03:07<10:33,  1.89it/s]
 23%|β–ˆβ–ˆβ–Ž       | 350/1545 [03:07<10:42,  1.86it/s]
                                                  
{'loss': 0.0, 'grad_norm': 4.607859233063394e-19, 'learning_rate': 7.734627831715211e-06, 'rewards/chosen': -10.83470344543457, 'rewards/rejected': -41.98002624511719, 'rewards/accuracies': 1.0, 'rewards/margins': 31.145328521728516, 'logps/chosen': -240.7436065673828, 'logps/rejected': -537.6058349609375, 'logits/chosen': -1.7035901546478271, 'logits/rejected': -3.3107447624206543, 'epoch': 0.23}

 23%|β–ˆβ–ˆβ–Ž       | 351/1545 [03:08<10:46,  1.85it/s]
 23%|β–ˆβ–ˆβ–Ž       | 352/1545 [03:08<09:58,  1.99it/s]
 23%|β–ˆβ–ˆβ–Ž       | 353/1545 [03:09<10:20,  1.92it/s]
 23%|β–ˆβ–ˆβ–Ž       | 354/1545 [03:09<10:31,  1.89it/s]
 23%|β–ˆβ–ˆβ–Ž       | 355/1545 [03:10<10:33,  1.88it/s]
 23%|β–ˆβ–ˆβ–Ž       | 356/1545 [03:10<10:00,  1.98it/s]
 23%|β–ˆβ–ˆβ–Ž       | 357/1545 [03:11<10:30,  1.88it/s]
 23%|β–ˆβ–ˆβ–Ž       | 358/1545 [03:11<09:28,  2.09it/s]
 23%|β–ˆβ–ˆβ–Ž       | 359/1545 [03:12<09:52,  2.00it/s]
 23%|β–ˆβ–ˆβ–Ž       | 360/1545 [03:12<09:29,  2.08it/s]
                                                  
{'loss': 4.1093, 'grad_norm': 7.44648787076585e-12, 'learning_rate': 7.66990291262136e-06, 'rewards/chosen': -18.4683837890625, 'rewards/rejected': -34.16698455810547, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 15.698600769042969, 'logps/chosen': -337.5670166015625, 'logps/rejected': -447.79180908203125, 'logits/chosen': -2.0399553775787354, 'logits/rejected': -3.212721347808838, 'epoch': 0.23}

 23%|β–ˆβ–ˆβ–Ž       | 361/1545 [03:13<10:10,  1.94it/s]
 23%|β–ˆβ–ˆβ–Ž       | 362/1545 [03:13<10:25,  1.89it/s]
 23%|β–ˆβ–ˆβ–Ž       | 363/1545 [03:14<10:11,  1.93it/s]
 24%|β–ˆβ–ˆβ–Ž       | 364/1545 [03:14<10:17,  1.91it/s]
 24%|β–ˆβ–ˆβ–Ž       | 365/1545 [03:15<10:34,  1.86it/s]
 24%|β–ˆβ–ˆβ–Ž       | 366/1545 [03:15<10:50,  1.81it/s]
 24%|β–ˆβ–ˆβ–       | 367/1545 [03:16<10:09,  1.93it/s]
 24%|β–ˆβ–ˆβ–       | 368/1545 [03:16<10:43,  1.83it/s]
 24%|β–ˆβ–ˆβ–       | 369/1545 [03:17<10:49,  1.81it/s]
 24%|β–ˆβ–ˆβ–       | 370/1545 [03:18<10:31,  1.86it/s]
                                                  
{'loss': 1.606, 'grad_norm': 0.011474609375, 'learning_rate': 7.605177993527508e-06, 'rewards/chosen': -15.731010437011719, 'rewards/rejected': -35.59105682373047, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 19.86004638671875, 'logps/chosen': -323.3865966796875, 'logps/rejected': -465.696044921875, 'logits/chosen': -2.1124587059020996, 'logits/rejected': -3.7612509727478027, 'epoch': 0.24}

 24%|β–ˆβ–ˆβ–       | 371/1545 [03:18<10:42,  1.83it/s]
 24%|β–ˆβ–ˆβ–       | 372/1545 [03:19<10:48,  1.81it/s]
 24%|β–ˆβ–ˆβ–       | 373/1545 [03:19<10:53,  1.79it/s]
 24%|β–ˆβ–ˆβ–       | 374/1545 [03:20<10:04,  1.94it/s]
 24%|β–ˆβ–ˆβ–       | 375/1545 [03:20<10:27,  1.86it/s]
 24%|β–ˆβ–ˆβ–       | 376/1545 [03:21<10:43,  1.82it/s]
 24%|β–ˆβ–ˆβ–       | 377/1545 [03:21<10:30,  1.85it/s]
 24%|β–ˆβ–ˆβ–       | 378/1545 [03:22<10:32,  1.85it/s]
 25%|β–ˆβ–ˆβ–       | 379/1545 [03:22<10:46,  1.80it/s]
 25%|β–ˆβ–ˆβ–       | 380/1545 [03:23<09:41,  2.00it/s]
                                                  
{'loss': 0.0008, 'grad_norm': 2.7466739993542433e-10, 'learning_rate': 7.540453074433658e-06, 'rewards/chosen': -12.65916919708252, 'rewards/rejected': -49.6566047668457, 'rewards/accuracies': 1.0, 'rewards/margins': 36.99742889404297, 'logps/chosen': -260.13958740234375, 'logps/rejected': -610.1561279296875, 'logits/chosen': -1.6531693935394287, 'logits/rejected': -4.022229194641113, 'epoch': 0.25}

 25%|β–ˆβ–ˆβ–       | 381/1545 [03:23<09:32,  2.03it/s]
 25%|β–ˆβ–ˆβ–       | 382/1545 [03:24<10:10,  1.90it/s]
 25%|β–ˆβ–ˆβ–       | 383/1545 [03:24<10:24,  1.86it/s]
 25%|β–ˆβ–ˆβ–       | 384/1545 [03:25<10:29,  1.85it/s]
 25%|β–ˆβ–ˆβ–       | 385/1545 [03:25<09:47,  1.97it/s]
 25%|β–ˆβ–ˆβ–       | 386/1545 [03:26<09:02,  2.14it/s]
 25%|β–ˆβ–ˆβ–Œ       | 387/1545 [03:26<09:38,  2.00it/s]
 25%|β–ˆβ–ˆβ–Œ       | 388/1545 [03:27<09:53,  1.95it/s]
 25%|β–ˆβ–ˆβ–Œ       | 389/1545 [03:27<09:30,  2.03it/s]
 25%|β–ˆβ–ˆβ–Œ       | 390/1545 [03:28<08:51,  2.17it/s]
                                                  
{'loss': 0.0693, 'grad_norm': 0.205078125, 'learning_rate': 7.475728155339807e-06, 'rewards/chosen': -22.143047332763672, 'rewards/rejected': -71.9369888305664, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 49.793941497802734, 'logps/chosen': -388.7550964355469, 'logps/rejected': -835.4134521484375, 'logits/chosen': -2.400635242462158, 'logits/rejected': -5.395668029785156, 'epoch': 0.25}

 25%|β–ˆβ–ˆβ–Œ       | 391/1545 [03:28<08:24,  2.29it/s]
 25%|β–ˆβ–ˆβ–Œ       | 392/1545 [03:29<08:08,  2.36it/s]
 25%|β–ˆβ–ˆβ–Œ       | 393/1545 [03:29<08:41,  2.21it/s]
 26%|β–ˆβ–ˆβ–Œ       | 394/1545 [03:29<08:25,  2.28it/s]
 26%|β–ˆβ–ˆβ–Œ       | 395/1545 [03:30<09:14,  2.08it/s]
 26%|β–ˆβ–ˆβ–Œ       | 396/1545 [03:31<09:45,  1.96it/s]
 26%|β–ˆβ–ˆβ–Œ       | 397/1545 [03:31<09:43,  1.97it/s]
 26%|β–ˆβ–ˆβ–Œ       | 398/1545 [03:31<08:49,  2.16it/s]
 26%|β–ˆβ–ˆβ–Œ       | 399/1545 [03:32<09:31,  2.01it/s]
 26%|β–ˆβ–ˆβ–Œ       | 400/1545 [03:33<09:58,  1.91it/s]
                                                  
{'loss': 2.9874, 'grad_norm': 0.0002765655517578125, 'learning_rate': 7.411003236245955e-06, 'rewards/chosen': -18.571969985961914, 'rewards/rejected': -58.970985412597656, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 40.39901351928711, 'logps/chosen': -332.44952392578125, 'logps/rejected': -692.6171875, 'logits/chosen': -2.102853298187256, 'logits/rejected': -5.4109063148498535, 'epoch': 0.26}

 26%|β–ˆβ–ˆβ–Œ       | 401/1545 [03:33<09:33,  1.99it/s]
 26%|β–ˆβ–ˆβ–Œ       | 402/1545 [03:34<10:00,  1.90it/s]
 26%|β–ˆβ–ˆβ–Œ       | 403/1545 [03:34<10:11,  1.87it/s]
 26%|β–ˆβ–ˆβ–Œ       | 404/1545 [03:35<10:16,  1.85it/s]
 26%|β–ˆβ–ˆβ–Œ       | 405/1545 [03:35<09:46,  1.94it/s]
 26%|β–ˆβ–ˆβ–‹       | 406/1545 [03:36<10:07,  1.87it/s]
 26%|β–ˆβ–ˆβ–‹       | 407/1545 [03:36<10:23,  1.83it/s]
 26%|β–ˆβ–ˆβ–‹       | 408/1545 [03:37<10:05,  1.88it/s]
 26%|β–ˆβ–ˆβ–‹       | 409/1545 [03:37<10:21,  1.83it/s]
 27%|β–ˆβ–ˆβ–‹       | 410/1545 [03:38<10:25,  1.82it/s]
                                                  
{'loss': 0.0823, 'grad_norm': 1.1188966420050406e-16, 'learning_rate': 7.3462783171521046e-06, 'rewards/chosen': -16.90290069580078, 'rewards/rejected': -36.86577224731445, 'rewards/accuracies': 1.0, 'rewards/margins': 19.962865829467773, 'logps/chosen': -327.2436828613281, 'logps/rejected': -486.8968811035156, 'logits/chosen': -1.703176498413086, 'logits/rejected': -2.9482998847961426, 'epoch': 0.27}

 27%|β–ˆβ–ˆβ–‹       | 411/1545 [03:39<10:34,  1.79it/s]
 27%|β–ˆβ–ˆβ–‹       | 412/1545 [03:40<13:59,  1.35it/s]
 27%|β–ˆβ–ˆβ–‹       | 413/1545 [03:40<13:05,  1.44it/s]
 27%|β–ˆβ–ˆβ–‹       | 414/1545 [03:41<12:11,  1.55it/s]
 27%|β–ˆβ–ˆβ–‹       | 415/1545 [03:41<11:01,  1.71it/s]
 27%|β–ˆβ–ˆβ–‹       | 416/1545 [03:42<11:09,  1.69it/s]
 27%|β–ˆβ–ˆβ–‹       | 417/1545 [03:43<10:52,  1.73it/s]
 27%|β–ˆβ–ˆβ–‹       | 418/1545 [03:43<10:42,  1.75it/s]
 27%|β–ˆβ–ˆβ–‹       | 419/1545 [03:43<09:55,  1.89it/s]
 27%|β–ˆβ–ˆβ–‹       | 420/1545 [03:44<10:21,  1.81it/s]
                                                  
{'loss': 1.0062, 'grad_norm': 0.0, 'learning_rate': 7.2815533980582534e-06, 'rewards/chosen': -29.327747344970703, 'rewards/rejected': -75.21330261230469, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 45.88555145263672, 'logps/chosen': -462.95111083984375, 'logps/rejected': -864.5911865234375, 'logits/chosen': -2.609936237335205, 'logits/rejected': -4.504573345184326, 'epoch': 0.27}

 27%|β–ˆβ–ˆβ–‹       | 421/1545 [03:45<10:26,  1.79it/s]
 27%|β–ˆβ–ˆβ–‹       | 422/1545 [03:45<09:57,  1.88it/s]
 27%|β–ˆβ–ˆβ–‹       | 423/1545 [03:46<10:24,  1.80it/s]
 27%|β–ˆβ–ˆβ–‹       | 424/1545 [03:46<10:29,  1.78it/s]
 28%|β–ˆβ–ˆβ–Š       | 425/1545 [03:47<10:22,  1.80it/s]
 28%|β–ˆβ–ˆβ–Š       | 426/1545 [03:47<08:51,  2.11it/s]
 28%|β–ˆβ–ˆβ–Š       | 427/1545 [03:48<09:37,  1.94it/s]
 28%|β–ˆβ–ˆβ–Š       | 428/1545 [03:48<09:54,  1.88it/s]
 28%|β–ˆβ–ˆβ–Š       | 429/1545 [03:49<09:56,  1.87it/s]
 28%|β–ˆβ–ˆβ–Š       | 430/1545 [03:49<09:49,  1.89it/s]
                                                  
{'loss': 0.972, 'grad_norm': 2.5011104298755527e-12, 'learning_rate': 7.2168284789644015e-06, 'rewards/chosen': -48.82561492919922, 'rewards/rejected': -107.1943359375, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 58.36871337890625, 'logps/chosen': -632.4086303710938, 'logps/rejected': -1192.110107421875, 'logits/chosen': -3.8830108642578125, 'logits/rejected': -4.851853370666504, 'epoch': 0.28}

 28%|β–ˆβ–ˆβ–Š       | 431/1545 [03:50<10:13,  1.82it/s]
 28%|β–ˆβ–ˆβ–Š       | 432/1545 [03:51<10:21,  1.79it/s]
 28%|β–ˆβ–ˆβ–Š       | 433/1545 [03:51<09:41,  1.91it/s]
 28%|β–ˆβ–ˆβ–Š       | 434/1545 [03:52<10:04,  1.84it/s]
 28%|β–ˆβ–ˆβ–Š       | 435/1545 [03:52<10:13,  1.81it/s]
 28%|β–ˆβ–ˆβ–Š       | 436/1545 [03:53<09:14,  2.00it/s]
 28%|β–ˆβ–ˆβ–Š       | 437/1545 [03:53<08:51,  2.08it/s]
 28%|β–ˆβ–ˆβ–Š       | 438/1545 [03:53<08:17,  2.22it/s]
 28%|β–ˆβ–ˆβ–Š       | 439/1545 [03:54<08:56,  2.06it/s]
 28%|β–ˆβ–ˆβ–Š       | 440/1545 [03:54<08:16,  2.22it/s]
                                                  
{'loss': 0.0, 'grad_norm': 6.054962083888163e-22, 'learning_rate': 7.152103559870551e-06, 'rewards/chosen': -39.671104431152344, 'rewards/rejected': -89.18429565429688, 'rewards/accuracies': 1.0, 'rewards/margins': 49.51319122314453, 'logps/chosen': -521.9129028320312, 'logps/rejected': -983.9114990234375, 'logits/chosen': -3.958955764770508, 'logits/rejected': -6.404815673828125, 'epoch': 0.28}

 29%|β–ˆβ–ˆβ–Š       | 441/1545 [03:55<08:23,  2.19it/s]
 29%|β–ˆβ–ˆβ–Š       | 442/1545 [03:55<09:05,  2.02it/s]
 29%|β–ˆβ–ˆβ–Š       | 443/1545 [03:56<08:28,  2.17it/s]
 29%|β–ˆβ–ˆβ–Š       | 444/1545 [03:56<08:56,  2.05it/s]
 29%|β–ˆβ–ˆβ–‰       | 445/1545 [03:57<08:45,  2.09it/s]
 29%|β–ˆβ–ˆβ–‰       | 446/1545 [03:57<09:22,  1.95it/s]
 29%|β–ˆβ–ˆβ–‰       | 447/1545 [03:58<09:39,  1.90it/s]
 29%|β–ˆβ–ˆβ–‰       | 448/1545 [03:58<09:44,  1.88it/s]
 29%|β–ˆβ–ˆβ–‰       | 449/1545 [03:59<09:15,  1.97it/s]
 29%|β–ˆβ–ˆβ–‰       | 450/1545 [03:59<09:46,  1.87it/s]
                                                  
{'loss': 3.3081, 'grad_norm': 6.103515625e-05, 'learning_rate': 7.0873786407767e-06, 'rewards/chosen': -54.06614303588867, 'rewards/rejected': -93.74281311035156, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 39.676658630371094, 'logps/chosen': -693.2251586914062, 'logps/rejected': -1041.663818359375, 'logits/chosen': -3.7713088989257812, 'logits/rejected': -5.899226665496826, 'epoch': 0.29}

 29%|β–ˆβ–ˆβ–‰       | 451/1545 [04:00<10:07,  1.80it/s]
 29%|β–ˆβ–ˆβ–‰       | 452/1545 [04:01<09:35,  1.90it/s]
 29%|β–ˆβ–ˆβ–‰       | 453/1545 [04:01<09:48,  1.85it/s]
 29%|β–ˆβ–ˆβ–‰       | 454/1545 [04:02<09:59,  1.82it/s]
 29%|β–ˆβ–ˆβ–‰       | 455/1545 [04:02<10:01,  1.81it/s]
 30%|β–ˆβ–ˆβ–‰       | 456/1545 [04:03<09:10,  1.98it/s]
 30%|β–ˆβ–ˆβ–‰       | 457/1545 [04:03<08:34,  2.11it/s]
 30%|β–ˆβ–ˆβ–‰       | 458/1545 [04:04<09:04,  2.00it/s]
 30%|β–ˆβ–ˆβ–‰       | 459/1545 [04:04<09:22,  1.93it/s]
 30%|β–ˆβ–ˆβ–‰       | 460/1545 [04:05<08:52,  2.04it/s]
                                                  
{'loss': 2.1395, 'grad_norm': 4.376943252282217e-12, 'learning_rate': 7.022653721682848e-06, 'rewards/chosen': -43.0927619934082, 'rewards/rejected': -76.77229309082031, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 33.679527282714844, 'logps/chosen': -589.8778686523438, 'logps/rejected': -876.2001953125, 'logits/chosen': -3.6679959297180176, 'logits/rejected': -4.975624084472656, 'epoch': 0.3}

 30%|β–ˆβ–ˆβ–‰       | 461/1545 [04:05<09:21,  1.93it/s]
 30%|β–ˆβ–ˆβ–‰       | 462/1545 [04:06<09:31,  1.90it/s]
 30%|β–ˆβ–ˆβ–‰       | 463/1545 [04:06<09:36,  1.88it/s]
 30%|β–ˆβ–ˆβ–ˆ       | 464/1545 [04:07<09:31,  1.89it/s]
 30%|β–ˆβ–ˆβ–ˆ       | 465/1545 [04:07<09:55,  1.81it/s]
 30%|β–ˆβ–ˆβ–ˆ       | 466/1545 [04:08<10:06,  1.78it/s]
 30%|β–ˆβ–ˆβ–ˆ       | 467/1545 [04:08<09:24,  1.91it/s]
 30%|β–ˆβ–ˆβ–ˆ       | 468/1545 [04:09<09:52,  1.82it/s]
 30%|β–ˆβ–ˆβ–ˆ       | 469/1545 [04:10<10:04,  1.78it/s]
 30%|β–ˆβ–ˆβ–ˆ       | 470/1545 [04:10<09:50,  1.82it/s]
                                                  
{'loss': 0.0008, 'grad_norm': 0.0, 'learning_rate': 6.957928802588997e-06, 'rewards/chosen': -28.976566314697266, 'rewards/rejected': -84.09245300292969, 'rewards/accuracies': 1.0, 'rewards/margins': 55.11588668823242, 'logps/chosen': -440.51788330078125, 'logps/rejected': -952.6383056640625, 'logits/chosen': -2.842275619506836, 'logits/rejected': -4.066740989685059, 'epoch': 0.3}

 30%|β–ˆβ–ˆβ–ˆ       | 471/1545 [04:11<08:50,  2.02it/s]
 31%|β–ˆβ–ˆβ–ˆ       | 472/1545 [04:11<09:23,  1.90it/s]
 31%|β–ˆβ–ˆβ–ˆ       | 473/1545 [04:12<09:25,  1.90it/s]
 31%|β–ˆβ–ˆβ–ˆ       | 474/1545 [04:12<09:18,  1.92it/s]
 31%|β–ˆβ–ˆβ–ˆ       | 475/1545 [04:13<09:38,  1.85it/s]
 31%|β–ˆβ–ˆβ–ˆ       | 476/1545 [04:13<09:45,  1.83it/s]
 31%|β–ˆβ–ˆβ–ˆ       | 477/1545 [04:14<09:42,  1.83it/s]
 31%|β–ˆβ–ˆβ–ˆ       | 478/1545 [04:14<08:53,  2.00it/s]
 31%|β–ˆβ–ˆβ–ˆ       | 479/1545 [04:15<09:22,  1.90it/s]
 31%|β–ˆβ–ˆβ–ˆ       | 480/1545 [04:15<09:25,  1.88it/s]
                                                  
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 6.893203883495147e-06, 'rewards/chosen': -20.838396072387695, 'rewards/rejected': -92.45664978027344, 'rewards/accuracies': 1.0, 'rewards/margins': 71.61825561523438, 'logps/chosen': -362.0668029785156, 'logps/rejected': -1039.4049072265625, 'logits/chosen': -2.1511878967285156, 'logits/rejected': -4.547484874725342, 'epoch': 0.31}

 31%|β–ˆβ–ˆβ–ˆ       | 481/1545 [04:16<09:25,  1.88it/s]
 31%|β–ˆβ–ˆβ–ˆ       | 482/1545 [04:16<08:35,  2.06it/s]
 31%|β–ˆβ–ˆβ–ˆβ–      | 483/1545 [04:17<09:04,  1.95it/s]
 31%|β–ˆβ–ˆβ–ˆβ–      | 484/1545 [04:17<09:14,  1.91it/s]
 31%|β–ˆβ–ˆβ–ˆβ–      | 485/1545 [04:18<08:58,  1.97it/s]
 31%|β–ˆβ–ˆβ–ˆβ–      | 486/1545 [04:18<09:19,  1.89it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 487/1545 [04:19<09:37,  1.83it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 488/1545 [04:20<09:46,  1.80it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 489/1545 [04:20<09:03,  1.94it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 490/1545 [04:21<09:31,  1.85it/s]
                                                  
{'loss': 0.2412, 'grad_norm': 152.0, 'learning_rate': 6.828478964401295e-06, 'rewards/chosen': -30.904491424560547, 'rewards/rejected': -65.34832763671875, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 34.44383239746094, 'logps/chosen': -486.5516052246094, 'logps/rejected': -770.8928833007812, 'logits/chosen': -2.9180967807769775, 'logits/rejected': -4.752321720123291, 'epoch': 0.32}

 32%|β–ˆβ–ˆβ–ˆβ–      | 491/1545 [04:21<09:47,  1.79it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 492/1545 [04:22<09:18,  1.89it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 493/1545 [04:22<09:44,  1.80it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 494/1545 [04:23<09:47,  1.79it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 495/1545 [04:23<09:47,  1.79it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 496/1545 [04:24<09:05,  1.92it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 497/1545 [04:24<09:21,  1.86it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 498/1545 [04:25<09:39,  1.81it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 499/1545 [04:26<09:36,  1.81it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 500/1545 [04:26<09:58,  1.75it/s]
                                                  
{'loss': 1.302, 'grad_norm': 2.6072732907671432e-21, 'learning_rate': 6.763754045307444e-06, 'rewards/chosen': -14.026659965515137, 'rewards/rejected': -57.82221221923828, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 43.795555114746094, 'logps/chosen': -295.4187927246094, 'logps/rejected': -692.2587890625, 'logits/chosen': -1.8123325109481812, 'logits/rejected': -4.419227123260498, 'epoch': 0.32}

 32%|β–ˆβ–ˆβ–ˆβ–      | 501/1545 [04:27<10:10,  1.71it/s]
 32%|β–ˆβ–ˆβ–ˆβ–      | 502/1545 [04:27<09:58,  1.74it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 503/1545 [04:28<09:47,  1.77it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 504/1545 [04:29<10:25,  1.66it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 505/1545 [04:29<10:18,  1.68it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 506/1545 [04:30<09:34,  1.81it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 507/1545 [04:30<09:55,  1.74it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 508/1545 [04:31<09:58,  1.73it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 509/1545 [04:31<09:36,  1.80it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 510/1545 [04:32<09:42,  1.78it/s]
                                                  
{'loss': 0.7018, 'grad_norm': 1.9063008949160576e-09, 'learning_rate': 6.6990291262135935e-06, 'rewards/chosen': -18.401165008544922, 'rewards/rejected': -41.45631408691406, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 23.05514907836914, 'logps/chosen': -362.13482666015625, 'logps/rejected': -531.1427612304688, 'logits/chosen': -1.9105132818222046, 'logits/rejected': -3.823967695236206, 'epoch': 0.33}

 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 511/1545 [04:33<09:52,  1.75it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 512/1545 [04:33<09:40,  1.78it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 513/1545 [04:33<09:00,  1.91it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 514/1545 [04:34<09:16,  1.85it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 515/1545 [04:35<09:22,  1.83it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 516/1545 [04:35<09:13,  1.86it/s]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 517/1545 [04:36<09:30,  1.80it/s]
 34%|β–ˆβ–ˆβ–ˆβ–Ž      | 518/1545 [04:36<09:33,  1.79it/s]
 34%|β–ˆβ–ˆβ–ˆβ–Ž      | 519/1545 [04:37<09:31,  1.80it/s]
 34%|β–ˆβ–ˆβ–ˆβ–Ž      | 520/1545 [04:37<08:48,  1.94it/s]
                                                  
{'loss': 0.9755, 'grad_norm': 3.924811864397526e-17, 'learning_rate': 6.6343042071197415e-06, 'rewards/chosen': -7.66559362411499, 'rewards/rejected': -35.98261642456055, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 28.317020416259766, 'logps/chosen': -237.4998321533203, 'logps/rejected': -480.2190856933594, 'logits/chosen': -1.257673978805542, 'logits/rejected': -3.1603283882141113, 'epoch': 0.34}

 34%|β–ˆβ–ˆβ–ˆβ–Ž      | 521/1545 [04:38<09:07,  1.87it/s]
 34%|β–ˆβ–ˆβ–ˆβ–      | 522/1545 [04:38<09:11,  1.85it/s]
 34%|β–ˆβ–ˆβ–ˆβ–      | 523/1545 [04:39<09:10,  1.86it/s]
 34%|β–ˆβ–ˆβ–ˆβ–      | 524/1545 [04:39<09:03,  1.88it/s]
 34%|β–ˆβ–ˆβ–ˆβ–      | 525/1545 [04:40<09:23,  1.81it/s]
 34%|β–ˆβ–ˆβ–ˆβ–      | 526/1545 [04:41<13:50,  1.23it/s]
 34%|β–ˆβ–ˆβ–ˆβ–      | 527/1545 [04:42<11:38,  1.46it/s]
 34%|β–ˆβ–ˆβ–ˆβ–      | 528/1545 [04:42<10:59,  1.54it/s]
 34%|β–ˆβ–ˆβ–ˆβ–      | 529/1545 [04:43<10:37,  1.59it/s]
 34%|β–ˆβ–ˆβ–ˆβ–      | 530/1545 [04:44<10:09,  1.67it/s]
                                                  
{'loss': 0.4228, 'grad_norm': 0.0, 'learning_rate': 6.56957928802589e-06, 'rewards/chosen': -30.713363647460938, 'rewards/rejected': -60.528602600097656, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 29.81524085998535, 'logps/chosen': -450.9524841308594, 'logps/rejected': -714.2261962890625, 'logits/chosen': -3.075618267059326, 'logits/rejected': -4.854428768157959, 'epoch': 0.34}

 34%|β–ˆβ–ˆβ–ˆβ–      | 531/1545 [04:44<09:39,  1.75it/s]
 34%|β–ˆβ–ˆβ–ˆβ–      | 532/1545 [04:45<09:42,  1.74it/s]
 34%|β–ˆβ–ˆβ–ˆβ–      | 533/1545 [04:45<09:37,  1.75it/s]
 35%|β–ˆβ–ˆβ–ˆβ–      | 534/1545 [04:46<08:54,  1.89it/s]
 35%|β–ˆβ–ˆβ–ˆβ–      | 535/1545 [04:46<09:14,  1.82it/s]
 35%|β–ˆβ–ˆβ–ˆβ–      | 536/1545 [04:47<09:17,  1.81it/s]
 35%|β–ˆβ–ˆβ–ˆβ–      | 537/1545 [04:47<09:14,  1.82it/s]
 35%|β–ˆβ–ˆβ–ˆβ–      | 538/1545 [04:48<09:06,  1.84it/s]
 35%|β–ˆβ–ˆβ–ˆβ–      | 539/1545 [04:48<09:18,  1.80it/s]
 35%|β–ˆβ–ˆβ–ˆβ–      | 540/1545 [04:49<09:20,  1.79it/s]
                                                  
{'loss': 0.723, 'grad_norm': 4.440892098500626e-15, 'learning_rate': 6.50485436893204e-06, 'rewards/chosen': -16.0915584564209, 'rewards/rejected': -53.74153518676758, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 37.64997482299805, 'logps/chosen': -319.9918518066406, 'logps/rejected': -653.3815307617188, 'logits/chosen': -1.6358880996704102, 'logits/rejected': -3.925060749053955, 'epoch': 0.35}

 35%|███▌      | 541/1545 [04:49<08:44,  1.91it/s]
 35%|███▌      | 542/1545 [04:50<09:08,  1.83it/s]
 35%|███▌      | 543/1545 [04:51<09:12,  1.81it/s]
 35%|███▌      | 544/1545 [04:51<09:09,  1.82it/s]
 35%|███▌      | 545/1545 [04:52<09:01,  1.85it/s]
 35%|███▌      | 546/1545 [04:52<09:09,  1.82it/s]
 35%|███▌      | 547/1545 [04:53<09:15,  1.80it/s]
 35%|███▌      | 548/1545 [04:53<08:38,  1.92it/s]
 36%|███▌      | 549/1545 [04:54<08:53,  1.87it/s]
 36%|███▌      | 550/1545 [04:54<08:57,  1.85it/s]
                                                  
{'loss': 0.0105, 'grad_norm': 3.9257486150745535e-13, 'learning_rate': 6.440129449838188e-06, 'rewards/chosen': -14.184236526489258, 'rewards/rejected': -45.90851593017578, 'rewards/accuracies': 1.0, 'rewards/margins': 31.72427749633789, 'logps/chosen': -261.56756591796875, 'logps/rejected': -570.337890625, 'logits/chosen': -1.7183490991592407, 'logits/rejected': -3.5872623920440674, 'epoch': 0.36}

 36%|███▌      | 551/1545 [04:55<09:07,  1.81it/s]
 36%|███▌      | 552/1545 [04:55<08:51,  1.87it/s]
 36%|███▌      | 553/1545 [04:56<09:04,  1.82it/s]
 36%|███▌      | 554/1545 [04:57<09:02,  1.83it/s]
 36%|███▌      | 555/1545 [04:57<08:34,  1.92it/s]
 36%|███▌      | 556/1545 [04:58<08:50,  1.86it/s]
 36%|███▌      | 557/1545 [04:58<08:55,  1.84it/s]
 36%|███▌      | 558/1545 [04:59<09:02,  1.82it/s]
 36%|███▌      | 559/1545 [04:59<08:16,  1.99it/s]
 36%|███▌      | 560/1545 [04:59<07:39,  2.15it/s]
                                                  
{'loss': 1.7896, 'grad_norm': 7.104873657226562e-05, 'learning_rate': 6.375404530744337e-06, 'rewards/chosen': -30.25199317932129, 'rewards/rejected': -56.004981994628906, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 25.75299072265625, 'logps/chosen': -435.36322021484375, 'logps/rejected': -665.7625732421875, 'logits/chosen': -1.7875016927719116, 'logits/rejected': -3.8524105548858643, 'epoch': 0.36}

 36%|███▌      | 560/1545 [05:00<07:39,  2.15it/s]
 36%|███▋      | 561/1545 [05:00<08:08,  2.01it/s]
 36%|███▋      | 562/1545 [05:01<08:29,  1.93it/s]
 36%|███▋      | 563/1545 [05:01<07:58,  2.05it/s]
 37%|███▋      | 564/1545 [05:02<08:22,  1.95it/s]
 37%|███▋      | 565/1545 [05:02<08:33,  1.91it/s]
 37%|███▋      | 566/1545 [05:03<08:37,  1.89it/s]
 37%|███▋      | 567/1545 [05:03<08:25,  1.94it/s]
 37%|███▋      | 568/1545 [05:04<08:39,  1.88it/s]
 37%|███▋      | 569/1545 [05:04<08:46,  1.85it/s]
 37%|███▋      | 570/1545 [05:05<08:16,  1.96it/s]
                                                  
{'loss': 0.0057, 'grad_norm': 1.0302869668521453e-12, 'learning_rate': 6.310679611650487e-06, 'rewards/chosen': -9.948836326599121, 'rewards/rejected': -50.712440490722656, 'rewards/accuracies': 1.0, 'rewards/margins': 40.76360321044922, 'logps/chosen': -243.68978881835938, 'logps/rejected': -616.48486328125, 'logits/chosen': -1.0023655891418457, 'logits/rejected': -3.4231104850769043, 'epoch': 0.37}

 37%|███▋      | 571/1545 [05:05<08:46,  1.85it/s]
 37%|███▋      | 572/1545 [05:06<08:48,  1.84it/s]
 37%|███▋      | 573/1545 [05:06<08:45,  1.85it/s]
 37%|███▋      | 574/1545 [05:07<08:32,  1.90it/s]
 37%|███▋      | 575/1545 [05:08<08:43,  1.85it/s]
 37%|███▋      | 576/1545 [05:08<08:45,  1.84it/s]
 37%|███▋      | 577/1545 [05:09<08:19,  1.94it/s]
 37%|███▋      | 578/1545 [05:09<08:36,  1.87it/s]
 37%|███▋      | 579/1545 [05:10<08:48,  1.83it/s]
 38%|███▊      | 580/1545 [05:10<08:46,  1.83it/s]
                                                  
{'loss': 0.0143, 'grad_norm': 0.0, 'learning_rate': 6.245954692556635e-06, 'rewards/chosen': -18.19172477722168, 'rewards/rejected': -61.596435546875, 'rewards/accuracies': 1.0, 'rewards/margins': 43.40471267700195, 'logps/chosen': -331.9469909667969, 'logps/rejected': -743.6862182617188, 'logits/chosen': -1.4042354822158813, 'logits/rejected': -3.8643798828125, 'epoch': 0.38}

 38%|███▊      | 581/1545 [05:11<08:20,  1.93it/s]
 38%|███▊      | 582/1545 [05:11<08:42,  1.84it/s]
 38%|███▊      | 583/1545 [05:12<08:53,  1.80it/s]
 38%|███▊      | 584/1545 [05:12<08:40,  1.85it/s]
 38%|███▊      | 585/1545 [05:13<08:53,  1.80it/s]
 38%|███▊      | 586/1545 [05:14<09:09,  1.74it/s]
 38%|███▊      | 587/1545 [05:14<09:00,  1.77it/s]
 38%|███▊      | 588/1545 [05:15<08:39,  1.84it/s]
 38%|███▊      | 589/1545 [05:15<08:46,  1.82it/s]
 38%|███▊      | 590/1545 [05:16<08:40,  1.83it/s]
                                                  
{'loss': 0.0179, 'grad_norm': 77.5, 'learning_rate': 6.181229773462784e-06, 'rewards/chosen': -18.66933822631836, 'rewards/rejected': -50.0496826171875, 'rewards/accuracies': 1.0, 'rewards/margins': 31.380340576171875, 'logps/chosen': -321.9429626464844, 'logps/rejected': -626.2196655273438, 'logits/chosen': -1.854098916053772, 'logits/rejected': -3.260258436203003, 'epoch': 0.38}

 38%|███▊      | 591/1545 [05:16<08:12,  1.94it/s]
 38%|███▊      | 592/1545 [05:17<08:37,  1.84it/s]
 38%|███▊      | 593/1545 [05:17<08:51,  1.79it/s]
 38%|███▊      | 594/1545 [05:18<07:55,  2.00it/s]
 39%|███▊      | 595/1545 [05:18<07:44,  2.05it/s]
 39%|███▊      | 596/1545 [05:19<08:24,  1.88it/s]
 39%|███▊      | 597/1545 [05:19<08:37,  1.83it/s]
 39%|███▊      | 598/1545 [05:20<07:48,  2.02it/s]
 39%|███▉      | 599/1545 [05:20<07:27,  2.11it/s]
 39%|███▉      | 600/1545 [05:21<07:53,  2.00it/s]
                                                  
{'loss': 0.0001, 'grad_norm': 0.5234375, 'learning_rate': 6.116504854368932e-06, 'rewards/chosen': -20.77777099609375, 'rewards/rejected': -52.45641326904297, 'rewards/accuracies': 1.0, 'rewards/margins': 31.678646087646484, 'logps/chosen': -362.299072265625, 'logps/rejected': -631.9798583984375, 'logits/chosen': -1.6714366674423218, 'logits/rejected': -3.967179775238037, 'epoch': 0.39}

 39%|███▉      | 601/1545 [05:21<08:15,  1.90it/s]
 39%|███▉      | 602/1545 [05:22<08:17,  1.90it/s]
 39%|███▉      | 603/1545 [05:22<08:08,  1.93it/s]
 39%|███▉      | 604/1545 [05:23<08:23,  1.87it/s]
 39%|███▉      | 605/1545 [05:24<08:28,  1.85it/s]
 39%|███▉      | 606/1545 [05:24<08:07,  1.92it/s]
 39%|███▉      | 607/1545 [05:24<07:28,  2.09it/s]
 39%|███▉      | 608/1545 [05:25<07:53,  1.98it/s]
 39%|███▉      | 609/1545 [05:25<08:07,  1.92it/s]
 39%|███▉      | 610/1545 [05:26<07:55,  1.97it/s]
                                                  
{'loss': 0.0579, 'grad_norm': 0.0, 'learning_rate': 6.0517799352750815e-06, 'rewards/chosen': -18.156063079833984, 'rewards/rejected': -48.84654998779297, 'rewards/accuracies': 1.0, 'rewards/margins': 30.690486907958984, 'logps/chosen': -305.2608947753906, 'logps/rejected': -596.3084716796875, 'logits/chosen': -1.5014350414276123, 'logits/rejected': -3.3216071128845215, 'epoch': 0.39}

 40%|███▉      | 611/1545 [05:27<08:20,  1.86it/s]
 40%|███▉      | 612/1545 [05:27<07:32,  2.06it/s]
 40%|███▉      | 613/1545 [05:27<07:55,  1.96it/s]
 40%|███▉      | 614/1545 [05:28<07:28,  2.08it/s]
 40%|███▉      | 615/1545 [05:28<07:56,  1.95it/s]
 40%|███▉      | 616/1545 [05:29<08:07,  1.91it/s]
 40%|███▉      | 617/1545 [05:30<08:09,  1.90it/s]
 40%|████      | 618/1545 [05:30<08:01,  1.92it/s]
 40%|████      | 619/1545 [05:30<07:22,  2.09it/s]
 40%|████      | 620/1545 [05:31<07:43,  1.99it/s]
                                                  
{'loss': 0.0011, 'grad_norm': 12.3125, 'learning_rate': 5.9870550161812304e-06, 'rewards/chosen': -20.277729034423828, 'rewards/rejected': -44.97526931762695, 'rewards/accuracies': 1.0, 'rewards/margins': 24.69754409790039, 'logps/chosen': -371.6828308105469, 'logps/rejected': -576.7119750976562, 'logits/chosen': -1.2356998920440674, 'logits/rejected': -2.410062074661255, 'epoch': 0.4}

 40%|████      | 621/1545 [05:32<07:59,  1.93it/s]
 40%|████      | 622/1545 [05:32<08:12,  1.88it/s]
 40%|████      | 623/1545 [05:33<08:25,  1.82it/s]
 40%|████      | 624/1545 [05:33<08:24,  1.83it/s]
 40%|████      | 625/1545 [05:34<07:55,  1.94it/s]
 41%|████      | 626/1545 [05:34<08:20,  1.84it/s]
 41%|████      | 627/1545 [05:35<08:27,  1.81it/s]
 41%|████      | 628/1545 [05:35<08:07,  1.88it/s]
 41%|████      | 629/1545 [05:36<08:24,  1.81it/s]
 41%|████      | 630/1545 [05:37<08:29,  1.80it/s]
                                                  
{'loss': 0.0001, 'grad_norm': 6.606569513678551e-09, 'learning_rate': 5.9223300970873785e-06, 'rewards/chosen': -22.200729370117188, 'rewards/rejected': -52.896087646484375, 'rewards/accuracies': 1.0, 'rewards/margins': 30.695358276367188, 'logps/chosen': -359.304931640625, 'logps/rejected': -632.1669921875, 'logits/chosen': -2.2159793376922607, 'logits/rejected': -4.487706184387207, 'epoch': 0.41}

 41%|████      | 631/1545 [05:37<08:32,  1.78it/s]
 41%|████      | 632/1545 [05:38<07:54,  1.92it/s]
 41%|████      | 633/1545 [05:38<08:11,  1.86it/s]
 41%|████      | 634/1545 [05:39<08:16,  1.83it/s]
 41%|████      | 635/1545 [05:39<08:01,  1.89it/s]
 41%|████      | 636/1545 [05:40<07:07,  2.13it/s]
 41%|████      | 637/1545 [05:40<07:35,  1.99it/s]
 41%|████▏     | 638/1545 [05:41<07:53,  1.92it/s]
 41%|████▏     | 639/1545 [05:41<07:36,  1.98it/s]
 41%|████▏     | 640/1545 [05:42<07:54,  1.91it/s]
                                                  
{'loss': 0.0756, 'grad_norm': 2.8731357570865868e-18, 'learning_rate': 5.857605177993528e-06, 'rewards/chosen': -37.544307708740234, 'rewards/rejected': -75.83995056152344, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 38.295654296875, 'logps/chosen': -509.95721435546875, 'logps/rejected': -876.0435791015625, 'logits/chosen': -3.4378981590270996, 'logits/rejected': -4.655713081359863, 'epoch': 0.41}

 41%|████▏     | 641/1545 [05:43<11:29,  1.31it/s]
 42%|████▏     | 642/1545 [05:44<10:22,  1.45it/s]
 42%|████▏     | 643/1545 [05:44<09:02,  1.66it/s]
 42%|████▏     | 644/1545 [05:45<08:56,  1.68it/s]
 42%|████▏     | 645/1545 [05:45<08:41,  1.73it/s]
 42%|████▏     | 646/1545 [05:46<08:27,  1.77it/s]
 42%|████▏     | 647/1545 [05:46<08:11,  1.83it/s]
 42%|████▏     | 648/1545 [05:47<08:16,  1.81it/s]
 42%|████▏     | 649/1545 [05:47<08:16,  1.81it/s]
 42%|████▏     | 650/1545 [05:48<07:40,  1.94it/s]
                                                  
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 5.792880258899677e-06, 'rewards/chosen': -17.858400344848633, 'rewards/rejected': -68.02870178222656, 'rewards/accuracies': 1.0, 'rewards/margins': 50.1702995300293, 'logps/chosen': -305.210205078125, 'logps/rejected': -778.7840576171875, 'logits/chosen': -2.0918753147125244, 'logits/rejected': -5.265947341918945, 'epoch': 0.42}

 42%|████▏     | 651/1545 [05:48<07:57,  1.87it/s]
 42%|████▏     | 652/1545 [05:49<08:05,  1.84it/s]
 42%|████▏     | 653/1545 [05:49<07:19,  2.03it/s]
 42%|████▏     | 654/1545 [05:50<07:01,  2.11it/s]
 42%|████▏     | 655/1545 [05:50<07:26,  1.99it/s]
 42%|████▏     | 656/1545 [05:51<07:46,  1.91it/s]
 43%|████▎     | 657/1545 [05:51<08:01,  1.84it/s]
 43%|████▎     | 658/1545 [05:52<07:45,  1.91it/s]
 43%|████▎     | 659/1545 [05:52<07:58,  1.85it/s]
 43%|████▎     | 660/1545 [05:53<07:58,  1.85it/s]
                                                  
{'loss': 2.3511, 'grad_norm': 0.0, 'learning_rate': 5.728155339805825e-06, 'rewards/chosen': -37.305686950683594, 'rewards/rejected': -80.12673950195312, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 42.821044921875, 'logps/chosen': -520.0347290039062, 'logps/rejected': -906.5696411132812, 'logits/chosen': -3.2962348461151123, 'logits/rejected': -6.164005279541016, 'epoch': 0.43}

 43%|████▎     | 661/1545 [05:53<07:34,  1.95it/s]
 43%|████▎     | 662/1545 [05:54<07:50,  1.88it/s]
 43%|████▎     | 663/1545 [05:55<08:01,  1.83it/s]
 43%|████▎     | 664/1545 [05:55<08:02,  1.83it/s]
 43%|████▎     | 665/1545 [05:55<07:23,  1.98it/s]
 43%|████▎     | 666/1545 [05:56<07:41,  1.90it/s]
 43%|████▎     | 667/1545 [05:56<06:58,  2.10it/s]
 43%|████▎     | 668/1545 [05:57<07:18,  2.00it/s]
 43%|████▎     | 669/1545 [05:57<06:56,  2.10it/s]
 43%|████▎     | 670/1545 [05:58<07:20,  1.99it/s]
                                                  
{'loss': 0.0, 'grad_norm': 0.00046539306640625, 'learning_rate': 5.663430420711975e-06, 'rewards/chosen': -21.083389282226562, 'rewards/rejected': -60.96876907348633, 'rewards/accuracies': 1.0, 'rewards/margins': 39.885379791259766, 'logps/chosen': -365.4623107910156, 'logps/rejected': -721.8714599609375, 'logits/chosen': -2.13993501663208, 'logits/rejected': -5.14273738861084, 'epoch': 0.43}

 43%|████▎     | 671/1545 [05:59<07:44,  1.88it/s]
 43%|████▎     | 672/1545 [05:59<07:41,  1.89it/s]
 44%|████▎     | 673/1545 [06:00<07:37,  1.91it/s]
 44%|████▎     | 674/1545 [06:00<07:50,  1.85it/s]
 44%|████▎     | 675/1545 [06:01<07:52,  1.84it/s]
 44%|████▍     | 676/1545 [06:01<07:21,  1.97it/s]
 44%|████▍     | 677/1545 [06:02<07:38,  1.89it/s]
 44%|████▍     | 678/1545 [06:02<07:46,  1.86it/s]
 44%|████▍     | 679/1545 [06:03<07:39,  1.88it/s]
 44%|████▍     | 680/1545 [06:03<07:04,  2.04it/s]
                                                  
{'loss': 0.001, 'grad_norm': 9.492850949754938e-12, 'learning_rate': 5.598705501618124e-06, 'rewards/chosen': -25.24799346923828, 'rewards/rejected': -60.609230041503906, 'rewards/accuracies': 1.0, 'rewards/margins': 35.361228942871094, 'logps/chosen': -402.2320251464844, 'logps/rejected': -727.8626708984375, 'logits/chosen': -2.94899845123291, 'logits/rejected': -5.123431205749512, 'epoch': 0.44}

 44%|████▍     | 681/1545 [06:04<07:28,  1.93it/s]
 44%|████▍     | 682/1545 [06:04<07:33,  1.90it/s]
 44%|████▍     | 683/1545 [06:05<07:32,  1.91it/s]
 44%|████▍     | 684/1545 [06:05<06:27,  2.22it/s]
 44%|████▍     | 685/1545 [06:06<07:03,  2.03it/s]
 44%|████▍     | 686/1545 [06:06<06:33,  2.19it/s]
 44%|████▍     | 687/1545 [06:07<06:58,  2.05it/s]
 45%|████▍     | 688/1545 [06:07<06:43,  2.12it/s]
 45%|████▍     | 689/1545 [06:08<07:11,  1.98it/s]
 45%|████▍     | 690/1545 [06:08<07:25,  1.92it/s]
                                                  
{'loss': 0.2533, 'grad_norm': 4.76837158203125e-07, 'learning_rate': 5.533980582524272e-06, 'rewards/chosen': -26.505752563476562, 'rewards/rejected': -50.40888595581055, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 23.90313720703125, 'logps/chosen': -408.7076721191406, 'logps/rejected': -599.2103881835938, 'logits/chosen': -2.7856857776641846, 'logits/rejected': -5.5106892585754395, 'epoch': 0.45}

 45%|████▍     | 691/1545 [06:09<07:36,  1.87it/s]
 45%|████▍     | 692/1545 [06:09<06:37,  2.15it/s]
 45%|████▍     | 693/1545 [06:09<05:52,  2.42it/s]
 45%|████▍     | 694/1545 [06:10<06:28,  2.19it/s]
 45%|████▍     | 695/1545 [06:10<06:50,  2.07it/s]
 45%|████▌     | 696/1545 [06:11<06:01,  2.35it/s]
 45%|████▌     | 697/1545 [06:11<05:24,  2.61it/s]
 45%|████▌     | 698/1545 [06:11<05:33,  2.54it/s]
 45%|████▌     | 699/1545 [06:12<05:48,  2.43it/s]
 45%|████▌     | 700/1545 [06:12<05:51,  2.40it/s]
                                                  
{'loss': 0.3027, 'grad_norm': 0.0, 'learning_rate': 5.4692556634304216e-06, 'rewards/chosen': -22.45337677001953, 'rewards/rejected': -68.20580291748047, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 45.752418518066406, 'logps/chosen': -372.1856384277344, 'logps/rejected': -798.4400634765625, 'logits/chosen': -2.2510809898376465, 'logits/rejected': -5.148565292358398, 'epoch': 0.45}

 45%|████▌     | 701/1545 [06:13<06:09,  2.28it/s]
 45%|████▌     | 702/1545 [06:13<06:19,  2.22it/s]
 46%|████▌     | 703/1545 [06:14<06:16,  2.24it/s]
 46%|████▌     | 704/1545 [06:14<06:43,  2.08it/s]
 46%|████▌     | 705/1545 [06:15<07:11,  1.95it/s]
 46%|████▌     | 706/1545 [06:15<06:52,  2.03it/s]
 46%|████▌     | 707/1545 [06:16<07:00,  1.99it/s]
 46%|████▌     | 708/1545 [06:16<06:29,  2.15it/s]
 46%|████▌     | 709/1545 [06:17<06:29,  2.15it/s]
 46%|████▌     | 710/1545 [06:17<06:47,  2.05it/s]
                                                  
{'loss': 0.5746, 'grad_norm': 0.0, 'learning_rate': 5.4045307443365705e-06, 'rewards/chosen': -27.447168350219727, 'rewards/rejected': -84.84578704833984, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 57.39862060546875, 'logps/chosen': -426.19427490234375, 'logps/rejected': -969.3411865234375, 'logits/chosen': -2.232661724090576, 'logits/rejected': -4.795763969421387, 'epoch': 0.46}

 46%|████▌     | 711/1545 [06:18<07:15,  1.92it/s]
 46%|████▌     | 712/1545 [06:18<07:04,  1.96it/s]
 46%|████▌     | 713/1545 [06:19<07:04,  1.96it/s]
 46%|████▌     | 714/1545 [06:19<07:20,  1.89it/s]
 46%|████▋     | 715/1545 [06:20<07:16,  1.90it/s]
 46%|████▋     | 716/1545 [06:20<07:07,  1.94it/s]
 46%|████▋     | 717/1545 [06:21<07:12,  1.91it/s]
 46%|████▋     | 718/1545 [06:22<07:29,  1.84it/s]
 47%|████▋     | 719/1545 [06:22<07:15,  1.90it/s]
 47%|████▋     | 720/1545 [06:23<07:19,  1.88it/s]
                                                  
{'loss': 0.003, 'grad_norm': 0.0, 'learning_rate': 5.3398058252427185e-06, 'rewards/chosen': -38.11956787109375, 'rewards/rejected': -97.89371490478516, 'rewards/accuracies': 1.0, 'rewards/margins': 59.774139404296875, 'logps/chosen': -570.1392822265625, 'logps/rejected': -1108.6182861328125, 'logits/chosen': -2.4496023654937744, 'logits/rejected': -5.000131607055664, 'epoch': 0.47}

 47%|████▋     | 721/1545 [06:23<07:25,  1.85it/s]
 47%|████▋     | 722/1545 [06:24<07:37,  1.80it/s]
 47%|████▋     | 723/1545 [06:24<07:19,  1.87it/s]
 47%|████▋     | 724/1545 [06:25<07:21,  1.86it/s]
 47%|████▋     | 725/1545 [06:25<07:28,  1.83it/s]
 47%|████▋     | 726/1545 [06:26<07:26,  1.83it/s]
 47%|████▋     | 727/1545 [06:26<07:04,  1.93it/s]
 47%|████▋     | 728/1545 [06:27<07:10,  1.90it/s]
 47%|████▋     | 729/1545 [06:27<07:22,  1.84it/s]
 47%|████▋     | 730/1545 [06:28<07:00,  1.94it/s]
                                                  
{'loss': 1.7073, 'grad_norm': 2.3418766925686896e-16, 'learning_rate': 5.275080906148867e-06, 'rewards/chosen': -44.976646423339844, 'rewards/rejected': -86.87692260742188, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 41.9002799987793, 'logps/chosen': -594.9097290039062, 'logps/rejected': -975.1007690429688, 'logits/chosen': -3.272984027862549, 'logits/rejected': -5.4031596183776855, 'epoch': 0.47}

 47%|████▋     | 731/1545 [06:28<07:05,  1.91it/s]
 47%|████▋     | 732/1545 [06:29<07:16,  1.86it/s]
 47%|████▋     | 733/1545 [06:29<06:59,  1.93it/s]
 48%|████▊     | 734/1545 [06:30<07:06,  1.90it/s]
 48%|████▊     | 735/1545 [06:31<07:10,  1.88it/s]
 48%|████▊     | 736/1545 [06:31<07:15,  1.86it/s]
 48%|████▊     | 737/1545 [06:32<06:55,  1.94it/s]
 48%|████▊     | 738/1545 [06:32<07:00,  1.92it/s]
 48%|████▊     | 739/1545 [06:33<07:10,  1.87it/s]
 48%|████▊     | 740/1545 [06:33<06:53,  1.95it/s]
                                                  
{'loss': 0.0581, 'grad_norm': 30.5, 'learning_rate': 5.210355987055017e-06, 'rewards/chosen': -43.055625915527344, 'rewards/rejected': -78.43653106689453, 'rewards/accuracies': 1.0, 'rewards/margins': 35.38090133666992, 'logps/chosen': -582.3439331054688, 'logps/rejected': -901.49658203125, 'logits/chosen': -3.8236382007598877, 'logits/rejected': -6.230503082275391, 'epoch': 0.48}

 48%|████▊     | 741/1545 [06:34<06:56,  1.93it/s]
 48%|████▊     | 742/1545 [06:34<07:07,  1.88it/s]
 48%|████▊     | 743/1545 [06:35<06:55,  1.93it/s]
 48%|████▊     | 744/1545 [06:35<06:46,  1.97it/s]
 48%|████▊     | 745/1545 [06:36<06:56,  1.92it/s]
 48%|████▊     | 746/1545 [06:36<07:07,  1.87it/s]
 48%|████▊     | 747/1545 [06:37<06:50,  1.94it/s]
 48%|████▊     | 748/1545 [06:37<06:56,  1.91it/s]
 48%|████▊     | 749/1545 [06:38<07:07,  1.86it/s]
 49%|████▊     | 750/1545 [06:38<06:46,  1.95it/s]
                                                  
{'loss': 3.8903, 'grad_norm': 0.0, 'learning_rate': 5.145631067961165e-06, 'rewards/chosen': -22.969589233398438, 'rewards/rejected': -51.09120559692383, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 28.121618270874023, 'logps/chosen': -399.1053466796875, 'logps/rejected': -617.9463500976562, 'logits/chosen': -1.9127031564712524, 'logits/rejected': -4.489147186279297, 'epoch': 0.49}

 49%|████▊     | 751/1545 [06:39<06:49,  1.94it/s]
 49%|████▊     | 752/1545 [06:39<07:01,  1.88it/s]
 49%|████▊     | 753/1545 [06:40<06:05,  2.17it/s]
 49%|████▉     | 754/1545 [06:40<05:17,  2.49it/s]
 49%|████▉     | 755/1545 [06:40<04:51,  2.71it/s]
 49%|████▉     | 756/1545 [06:41<04:25,  2.97it/s]
 49%|████▉     | 757/1545 [06:41<04:07,  3.19it/s]
 49%|████▉     | 758/1545 [06:41<03:55,  3.35it/s]
 49%|████▉     | 759/1545 [06:41<03:33,  3.67it/s]
 49%|████▉     | 760/1545 [06:42<03:30,  3.72it/s]
                                                  
{'loss': 0.0001, 'grad_norm': 0.0, 'learning_rate': 5.080906148867314e-06, 'rewards/chosen': -20.00014877319336, 'rewards/rejected': -70.9677505493164, 'rewards/accuracies': 1.0, 'rewards/margins': 50.96759796142578, 'logps/chosen': -339.2450256347656, 'logps/rejected': -807.701416015625, 'logits/chosen': -2.3724873065948486, 'logits/rejected': -5.234989166259766, 'epoch': 0.49}

 49%|████▉     | 761/1545 [06:42<03:30,  3.73it/s]
 49%|████▉     | 762/1545 [06:42<03:29,  3.73it/s]
 49%|████▉     | 763/1545 [06:42<03:29,  3.73it/s]
 49%|████▉     | 764/1545 [06:43<03:31,  3.69it/s]
 50%|████▉     | 765/1545 [06:43<03:20,  3.90it/s]
 50%|████▉     | 766/1545 [06:43<03:25,  3.78it/s]
 50%|████▉     | 767/1545 [06:44<06:59,  1.86it/s]
 50%|████▉     | 768/1545 [06:45<05:56,  2.18it/s]
 50%|████▉     | 769/1545 [06:45<05:12,  2.48it/s]
 50%|████▉     | 770/1545 [06:45<04:41,  2.75it/s]
                                                  
{'loss': 1.0551, 'grad_norm': 6.261267546103788e-18, 'learning_rate': 5.016181229773464e-06, 'rewards/chosen': -21.470638275146484, 'rewards/rejected': -69.12281036376953, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 47.65216827392578, 'logps/chosen': -400.18658447265625, 'logps/rejected': -822.2199096679688, 'logits/chosen': -2.1039681434631348, 'logits/rejected': -4.775751113891602, 'epoch': 0.5}

 50%|████▉     | 771/1545 [06:45<04:21,  2.96it/s]
 50%|████▉     | 772/1545 [06:46<04:09,  3.10it/s]
 50%|█████     | 773/1545 [06:46<03:46,  3.41it/s]
 50%|█████     | 774/1545 [06:46<03:41,  3.48it/s]
 50%|█████     | 775/1545 [06:46<03:37,  3.55it/s]
 50%|█████     | 776/1545 [06:47<03:40,  3.48it/s]
 50%|█████     | 777/1545 [06:47<03:46,  3.40it/s]
 50%|█████     | 778/1545 [06:47<03:30,  3.64it/s]
 50%|█████     | 779/1545 [06:48<03:31,  3.61it/s]
 50%|█████     | 780/1545 [06:48<03:33,  3.59it/s]
                                                  
{'loss': 0.0001, 'grad_norm': 0.0, 'learning_rate': 4.951456310679612e-06, 'rewards/chosen': -25.87270736694336, 'rewards/rejected': -69.1749267578125, 'rewards/accuracies': 1.0, 'rewards/margins': 43.302223205566406, 'logps/chosen': -421.61248779296875, 'logps/rejected': -803.4362182617188, 'logits/chosen': -2.607177972793579, 'logits/rejected': -4.537522315979004, 'epoch': 0.5}

 51%|█████     | 781/1545 [06:48<03:56,  3.23it/s]
 51%|█████     | 782/1545 [06:49<04:22,  2.90it/s]
 51%|█████     | 783/1545 [06:49<04:42,  2.70it/s]
 51%|█████     | 784/1545 [06:49<04:20,  2.92it/s]
 51%|█████     | 785/1545 [06:50<04:43,  2.68it/s]
 51%|█████     | 786/1545 [06:50<04:58,  2.54it/s]
 51%|█████     | 787/1545 [06:51<05:15,  2.40it/s]
 51%|█████     | 788/1545 [06:51<06:04,  2.08it/s]
 51%|█████     | 789/1545 [06:52<06:22,  1.97it/s]
 51%|█████     | 790/1545 [06:52<06:27,  1.95it/s]
                                                  
{'loss': 7.3323, 'grad_norm': 0.2470703125, 'learning_rate': 4.886731391585761e-06, 'rewards/chosen': -44.284454345703125, 'rewards/rejected': -73.53287506103516, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 29.2484188079834, 'logps/chosen': -594.3431396484375, 'logps/rejected': -851.0657348632812, 'logits/chosen': -2.5097155570983887, 'logits/rejected': -3.9307990074157715, 'epoch': 0.51}

 51%|█████     | 790/1545 [06:53<06:27,  1.95it/s]
 51%|█████     | 791/1545 [06:53<06:25,  1.96it/s]
 51%|█████▏    | 792/1545 [06:54<06:44,  1.86it/s]
 51%|█████▏    | 793/1545 [06:54<06:49,  1.83it/s]
 51%|█████▏    | 794/1545 [06:55<06:30,  1.92it/s]
 51%|█████▏    | 795/1545 [06:55<06:44,  1.86it/s]
 52%|█████▏    | 796/1545 [06:56<06:45,  1.85it/s]
 52%|█████▏    | 797/1545 [06:56<06:45,  1.85it/s]
 52%|█████▏    | 798/1545 [06:57<06:24,  1.94it/s]
 52%|█████▏    | 799/1545 [06:57<06:37,  1.88it/s]
 52%|█████▏    | 800/1545 [06:58<06:41,  1.86it/s]
                                                  
{'loss': 0.0002, 'grad_norm': 1.4375, 'learning_rate': 4.82200647249191e-06, 'rewards/chosen': -13.080400466918945, 'rewards/rejected': -36.216304779052734, 'rewards/accuracies': 1.0, 'rewards/margins': 23.135906219482422, 'logps/chosen': -296.13519287109375, 'logps/rejected': -475.113037109375, 'logits/chosen': -1.1140010356903076, 'logits/rejected': -2.2951102256774902, 'epoch': 0.52}

 52%|█████▏    | 801/1545 [06:58<06:30,  1.91it/s]
 52%|█████▏    | 802/1545 [06:59<06:42,  1.84it/s]
 52%|█████▏    | 803/1545 [07:00<06:47,  1.82it/s]
 52%|█████▏    | 804/1545 [07:00<06:51,  1.80it/s]
 52%|█████▏    | 805/1545 [07:01<06:23,  1.93it/s]
 52%|█████▏    | 806/1545 [07:01<06:40,  1.85it/s]
 52%|█████▏    | 807/1545 [07:02<06:49,  1.80it/s]
 52%|█████▏    | 808/1545 [07:02<06:32,  1.88it/s]
 52%|█████▏    | 809/1545 [07:03<05:57,  2.06it/s]
 52%|█████▏    | 810/1545 [07:03<06:18,  1.94it/s]
                                                  
{'loss': 0.0703, 'grad_norm': 9.38598532229662e-10, 'learning_rate': 4.7572815533980585e-06, 'rewards/chosen': -24.983036041259766, 'rewards/rejected': -45.387813568115234, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 20.404781341552734, 'logps/chosen': -387.3971252441406, 'logps/rejected': -552.9581909179688, 'logits/chosen': -2.1105751991271973, 'logits/rejected': -3.3740882873535156, 'epoch': 0.52}

 52%|█████▏    | 811/1545 [07:04<06:21,  1.92it/s]
 53%|█████▎    | 812/1545 [07:04<06:06,  2.00it/s]
 53%|█████▎    | 813/1545 [07:05<06:30,  1.88it/s]
 53%|█████▎    | 814/1545 [07:05<06:40,  1.83it/s]
 53%|█████▎    | 815/1545 [07:06<06:39,  1.83it/s]
 53%|█████▎    | 816/1545 [07:06<06:08,  1.98it/s]
 53%|█████▎    | 817/1545 [07:07<06:27,  1.88it/s]
 53%|█████▎    | 818/1545 [07:07<06:34,  1.85it/s]
 53%|█████▎    | 819/1545 [07:08<06:28,  1.87it/s]
 53%|█████▎    | 820/1545 [07:09<06:32,  1.85it/s]
                                                  
{'loss': 1.553, 'grad_norm': 0.0, 'learning_rate': 4.6925566343042074e-06, 'rewards/chosen': -23.155715942382812, 'rewards/rejected': -62.099571228027344, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 38.94385528564453, 'logps/chosen': -375.70166015625, 'logps/rejected': -735.560791015625, 'logits/chosen': -1.7601606845855713, 'logits/rejected': -3.79761004447937, 'epoch': 0.53}

 53%|█████▎    | 821/1545 [07:09<06:43,  1.79it/s]
 53%|█████▎    | 822/1545 [07:10<06:47,  1.77it/s]
 53%|█████▎    | 823/1545 [07:10<06:20,  1.90it/s]
 53%|█████▎    | 824/1545 [07:11<06:30,  1.85it/s]
 53%|█████▎    | 825/1545 [07:11<05:53,  2.04it/s]
 53%|█████▎    | 826/1545 [07:12<06:06,  1.96it/s]
 54%|█████▎    | 827/1545 [07:12<05:45,  2.08it/s]
 54%|█████▎    | 828/1545 [07:13<06:05,  1.96it/s]
 54%|█████▎    | 829/1545 [07:13<06:12,  1.92it/s]
 54%|█████▎    | 830/1545 [07:14<06:14,  1.91it/s]
                                                  
{'loss': 0.0695, 'grad_norm': 2.703125, 'learning_rate': 4.627831715210356e-06, 'rewards/chosen': -38.55735397338867, 'rewards/rejected': -69.79898834228516, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 31.241634368896484, 'logps/chosen': -564.8526611328125, 'logps/rejected': -828.9482421875, 'logits/chosen': -2.5433785915374756, 'logits/rejected': -3.9132227897644043, 'epoch': 0.54}

 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 831/1545 [07:14<06:17,  1.89it/s]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 832/1545 [07:15<06:26,  1.85it/s]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 833/1545 [07:15<06:29,  1.83it/s]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 834/1545 [07:16<06:03,  1.95it/s]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 835/1545 [07:16<06:19,  1.87it/s]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 836/1545 [07:17<06:26,  1.84it/s]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 837/1545 [07:18<06:31,  1.81it/s]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 838/1545 [07:18<06:25,  1.84it/s]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 839/1545 [07:19<06:31,  1.80it/s]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 840/1545 [07:19<06:33,  1.79it/s]
                                                  
{'loss': 0.0001, 'grad_norm': 3.91155481338501e-07, 'learning_rate': 4.563106796116505e-06, 'rewards/chosen': -48.7236328125, 'rewards/rejected': -81.7518539428711, 'rewards/accuracies': 1.0, 'rewards/margins': 33.028221130371094, 'logps/chosen': -654.4888916015625, 'logps/rejected': -931.0814208984375, 'logits/chosen': -2.453652858734131, 'logits/rejected': -4.23397970199585, 'epoch': 0.54}

 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 841/1545 [07:20<06:12,  1.89it/s]
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 842/1545 [07:20<06:21,  1.84it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 843/1545 [07:21<06:25,  1.82it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 844/1545 [07:21<06:21,  1.84it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 845/1545 [07:22<06:10,  1.89it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 846/1545 [07:22<06:18,  1.85it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 847/1545 [07:23<06:19,  1.84it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 848/1545 [07:23<05:54,  1.97it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 849/1545 [07:24<06:07,  1.89it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 850/1545 [07:25<06:14,  1.86it/s]
                                                  
{'loss': 0.0, 'grad_norm': 2.286988957586611e-19, 'learning_rate': 4.498381877022654e-06, 'rewards/chosen': -44.718666076660156, 'rewards/rejected': -92.24812316894531, 'rewards/accuracies': 1.0, 'rewards/margins': 47.529449462890625, 'logps/chosen': -589.9989013671875, 'logps/rejected': -1017.69140625, 'logits/chosen': -3.3844847679138184, 'logits/rejected': -4.966015338897705, 'epoch': 0.55}

 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 851/1545 [07:25<06:21,  1.82it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 852/1545 [07:26<05:57,  1.94it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 853/1545 [07:26<06:09,  1.87it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 854/1545 [07:27<06:16,  1.83it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 855/1545 [07:27<06:05,  1.89it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 856/1545 [07:28<06:06,  1.88it/s]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 857/1545 [07:28<06:19,  1.81it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 858/1545 [07:29<06:22,  1.80it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 859/1545 [07:29<05:49,  1.96it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 860/1545 [07:30<06:02,  1.89it/s]
                                                  
{'loss': 0.8693, 'grad_norm': 0.0, 'learning_rate': 4.433656957928803e-06, 'rewards/chosen': -40.95757293701172, 'rewards/rejected': -79.82716369628906, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 38.86958694458008, 'logps/chosen': -559.9370727539062, 'logps/rejected': -917.9359130859375, 'logits/chosen': -3.1516611576080322, 'logits/rejected': -4.5850043296813965, 'epoch': 0.56}

 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 861/1545 [07:30<06:11,  1.84it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 862/1545 [07:31<06:10,  1.85it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 863/1545 [07:31<06:05,  1.87it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 864/1545 [07:32<06:12,  1.83it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 865/1545 [07:33<06:18,  1.80it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 866/1545 [07:33<05:57,  1.90it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 867/1545 [07:34<06:09,  1.84it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 868/1545 [07:34<05:36,  2.01it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 869/1545 [07:35<05:47,  1.95it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 870/1545 [07:35<05:25,  2.07it/s]
                                                  
{'loss': 1.2715, 'grad_norm': 0.0, 'learning_rate': 4.368932038834952e-06, 'rewards/chosen': -32.4947509765625, 'rewards/rejected': -78.07811737060547, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 45.58336639404297, 'logps/chosen': -480.0074157714844, 'logps/rejected': -896.2662963867188, 'logits/chosen': -2.7316079139709473, 'logits/rejected': -4.247876167297363, 'epoch': 0.56}

 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 871/1545 [07:36<05:46,  1.95it/s]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 872/1545 [07:36<05:55,  1.89it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 873/1545 [07:37<05:58,  1.88it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 874/1545 [07:37<05:49,  1.92it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 875/1545 [07:38<05:59,  1.86it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 876/1545 [07:38<06:03,  1.84it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 877/1545 [07:39<05:44,  1.94it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 878/1545 [07:39<05:56,  1.87it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 879/1545 [07:40<05:24,  2.05it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 880/1545 [07:40<05:45,  1.93it/s]
                                                  
{'loss': 2.2344, 'grad_norm': 3.552436828613281e-05, 'learning_rate': 4.304207119741101e-06, 'rewards/chosen': -30.8316707611084, 'rewards/rejected': -63.74101638793945, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 32.909339904785156, 'logps/chosen': -454.8379821777344, 'logps/rejected': -745.82373046875, 'logits/chosen': -2.527254581451416, 'logits/rejected': -3.76324725151062, 'epoch': 0.57}

 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 881/1545 [07:41<05:34,  1.98it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 882/1545 [07:41<05:45,  1.92it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 883/1545 [07:42<05:49,  1.89it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 884/1545 [07:42<05:48,  1.90it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 885/1545 [07:43<05:36,  1.96it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 886/1545 [07:44<05:55,  1.85it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 887/1545 [07:44<05:52,  1.87it/s]
 57%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 888/1545 [07:45<05:39,  1.94it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 889/1545 [07:46<08:42,  1.25it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 890/1545 [07:47<07:52,  1.39it/s]
                                                  
{'loss': 0.0, 'grad_norm': 3.0547380447387695e-07, 'learning_rate': 4.23948220064725e-06, 'rewards/chosen': -15.884663581848145, 'rewards/rejected': -51.30836868286133, 'rewards/accuracies': 1.0, 'rewards/margins': 35.423702239990234, 'logps/chosen': -304.71734619140625, 'logps/rejected': -617.8062744140625, 'logits/chosen': -1.6831843852996826, 'logits/rejected': -3.7176902294158936, 'epoch': 0.58}

 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 891/1545 [07:47<07:17,  1.50it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 892/1545 [07:48<06:43,  1.62it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 893/1545 [07:48<06:37,  1.64it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 894/1545 [07:49<06:27,  1.68it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 895/1545 [07:49<05:58,  1.81it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 896/1545 [07:50<06:07,  1.77it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 897/1545 [07:50<06:11,  1.74it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 898/1545 [07:51<06:01,  1.79it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 899/1545 [07:51<05:54,  1.82it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 900/1545 [07:52<05:58,  1.80it/s]
                                                  
{'loss': 0.0021, 'grad_norm': 2.453125, 'learning_rate': 4.1747572815533986e-06, 'rewards/chosen': -22.741947174072266, 'rewards/rejected': -60.978431701660156, 'rewards/accuracies': 1.0, 'rewards/margins': 38.236488342285156, 'logps/chosen': -365.04083251953125, 'logps/rejected': -720.0584716796875, 'logits/chosen': -2.4995644092559814, 'logits/rejected': -3.6995277404785156, 'epoch': 0.58}

 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 901/1545 [07:53<06:02,  1.78it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 902/1545 [07:53<05:34,  1.92it/s]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 903/1545 [07:54<05:43,  1.87it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 904/1545 [07:54<05:45,  1.86it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 905/1545 [07:55<05:46,  1.85it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 906/1545 [07:55<05:21,  1.99it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 907/1545 [07:56<06:46,  1.57it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 908/1545 [07:56<06:03,  1.75it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 909/1545 [07:57<05:44,  1.84it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 910/1545 [07:57<05:56,  1.78it/s]
                                                  
{'loss': 0.0, 'grad_norm': 0.001312255859375, 'learning_rate': 4.1100323624595475e-06, 'rewards/chosen': -24.250585556030273, 'rewards/rejected': -54.26154708862305, 'rewards/accuracies': 1.0, 'rewards/margins': 30.010961532592773, 'logps/chosen': -370.4151611328125, 'logps/rejected': -644.9264526367188, 'logits/chosen': -2.4921982288360596, 'logits/rejected': -3.8125457763671875, 'epoch': 0.59}

 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 910/1545 [07:58<05:56,  1.78it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 911/1545 [07:58<05:59,  1.76it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 912/1545 [07:58<05:20,  1.97it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 913/1545 [07:59<05:02,  2.09it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 914/1545 [07:59<04:43,  2.22it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 915/1545 [08:00<05:10,  2.03it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 916/1545 [08:00<05:22,  1.95it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 917/1545 [08:01<04:39,  2.25it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 918/1545 [08:01<05:09,  2.02it/s]
 59%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 919/1545 [08:02<05:21,  1.95it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 920/1545 [08:02<05:27,  1.91it/s]
                                                  
{'loss': 0.0141, 'grad_norm': 3.790855407714844e-05, 'learning_rate': 4.045307443365696e-06, 'rewards/chosen': -25.498334884643555, 'rewards/rejected': -52.45038986206055, 'rewards/accuracies': 1.0, 'rewards/margins': 26.952056884765625, 'logps/chosen': -437.73626708984375, 'logps/rejected': -665.8396606445312, 'logits/chosen': -2.0662381649017334, 'logits/rejected': -3.4163360595703125, 'epoch': 0.6}

 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 921/1545 [08:03<05:28,  1.90it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 922/1545 [08:04<05:37,  1.85it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 923/1545 [08:04<05:40,  1.82it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 924/1545 [08:04<05:16,  1.96it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 925/1545 [08:05<05:30,  1.87it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    | 926/1545 [08:06<05:42,  1.81it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 927/1545 [08:06<05:38,  1.82it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 928/1545 [08:07<05:29,  1.88it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 929/1545 [08:07<05:41,  1.80it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 930/1545 [08:08<05:35,  1.84it/s]
                                                  
{'loss': 0.0, 'grad_norm': 9.441375732421875e-05, 'learning_rate': 3.980582524271845e-06, 'rewards/chosen': -17.19916343688965, 'rewards/rejected': -57.50432586669922, 'rewards/accuracies': 1.0, 'rewards/margins': 40.305152893066406, 'logps/chosen': -328.04193115234375, 'logps/rejected': -684.6046752929688, 'logits/chosen': -2.187042236328125, 'logits/rejected': -4.470019340515137, 'epoch': 0.6}

 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 931/1545 [08:08<05:01,  2.03it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 932/1545 [08:09<04:25,  2.31it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 933/1545 [08:09<04:15,  2.39it/s]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 934/1545 [08:09<04:46,  2.13it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 935/1545 [08:10<05:12,  1.95it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 936/1545 [08:11<05:03,  2.01it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 937/1545 [08:11<05:19,  1.91it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 938/1545 [08:12<05:27,  1.85it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 939/1545 [08:12<05:29,  1.84it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 940/1545 [08:13<05:12,  1.94it/s]
                                                  
{'loss': 0.0009, 'grad_norm': 5.857145879417658e-10, 'learning_rate': 3.915857605177994e-06, 'rewards/chosen': -24.126405715942383, 'rewards/rejected': -60.42162322998047, 'rewards/accuracies': 1.0, 'rewards/margins': 36.29521942138672, 'logps/chosen': -361.6978454589844, 'logps/rejected': -705.1375732421875, 'logits/chosen': -2.889936923980713, 'logits/rejected': -4.388113975524902, 'epoch': 0.61}

 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 941/1545 [08:13<05:27,  1.84it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 942/1545 [08:14<05:29,  1.83it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 943/1545 [08:14<05:16,  1.90it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 944/1545 [08:15<05:22,  1.86it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 945/1545 [08:15<05:26,  1.84it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 946/1545 [08:16<05:34,  1.79it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 947/1545 [08:16<05:08,  1.94it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 948/1545 [08:17<05:22,  1.85it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 949/1545 [08:18<05:25,  1.83it/s]
 61%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 950/1545 [08:18<05:18,  1.87it/s]
                                                  
{'loss': 0.1108, 'grad_norm': 3.202843331805323e-21, 'learning_rate': 3.851132686084142e-06, 'rewards/chosen': -30.862689971923828, 'rewards/rejected': -68.66695404052734, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 37.804264068603516, 'logps/chosen': -447.46221923828125, 'logps/rejected': -795.8951416015625, 'logits/chosen': -2.5695688724517822, 'logits/rejected': -4.074126243591309, 'epoch': 0.61}

 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 951/1545 [08:19<05:24,  1.83it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 952/1545 [08:19<05:27,  1.81it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 953/1545 [08:20<05:30,  1.79it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 954/1545 [08:20<05:07,  1.92it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 955/1545 [08:21<05:16,  1.87it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 956/1545 [08:21<05:17,  1.85it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 957/1545 [08:22<05:15,  1.86it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 958/1545 [08:22<05:07,  1.91it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 959/1545 [08:23<05:18,  1.84it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 960/1545 [08:24<05:23,  1.81it/s]
                                                  
{'loss': 0.0938, 'grad_norm': 1.7848833522293717e-11, 'learning_rate': 3.7864077669902915e-06, 'rewards/chosen': -31.656015396118164, 'rewards/rejected': -76.21475982666016, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 44.558738708496094, 'logps/chosen': -445.54949951171875, 'logps/rejected': -868.03564453125, 'logits/chosen': -2.807201862335205, 'logits/rejected': -4.515857696533203, 'epoch': 0.62}

 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 961/1545 [08:24<05:05,  1.91it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 962/1545 [08:25<05:18,  1.83it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 963/1545 [08:25<05:20,  1.81it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 964/1545 [08:26<05:18,  1.83it/s]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 965/1545 [08:26<05:09,  1.88it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 966/1545 [08:27<05:17,  1.82it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 967/1545 [08:27<05:22,  1.79it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 968/1545 [08:28<05:00,  1.92it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 969/1545 [08:28<05:17,  1.81it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 970/1545 [08:29<05:20,  1.80it/s]
                                                  
{'loss': 0.0, 'grad_norm': 1.8596649169921875e-05, 'learning_rate': 3.721682847896441e-06, 'rewards/chosen': -31.059711456298828, 'rewards/rejected': -74.49284362792969, 'rewards/accuracies': 1.0, 'rewards/margins': 43.43313217163086, 'logps/chosen': -468.68341064453125, 'logps/rejected': -852.8173828125, 'logits/chosen': -2.2017874717712402, 'logits/rejected': -4.380518913269043, 'epoch': 0.63}

 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 971/1545 [08:29<04:51,  1.97it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 972/1545 [08:30<04:11,  2.28it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 973/1545 [08:30<04:36,  2.07it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 974/1545 [08:31<04:48,  1.98it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 975/1545 [08:31<04:57,  1.91it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 976/1545 [08:32<04:36,  2.06it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 977/1545 [08:32<04:51,  1.95it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 978/1545 [08:33<04:56,  1.91it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 979/1545 [08:33<04:58,  1.90it/s]
 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 980/1545 [08:34<04:52,  1.93it/s]
                                                  
{'loss': 0.0, 'grad_norm': 2.656295322589486e-17, 'learning_rate': 3.6569579288025893e-06, 'rewards/chosen': -23.480316162109375, 'rewards/rejected': -78.44859313964844, 'rewards/accuracies': 1.0, 'rewards/margins': 54.9682731628418, 'logps/chosen': -396.80621337890625, 'logps/rejected': -915.10791015625, 'logits/chosen': -2.1880345344543457, 'logits/rejected': -4.083585739135742, 'epoch': 0.63}

 63%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 981/1545 [08:35<05:05,  1.84it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 982/1545 [08:35<05:08,  1.83it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 983/1545 [08:36<04:48,  1.94it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 984/1545 [08:36<05:01,  1.86it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 985/1545 [08:37<05:05,  1.84it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 986/1545 [08:37<05:01,  1.85it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 987/1545 [08:38<04:48,  1.93it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 988/1545 [08:38<04:56,  1.88it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 989/1545 [08:39<05:02,  1.84it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 990/1545 [08:39<04:52,  1.90it/s]
                                                  
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 3.592233009708738e-06, 'rewards/chosen': -31.23908042907715, 'rewards/rejected': -84.39486694335938, 'rewards/accuracies': 1.0, 'rewards/margins': 53.155784606933594, 'logps/chosen': -469.037109375, 'logps/rejected': -947.4615478515625, 'logits/chosen': -2.7216758728027344, 'logits/rejected': -4.766693115234375, 'epoch': 0.64}

 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 991/1545 [08:40<05:05,  1.81it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 992/1545 [08:41<05:06,  1.80it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 993/1545 [08:41<04:36,  2.00it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 994/1545 [08:41<04:24,  2.08it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 995/1545 [08:42<04:40,  1.96it/s]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 996/1545 [08:42<04:50,  1.89it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 997/1545 [08:43<04:52,  1.88it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 998/1545 [08:43<04:31,  2.02it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 999/1545 [08:44<04:45,  1.91it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1000/1545 [08:45<04:48,  1.89it/s]
                                                   
{'loss': 0.0, 'grad_norm': 2.8189256484623115e-18, 'learning_rate': 3.5275080906148866e-06, 'rewards/chosen': -29.645822525024414, 'rewards/rejected': -76.21923065185547, 'rewards/accuracies': 1.0, 'rewards/margins': 46.57341384887695, 'logps/chosen': -474.364990234375, 'logps/rejected': -886.2589721679688, 'logits/chosen': -2.4084129333496094, 'logits/rejected': -3.961566209793091, 'epoch': 0.65}

 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1001/1545 [08:45<04:48,  1.88it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1002/1545 [08:46<04:52,  1.86it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1003/1545 [08:47<07:29,  1.20it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 1004/1545 [08:48<06:42,  1.34it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1005/1545 [08:48<05:51,  1.54it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1006/1545 [08:49<05:40,  1.58it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1007/1545 [08:49<05:29,  1.63it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1008/1545 [08:50<05:10,  1.73it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1009/1545 [08:50<05:14,  1.70it/s]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1010/1545 [08:51<05:14,  1.70it/s]
                                                   
{'loss': 0.0, 'grad_norm': 6.352747104407253e-22, 'learning_rate': 3.462783171521036e-06, 'rewards/chosen': -32.23039245605469, 'rewards/rejected': -103.32981872558594, 'rewards/accuracies': 1.0, 'rewards/margins': 71.09942626953125, 'logps/chosen': -480.616455078125, 'logps/rejected': -1155.624267578125, 'logits/chosen': -2.7920470237731934, 'logits/rejected': -4.624792575836182, 'epoch': 0.65}

 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1011/1545 [08:52<05:09,  1.73it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1012/1545 [08:52<04:51,  1.83it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1013/1545 [08:53<04:57,  1.79it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1014/1545 [08:53<04:57,  1.79it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1015/1545 [08:54<04:42,  1.88it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1016/1545 [08:54<04:18,  2.05it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1017/1545 [08:54<04:00,  2.20it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1018/1545 [08:55<04:18,  2.04it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1019/1545 [08:56<04:24,  1.99it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1020/1545 [08:56<04:24,  1.99it/s]
                                                   
{'loss': 0.0, 'grad_norm': 1.895427703857422e-05, 'learning_rate': 3.398058252427185e-06, 'rewards/chosen': -47.92738342285156, 'rewards/rejected': -90.48072052001953, 'rewards/accuracies': 1.0, 'rewards/margins': 42.553340911865234, 'logps/chosen': -608.0487060546875, 'logps/rejected': -1020.53955078125, 'logits/chosen': -3.5880520343780518, 'logits/rejected': -4.89176607131958, 'epoch': 0.66}

 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1021/1545 [08:57<04:38,  1.88it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1022/1545 [08:57<04:45,  1.83it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 1023/1545 [08:58<04:24,  1.97it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1024/1545 [08:58<04:35,  1.89it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1025/1545 [08:59<04:41,  1.85it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1026/1545 [08:59<04:38,  1.86it/s]
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1027/1545 [09:00<04:26,  1.94it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1028/1545 [09:00<04:37,  1.86it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1029/1545 [09:01<04:41,  1.84it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1030/1545 [09:01<04:12,  2.04it/s]
                                                   
{'loss': 0.0067, 'grad_norm': 0.000110626220703125, 'learning_rate': 3.3333333333333333e-06, 'rewards/chosen': -31.464359283447266, 'rewards/rejected': -79.84764862060547, 'rewards/accuracies': 1.0, 'rewards/margins': 48.38329315185547, 'logps/chosen': -460.6871032714844, 'logps/rejected': -919.3955078125, 'logits/chosen': -2.863114833831787, 'logits/rejected': -4.340217113494873, 'epoch': 0.67}

 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1031/1545 [09:02<04:15,  2.01it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1032/1545 [09:02<04:26,  1.92it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1033/1545 [09:03<04:29,  1.90it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1034/1545 [09:03<04:16,  1.99it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1035/1545 [09:04<04:28,  1.90it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1036/1545 [09:04<04:33,  1.86it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1037/1545 [09:05<04:35,  1.84it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1038/1545 [09:05<04:20,  1.95it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1039/1545 [09:06<04:29,  1.88it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1040/1545 [09:07<04:34,  1.84it/s]
                                                   
{'loss': 0.0878, 'grad_norm': 2.0161650127192843e-13, 'learning_rate': 3.2686084142394826e-06, 'rewards/chosen': -22.702207565307617, 'rewards/rejected': -70.41169738769531, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 47.7094841003418, 'logps/chosen': -373.5330810546875, 'logps/rejected': -809.3206787109375, 'logits/chosen': -2.131121873855591, 'logits/rejected': -4.659956932067871, 'epoch': 0.67}

 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1041/1545 [09:07<04:37,  1.81it/s]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 1042/1545 [09:08<05:11,  1.61it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1043/1545 [09:09<05:29,  1.52it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1044/1545 [09:09<05:04,  1.65it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1045/1545 [09:10<04:30,  1.85it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1046/1545 [09:10<04:49,  1.73it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1047/1545 [09:11<04:48,  1.73it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1048/1545 [09:11<04:22,  1.89it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1049/1545 [09:12<04:33,  1.81it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1050/1545 [09:12<04:05,  2.01it/s]
                                                   
{'loss': 0.0693, 'grad_norm': 1.5802470443304628e-11, 'learning_rate': 3.2038834951456315e-06, 'rewards/chosen': -28.963123321533203, 'rewards/rejected': -69.37215423583984, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 40.409034729003906, 'logps/chosen': -429.06085205078125, 'logps/rejected': -803.1228637695312, 'logits/chosen': -2.533982038497925, 'logits/rejected': -4.221534252166748, 'epoch': 0.68}

 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1051/1545 [09:13<04:13,  1.95it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1052/1545 [09:13<04:01,  2.04it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1053/1545 [09:14<04:16,  1.92it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1054/1545 [09:14<04:23,  1.86it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1055/1545 [09:15<04:14,  1.93it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1056/1545 [09:15<04:21,  1.87it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1057/1545 [09:16<04:26,  1.83it/s]
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1058/1545 [09:17<04:28,  1.81it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1059/1545 [09:17<04:07,  1.96it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1060/1545 [09:18<04:14,  1.90it/s]
                                                   
{'loss': 0.0598, 'grad_norm': 0.0, 'learning_rate': 3.13915857605178e-06, 'rewards/chosen': -34.7342643737793, 'rewards/rejected': -71.29218292236328, 'rewards/accuracies': 1.0, 'rewards/margins': 36.55791091918945, 'logps/chosen': -516.7052612304688, 'logps/rejected': -824.2127685546875, 'logits/chosen': -3.0466580390930176, 'logits/rejected': -4.670151710510254, 'epoch': 0.69}

 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1061/1545 [09:18<04:19,  1.86it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 1062/1545 [09:19<04:21,  1.85it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1063/1545 [09:19<04:43,  1.70it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1064/1545 [09:20<05:05,  1.57it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1065/1545 [09:21<04:41,  1.70it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1066/1545 [09:21<04:41,  1.70it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1067/1545 [09:22<04:38,  1.72it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1068/1545 [09:22<04:35,  1.73it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1069/1545 [09:23<04:17,  1.85it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1070/1545 [09:23<04:22,  1.81it/s]
                                                   
{'loss': 0.0, 'grad_norm': 6.733911930671688e-20, 'learning_rate': 3.0744336569579293e-06, 'rewards/chosen': -25.6965274810791, 'rewards/rejected': -69.01310729980469, 'rewards/accuracies': 1.0, 'rewards/margins': 43.31658172607422, 'logps/chosen': -386.03173828125, 'logps/rejected': -804.2999877929688, 'logits/chosen': -2.817981004714966, 'logits/rejected': -4.601845741271973, 'epoch': 0.69}

 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1071/1545 [09:24<04:26,  1.78it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1072/1545 [09:24<04:21,  1.81it/s]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1073/1545 [09:25<04:17,  1.83it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1074/1545 [09:26<04:21,  1.80it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1075/1545 [09:26<04:22,  1.79it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1076/1545 [09:27<04:06,  1.90it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1077/1545 [09:27<04:15,  1.83it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1078/1545 [09:28<04:16,  1.82it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1079/1545 [09:28<04:14,  1.83it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1080/1545 [09:29<04:08,  1.87it/s]
                                                   
{'loss': 0.0, 'grad_norm': 4.3298697960381105e-15, 'learning_rate': 3.0097087378640778e-06, 'rewards/chosen': -30.43539047241211, 'rewards/rejected': -81.25576782226562, 'rewards/accuracies': 1.0, 'rewards/margins': 50.82037353515625, 'logps/chosen': -435.7405700683594, 'logps/rejected': -922.3958740234375, 'logits/chosen': -3.113784074783325, 'logits/rejected': -4.786489963531494, 'epoch': 0.7}

 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 1081/1545 [09:29<04:16,  1.81it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1082/1545 [09:30<04:14,  1.82it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1083/1545 [09:30<03:58,  1.93it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1084/1545 [09:31<04:07,  1.86it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1085/1545 [09:32<04:12,  1.82it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1086/1545 [09:32<04:11,  1.82it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1087/1545 [09:33<04:09,  1.84it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1088/1545 [09:33<04:12,  1.81it/s]
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1089/1545 [09:34<04:12,  1.80it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1090/1545 [09:34<03:54,  1.94it/s]
                                                   
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.9449838187702267e-06, 'rewards/chosen': -32.56324005126953, 'rewards/rejected': -83.65999603271484, 'rewards/accuracies': 1.0, 'rewards/margins': 51.09675979614258, 'logps/chosen': -480.71697998046875, 'logps/rejected': -947.0802612304688, 'logits/chosen': -2.62692928314209, 'logits/rejected': -4.6637372970581055, 'epoch': 0.71}

 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1091/1545 [09:35<04:03,  1.87it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1092/1545 [09:35<04:06,  1.84it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1093/1545 [09:36<04:03,  1.86it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1094/1545 [09:36<03:48,  1.97it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1095/1545 [09:37<03:56,  1.90it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1096/1545 [09:37<04:01,  1.86it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1097/1545 [09:38<03:55,  1.91it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1098/1545 [09:38<03:56,  1.89it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1099/1545 [09:39<03:35,  2.07it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 1100/1545 [09:39<03:47,  1.96it/s]
                                                   
{'loss': 0.0, 'grad_norm': 1.8925892415213273e-21, 'learning_rate': 2.880258899676376e-06, 'rewards/chosen': -30.453670501708984, 'rewards/rejected': -84.54370880126953, 'rewards/accuracies': 1.0, 'rewards/margins': 54.09003448486328, 'logps/chosen': -462.65472412109375, 'logps/rejected': -947.7019653320312, 'logits/chosen': -2.7360405921936035, 'logits/rejected': -4.802727699279785, 'epoch': 0.71}

 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 1101/1545 [09:40<03:39,  2.02it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 1102/1545 [09:40<03:50,  1.92it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 1103/1545 [09:41<04:02,  1.82it/s]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 1104/1545 [09:41<03:38,  2.02it/s]
 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 1105/1545 [09:42<03:30,  2.09it/s]
 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 1106/1545 [09:42<03:46,  1.94it/s]
 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 1107/1545 [09:43<03:51,  1.89it/s]
 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 1108/1545 [09:44<03:48,  1.91it/s]
 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 1109/1545 [09:44<03:18,  2.20it/s]
 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 1110/1545 [09:44<03:36,  2.01it/s]
                                                   
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.8155339805825245e-06, 'rewards/chosen': -17.798450469970703, 'rewards/rejected': -69.56123352050781, 'rewards/accuracies': 1.0, 'rewards/margins': 51.76277542114258, 'logps/chosen': -317.5895690917969, 'logps/rejected': -803.85595703125, 'logits/chosen': -2.070510149002075, 'logits/rejected': -4.086310863494873, 'epoch': 0.72}

 72%|███████▏  | 1120/1545 [09:51<04:27,  1.59it/s]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.7508090614886734e-06, 'rewards/chosen': -21.955896377563477, 'rewards/rejected': -73.12271881103516, 'rewards/accuracies': 1.0, 'rewards/margins': 51.166812896728516, 'logps/chosen': -381.0746154785156, 'logps/rejected': -851.5888671875, 'logits/chosen': -2.0508859157562256, 'logits/rejected': -4.157201290130615, 'epoch': 0.72}

 73%|███████▎  | 1130/1545 [09:56<03:49,  1.81it/s]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.686084142394822e-06, 'rewards/chosen': -28.051372528076172, 'rewards/rejected': -90.2176284790039, 'rewards/accuracies': 1.0, 'rewards/margins': 62.1662483215332, 'logps/chosen': -435.447509765625, 'logps/rejected': -1026.18798828125, 'logits/chosen': -2.331815242767334, 'logits/rejected': -4.296026706695557, 'epoch': 0.73}

 74%|███████▍  | 1140/1545 [10:02<03:30,  1.93it/s]
{'loss': 0.0, 'grad_norm': 8.348877145181177e-14, 'learning_rate': 2.621359223300971e-06, 'rewards/chosen': -28.582677841186523, 'rewards/rejected': -89.46198272705078, 'rewards/accuracies': 1.0, 'rewards/margins': 60.879295349121094, 'logps/chosen': -428.8086853027344, 'logps/rejected': -1000.7501831054688, 'logits/chosen': -2.752683162689209, 'logits/rejected': -4.274931907653809, 'epoch': 0.74}

 74%|███████▍  | 1150/1545 [10:07<03:37,  1.81it/s]
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.55663430420712e-06, 'rewards/chosen': -30.311452865600586, 'rewards/rejected': -88.15357971191406, 'rewards/accuracies': 1.0, 'rewards/margins': 57.84212112426758, 'logps/chosen': -434.56097412109375, 'logps/rejected': -994.1755981445312, 'logits/chosen': -2.7379162311553955, 'logits/rejected': -4.308984279632568, 'epoch': 0.74}

 75%|███████▌  | 1160/1545 [10:13<03:27,  1.86it/s]
{'loss': 5.2559, 'grad_norm': 1.4543533325195312e-05, 'learning_rate': 2.491909385113269e-06, 'rewards/chosen': -36.78795623779297, 'rewards/rejected': -50.73297119140625, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 13.945019721984863, 'logps/chosen': -489.45684814453125, 'logps/rejected': -604.6040649414062, 'logits/chosen': -2.9106650352478027, 'logits/rejected': -3.734529972076416, 'epoch': 0.75}

 76%|███████▌  | 1170/1545 [10:18<03:23,  1.84it/s]
{'loss': 2.6137, 'grad_norm': 4.929390229335695e-14, 'learning_rate': 2.427184466019418e-06, 'rewards/chosen': -25.188079833984375, 'rewards/rejected': -58.6072998046875, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 33.419219970703125, 'logps/chosen': -412.2906188964844, 'logps/rejected': -685.932861328125, 'logits/chosen': -1.7462126016616821, 'logits/rejected': -3.5213539600372314, 'epoch': 0.76}

 76%|███████▋  | 1180/1545 [10:23<03:17,  1.85it/s]
{'loss': 0.6655, 'grad_norm': 0.0002689361572265625, 'learning_rate': 2.3624595469255667e-06, 'rewards/chosen': -22.50977325439453, 'rewards/rejected': -50.339866638183594, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 27.830097198486328, 'logps/chosen': -364.30804443359375, 'logps/rejected': -620.8726806640625, 'logits/chosen': -2.1491000652313232, 'logits/rejected': -3.5545706748962402, 'epoch': 0.76}

 77%|███████▋  | 1190/1545 [10:28<02:49,  2.09it/s]
{'loss': 0.6483, 'grad_norm': 6.4375, 'learning_rate': 2.297734627831715e-06, 'rewards/chosen': -29.0391788482666, 'rewards/rejected': -50.39924240112305, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 21.360071182250977, 'logps/chosen': -452.758056640625, 'logps/rejected': -620.355712890625, 'logits/chosen': -2.2422232627868652, 'logits/rejected': -2.9050989151000977, 'epoch': 0.77}

 78%|███████▊  | 1200/1545 [10:34<03:06,  1.85it/s]
{'loss': 0.0049, 'grad_norm': 0.119140625, 'learning_rate': 2.2330097087378645e-06, 'rewards/chosen': -24.393327713012695, 'rewards/rejected': -47.943050384521484, 'rewards/accuracies': 1.0, 'rewards/margins': 23.549720764160156, 'logps/chosen': -420.3701171875, 'logps/rejected': -596.0790405273438, 'logits/chosen': -2.2110846042633057, 'logits/rejected': -3.3869075775146484, 'epoch': 0.78}

 78%|███████▊  | 1210/1545 [10:39<03:06,  1.80it/s]
{'loss': 0.0022, 'grad_norm': 2.473825588822365e-10, 'learning_rate': 2.1682847896440134e-06, 'rewards/chosen': -17.865224838256836, 'rewards/rejected': -44.9857063293457, 'rewards/accuracies': 1.0, 'rewards/margins': 27.120479583740234, 'logps/chosen': -318.7569274902344, 'logps/rejected': -567.396240234375, 'logits/chosen': -2.13226580619812, 'logits/rejected': -3.3080012798309326, 'epoch': 0.78}

 79%|███████▉  | 1220/1545 [10:44<02:51,  1.90it/s]
{'loss': 1.6899, 'grad_norm': 1152.0, 'learning_rate': 2.103559870550162e-06, 'rewards/chosen': -28.565698623657227, 'rewards/rejected': -47.352169036865234, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 18.78647232055664, 'logps/chosen': -429.1759338378906, 'logps/rejected': -576.1998901367188, 'logits/chosen': -2.689363718032837, 'logits/rejected': -3.7332401275634766, 'epoch': 0.79}

 80%|███████▉  | 1230/1545 [10:50<03:27,  1.52it/s]
{'loss': 0.8983, 'grad_norm': 2.4400651454925537e-07, 'learning_rate': 2.0388349514563107e-06, 'rewards/chosen': -17.619848251342773, 'rewards/rejected': -43.817138671875, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 26.19728660583496, 'logps/chosen': -310.8443298339844, 'logps/rejected': -550.3328857421875, 'logits/chosen': -2.216456651687622, 'logits/rejected': -3.418893814086914, 'epoch': 0.8}

 80%|████████  | 1240/1545 [10:56<02:54,  1.75it/s]
{'loss': 1.4954, 'grad_norm': 1.7394086171407253e-11, 'learning_rate': 1.9741100323624596e-06, 'rewards/chosen': -20.375286102294922, 'rewards/rejected': -44.56410217285156, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 24.18881607055664, 'logps/chosen': -332.5040283203125, 'logps/rejected': -539.218505859375, 'logits/chosen': -2.468999147415161, 'logits/rejected': -3.7816715240478516, 'epoch': 0.8}

 81%|████████  | 1250/1545 [11:01<02:45,  1.79it/s]
{'loss': 0.0, 'grad_norm': 0.0177001953125, 'learning_rate': 1.9093851132686085e-06, 'rewards/chosen': -10.88851261138916, 'rewards/rejected': -38.96172332763672, 'rewards/accuracies': 1.0, 'rewards/margins': 28.07320785522461, 'logps/chosen': -251.8507537841797, 'logps/rejected': -504.0205993652344, 'logits/chosen': -1.7540420293807983, 'logits/rejected': -3.0294809341430664, 'epoch': 0.81}

 82%|████████▏ | 1260/1545 [11:07<02:38,  1.80it/s]
{'loss': 0.0, 'grad_norm': 8.96453857421875e-05, 'learning_rate': 1.8446601941747574e-06, 'rewards/chosen': -16.425739288330078, 'rewards/rejected': -49.47471237182617, 'rewards/accuracies': 1.0, 'rewards/margins': 33.04896926879883, 'logps/chosen': -337.70025634765625, 'logps/rejected': -628.4820556640625, 'logits/chosen': -1.5739221572875977, 'logits/rejected': -3.0157792568206787, 'epoch': 0.82}

 82%|████████▏ | 1270/1545 [11:12<02:29,  1.84it/s]
{'loss': 0.2319, 'grad_norm': 1.3096723705530167e-10, 'learning_rate': 1.7799352750809063e-06, 'rewards/chosen': -19.763113021850586, 'rewards/rejected': -38.63679504394531, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 18.873680114746094, 'logps/chosen': -342.18231201171875, 'logps/rejected': -496.32232666015625, 'logits/chosen': -2.013960361480713, 'logits/rejected': -2.9975972175598145, 'epoch': 0.82}

 83%|████████▎ | 1280/1545 [11:17<02:17,  1.93it/s]
{'loss': 0.444, 'grad_norm': 7.771561172376096e-16, 'learning_rate': 1.715210355987055e-06, 'rewards/chosen': -16.24677085876465, 'rewards/rejected': -37.7825927734375, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 21.535823822021484, 'logps/chosen': -301.96905517578125, 'logps/rejected': -496.69610595703125, 'logits/chosen': -1.7921947240829468, 'logits/rejected': -2.7485709190368652, 'epoch': 0.83}

 83%|████████▎ | 1290/1545 [11:23<02:11,  1.93it/s]
{'loss': 0.066, 'grad_norm': 6.668269634246826e-07, 'learning_rate': 1.650485436893204e-06, 'rewards/chosen': -12.632109642028809, 'rewards/rejected': -42.14348602294922, 'rewards/accuracies': 1.0, 'rewards/margins': 29.511377334594727, 'logps/chosen': -311.40838623046875, 'logps/rejected': -553.1618041992188, 'logits/chosen': -1.4785627126693726, 'logits/rejected': -2.7886316776275635, 'epoch': 0.83}

 84%|████████▍ | 1300/1545 [11:28<02:13,  1.84it/s]
{'loss': 0.0, 'grad_norm': 0.01336669921875, 'learning_rate': 1.585760517799353e-06, 'rewards/chosen': -15.508363723754883, 'rewards/rejected': -44.09370803833008, 'rewards/accuracies': 1.0, 'rewards/margins': 28.585346221923828, 'logps/chosen': -305.378662109375, 'logps/rejected': -551.26708984375, 'logits/chosen': -1.9857141971588135, 'logits/rejected': -2.942575454711914, 'epoch': 0.84}

 85%|████████▍ | 1310/1545 [11:34<02:09,  1.82it/s]
{'loss': 0.2831, 'grad_norm': 5.438923835754395e-07, 'learning_rate': 1.5210355987055017e-06, 'rewards/chosen': -18.139490127563477, 'rewards/rejected': -49.077415466308594, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 30.93792724609375, 'logps/chosen': -349.8685607910156, 'logps/rejected': -612.5897216796875, 'logits/chosen': -1.8611408472061157, 'logits/rejected': -3.1098215579986572, 'epoch': 0.85}

 85%|████████▌ | 1320/1545 [11:39<01:49,  2.05it/s]
{'loss': 0.0, 'grad_norm': 9.75781955236954e-17, 'learning_rate': 1.4563106796116506e-06, 'rewards/chosen': -10.908350944519043, 'rewards/rejected': -47.042503356933594, 'rewards/accuracies': 1.0, 'rewards/margins': 36.134151458740234, 'logps/chosen': -251.7775421142578, 'logps/rejected': -576.4340209960938, 'logits/chosen': -1.4616104364395142, 'logits/rejected': -3.171027898788452, 'epoch': 0.85}

 86%|████████▌ | 1330/1545 [11:44<01:54,  1.88it/s]
{'loss': 0.0001, 'grad_norm': 2.34375, 'learning_rate': 1.3915857605177997e-06, 'rewards/chosen': -12.354753494262695, 'rewards/rejected': -41.89970016479492, 'rewards/accuracies': 1.0, 'rewards/margins': 29.54494857788086, 'logps/chosen': -297.7950134277344, 'logps/rejected': -539.6881103515625, 'logits/chosen': -1.1807644367218018, 'logits/rejected': -2.6145424842834473, 'epoch': 0.86}

 87%|████████▋ | 1340/1545 [11:49<01:50,  1.85it/s]
{'loss': 0.0001, 'grad_norm': 1.2747477740049362e-08, 'learning_rate': 1.3268608414239483e-06, 'rewards/chosen': -7.87436580657959, 'rewards/rejected': -39.97648239135742, 'rewards/accuracies': 1.0, 'rewards/margins': 32.10211944580078, 'logps/chosen': -211.8810577392578, 'logps/rejected': -505.80975341796875, 'logits/chosen': -1.697493314743042, 'logits/rejected': -3.0915396213531494, 'epoch': 0.87}

 87%|████████▋ | 1350/1545 [11:56<01:49,  1.78it/s]
{'loss': 0.0, 'grad_norm': 5.327165126800537e-07, 'learning_rate': 1.2621359223300972e-06, 'rewards/chosen': -13.375717163085938, 'rewards/rejected': -49.55775833129883, 'rewards/accuracies': 1.0, 'rewards/margins': 36.18204116821289, 'logps/chosen': -304.5226135253906, 'logps/rejected': -614.50341796875, 'logits/chosen': -1.5832102298736572, 'logits/rejected': -3.0017504692077637, 'epoch': 0.87}

 88%|████████▊ | 1360/1545 [12:01<01:30,  2.04it/s]
{'loss': 0.0, 'grad_norm': 0.0023956298828125, 'learning_rate': 1.197411003236246e-06, 'rewards/chosen': -8.20715045928955, 'rewards/rejected': -40.68087387084961, 'rewards/accuracies': 1.0, 'rewards/margins': 32.47372817993164, 'logps/chosen': -225.0033416748047, 'logps/rejected': -529.2271118164062, 'logits/chosen': -1.879974603652954, 'logits/rejected': -2.8071045875549316, 'epoch': 0.88}

 89%|████████▊ | 1370/1545 [12:06<01:36,  1.81it/s]
{'loss': 0.0, 'grad_norm': 3.245077095925808e-09, 'learning_rate': 1.132686084142395e-06, 'rewards/chosen': -13.201225280761719, 'rewards/rejected': -47.207088470458984, 'rewards/accuracies': 1.0, 'rewards/margins': 34.005863189697266, 'logps/chosen': -289.7091369628906, 'logps/rejected': -593.4539794921875, 'logits/chosen': -1.5826839208602905, 'logits/rejected': -2.911431074142456, 'epoch': 0.89}

 89%|████████▊ | 1371/1545 [12:06<01:30,  1.92it/s]
 89%|████████▉ | 1372/1545 [12:07<01:33,  1.85it/s]
 89%|████████▉ | 1373/1545 [12:08<01:33,  1.84it/s]
 89%|████████▉ | 1374/1545 [12:08<01:32,  1.85it/s]
 89%|████████▉ | 1375/1545 [12:09<01:30,  1.88it/s]
 89%|████████▉ | 1376/1545 [12:09<01:32,  1.83it/s]
 89%|████████▉ | 1377/1545 [12:10<01:31,  1.83it/s]
 89%|████████▉ | 1378/1545 [12:10<01:25,  1.96it/s]
 89%|████████▉ | 1379/1545 [12:11<01:28,  1.87it/s]
 89%|████████▉ | 1380/1545 [12:11<01:29,  1.84it/s]
                                                   
{'loss': 0.0046, 'grad_norm': 2.625, 'learning_rate': 1.0679611650485437e-06, 'rewards/chosen': -19.794082641601562, 'rewards/rejected': -42.44209671020508, 'rewards/accuracies': 1.0, 'rewards/margins': 22.648014068603516, 'logps/chosen': -388.0662841796875, 'logps/rejected': -544.12451171875, 'logits/chosen': -1.7488387823104858, 'logits/rejected': -2.9612879753112793, 'epoch': 0.89}

 89%|████████▉ | 1381/1545 [12:12<01:28,  1.86it/s]
 89%|████████▉ | 1382/1545 [12:12<01:21,  2.00it/s]
 90%|████████▉ | 1383/1545 [12:13<01:25,  1.89it/s]
 90%|████████▉ | 1384/1545 [12:13<01:27,  1.84it/s]
 90%|████████▉ | 1385/1545 [12:14<01:24,  1.90it/s]
 90%|████████▉ | 1386/1545 [12:14<01:16,  2.09it/s]
 90%|████████▉ | 1387/1545 [12:15<01:20,  1.95it/s]
 90%|████████▉ | 1388/1545 [12:15<01:22,  1.90it/s]
 90%|████████▉ | 1389/1545 [12:16<01:19,  1.96it/s]
 90%|████████▉ | 1390/1545 [12:16<01:22,  1.88it/s]
                                                   
{'loss': 0.0058, 'grad_norm': 75.5, 'learning_rate': 1.0032362459546926e-06, 'rewards/chosen': -14.03381633758545, 'rewards/rejected': -39.99800109863281, 'rewards/accuracies': 1.0, 'rewards/margins': 25.964187622070312, 'logps/chosen': -306.3893127441406, 'logps/rejected': -523.348388671875, 'logits/chosen': -1.803492784500122, 'logits/rejected': -2.9958126544952393, 'epoch': 0.9}

 90%|█████████ | 1391/1545 [12:17<01:23,  1.84it/s]
 90%|█████████ | 1392/1545 [12:18<01:22,  1.85it/s]
 90%|█████████ | 1393/1545 [12:18<01:16,  1.98it/s]
 90%|█████████ | 1394/1545 [12:19<01:21,  1.86it/s]
 90%|█████████ | 1395/1545 [12:19<01:21,  1.84it/s]
 90%|█████████ | 1396/1545 [12:20<01:17,  1.91it/s]
 90%|█████████ | 1397/1545 [12:20<01:18,  1.89it/s]
 90%|█████████ | 1398/1545 [12:21<01:19,  1.85it/s]
 91%|█████████ | 1399/1545 [12:21<01:19,  1.85it/s]
 91%|█████████ | 1400/1545 [12:22<01:13,  1.98it/s]
                                                   
{'loss': 0.0, 'grad_norm': 4.298783551348606e-13, 'learning_rate': 9.385113268608415e-07, 'rewards/chosen': -17.3962345123291, 'rewards/rejected': -49.274452209472656, 'rewards/accuracies': 1.0, 'rewards/margins': 31.878215789794922, 'logps/chosen': -298.85546875, 'logps/rejected': -580.9243774414062, 'logits/chosen': -1.9781240224838257, 'logits/rejected': -3.3585548400878906, 'epoch': 0.91}

 91%|█████████ | 1401/1545 [12:22<01:15,  1.90it/s]
 91%|█████████ | 1402/1545 [12:23<01:16,  1.88it/s]
 91%|█████████ | 1403/1545 [12:23<01:15,  1.88it/s]
 91%|█████████ | 1404/1545 [12:24<01:12,  1.95it/s]
 91%|█████████ | 1405/1545 [12:24<01:06,  2.11it/s]
 91%|█████████ | 1406/1545 [12:25<01:09,  1.99it/s]
 91%|█████████ | 1407/1545 [12:25<01:10,  1.95it/s]
 91%|█████████ | 1408/1545 [12:26<01:10,  1.93it/s]
 91%|█████████ | 1409/1545 [12:26<01:13,  1.86it/s]
 91%|█████████▏| 1410/1545 [12:27<01:13,  1.83it/s]
                                                   
{'loss': 0.0, 'grad_norm': 1.4137996329210978e-16, 'learning_rate': 8.737864077669904e-07, 'rewards/chosen': -10.19818115234375, 'rewards/rejected': -47.04527282714844, 'rewards/accuracies': 1.0, 'rewards/margins': 36.847084045410156, 'logps/chosen': -266.72540283203125, 'logps/rejected': -583.9825439453125, 'logits/chosen': -1.446902871131897, 'logits/rejected': -3.2011420726776123, 'epoch': 0.91}

 91%|█████████▏| 1411/1545 [12:27<01:10,  1.89it/s]
 91%|█████████▏| 1412/1545 [12:28<01:12,  1.84it/s]
 91%|█████████▏| 1413/1545 [12:29<01:12,  1.82it/s]
 92%|█████████▏| 1414/1545 [12:29<01:11,  1.83it/s]
 92%|█████████▏| 1415/1545 [12:30<01:09,  1.86it/s]
 92%|█████████▏| 1416/1545 [12:30<01:12,  1.78it/s]
 92%|█████████▏| 1417/1545 [12:31<01:11,  1.79it/s]
 92%|█████████▏| 1418/1545 [12:31<01:06,  1.90it/s]
 92%|█████████▏| 1419/1545 [12:32<01:08,  1.85it/s]
 92%|█████████▏| 1420/1545 [12:32<01:08,  1.82it/s]
                                                   
{'loss': 0.0013, 'grad_norm': 1.0244548320770264e-07, 'learning_rate': 8.090614886731392e-07, 'rewards/chosen': -20.795812606811523, 'rewards/rejected': -50.17866897583008, 'rewards/accuracies': 1.0, 'rewards/margins': 29.38285255432129, 'logps/chosen': -362.2292785644531, 'logps/rejected': -605.4534301757812, 'logits/chosen': -2.1136467456817627, 'logits/rejected': -3.451559543609619, 'epoch': 0.92}

 92%|█████████▏| 1421/1545 [12:33<01:08,  1.82it/s]
 92%|█████████▏| 1422/1545 [12:34<01:06,  1.85it/s]
 92%|█████████▏| 1423/1545 [12:34<01:07,  1.81it/s]
 92%|█████████▏| 1424/1545 [12:35<01:07,  1.79it/s]
 92%|█████████▏| 1425/1545 [12:35<01:03,  1.90it/s]
 92%|█████████▏| 1426/1545 [12:36<01:05,  1.82it/s]
 92%|█████████▏| 1427/1545 [12:36<01:04,  1.82it/s]
 92%|█████████▏| 1428/1545 [12:37<01:04,  1.81it/s]
 92%|█████████▏| 1429/1545 [12:37<01:02,  1.85it/s]
 93%|█████████▎| 1430/1545 [12:38<01:04,  1.80it/s]
                                                   
{'loss': 0.0001, 'grad_norm': 7.486343383789062e-05, 'learning_rate': 7.443365695792882e-07, 'rewards/chosen': -17.025291442871094, 'rewards/rejected': -57.489990234375, 'rewards/accuracies': 1.0, 'rewards/margins': 40.46469497680664, 'logps/chosen': -336.61431884765625, 'logps/rejected': -702.4564208984375, 'logits/chosen': -1.5203006267547607, 'logits/rejected': -3.2374939918518066, 'epoch': 0.93}

 93%|█████████▎| 1431/1545 [12:39<01:04,  1.78it/s]
 93%|█████████▎| 1432/1545 [12:39<00:59,  1.89it/s]
 93%|█████████▎| 1433/1545 [12:40<01:01,  1.82it/s]
 93%|█████████▎| 1434/1545 [12:40<01:01,  1.80it/s]
 93%|█████████▎| 1435/1545 [12:41<01:01,  1.79it/s]
 93%|█████████▎| 1436/1545 [12:41<00:58,  1.86it/s]
 93%|█████████▎| 1437/1545 [12:42<00:59,  1.82it/s]
 93%|█████████▎| 1438/1545 [12:42<00:59,  1.80it/s]
 93%|█████████▎| 1439/1545 [12:43<00:54,  1.93it/s]
 93%|█████████▎| 1440/1545 [12:43<00:57,  1.84it/s]
                                                   
{'loss': 0.0775, 'grad_norm': 3.4421682357788086e-06, 'learning_rate': 6.79611650485437e-07, 'rewards/chosen': -16.205768585205078, 'rewards/rejected': -38.98480224609375, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 22.779033660888672, 'logps/chosen': -307.8979797363281, 'logps/rejected': -502.49041748046875, 'logits/chosen': -2.063363552093506, 'logits/rejected': -3.2290992736816406, 'epoch': 0.93}

 93%|█████████▎| 1441/1545 [12:44<00:58,  1.79it/s]
 93%|█████████▎| 1442/1545 [12:44<00:56,  1.81it/s]
 93%|█████████▎| 1443/1545 [12:45<00:55,  1.85it/s]
 93%|█████████▎| 1444/1545 [12:46<00:55,  1.83it/s]
 94%|█████████▎| 1445/1545 [12:46<00:55,  1.81it/s]
 94%|█████████▎| 1446/1545 [12:47<00:51,  1.93it/s]
 94%|█████████▎| 1447/1545 [12:47<00:52,  1.87it/s]
 94%|█████████▎| 1448/1545 [12:48<00:52,  1.83it/s]
 94%|█████████▍| 1449/1545 [12:48<00:53,  1.78it/s]
 94%|█████████▍| 1450/1545 [12:49<00:51,  1.84it/s]
                                                   
{'loss': 0.0001, 'grad_norm': 1.2168044349891716e-13, 'learning_rate': 6.148867313915858e-07, 'rewards/chosen': -18.26565933227539, 'rewards/rejected': -48.65673828125, 'rewards/accuracies': 1.0, 'rewards/margins': 30.391077041625977, 'logps/chosen': -316.24761962890625, 'logps/rejected': -603.7527465820312, 'logits/chosen': -2.136662006378174, 'logits/rejected': -3.263352870941162, 'epoch': 0.94}

 94%|█████████▍| 1451/1545 [12:49<00:53,  1.75it/s]
 94%|█████████▍| 1452/1545 [12:50<00:53,  1.74it/s]
 94%|█████████▍| 1453/1545 [12:50<00:49,  1.85it/s]
 94%|█████████▍| 1454/1545 [12:51<00:51,  1.77it/s]
 94%|█████████▍| 1455/1545 [12:52<00:50,  1.78it/s]
 94%|█████████▍| 1456/1545 [12:53<01:13,  1.21it/s]
 94%|█████████▍| 1457/1545 [12:54<01:06,  1.32it/s]
 94%|█████████▍| 1458/1545 [12:54<00:56,  1.55it/s]
 94%|█████████▍| 1459/1545 [12:55<00:53,  1.60it/s]
 94%|█████████▍| 1460/1545 [12:55<00:48,  1.75it/s]
                                                   
{'loss': 0.0, 'grad_norm': 2.0469737016526324e-16, 'learning_rate': 5.501618122977346e-07, 'rewards/chosen': -17.234251022338867, 'rewards/rejected': -46.855690002441406, 'rewards/accuracies': 1.0, 'rewards/margins': 29.621444702148438, 'logps/chosen': -348.11370849609375, 'logps/rejected': -579.3179321289062, 'logits/chosen': -1.4622868299484253, 'logits/rejected': -3.198202133178711, 'epoch': 0.94}

 95%|█████████▍| 1461/1545 [12:56<00:49,  1.69it/s]
 95%|█████████▍| 1462/1545 [12:56<00:49,  1.69it/s]
 95%|█████████▍| 1463/1545 [12:57<00:47,  1.72it/s]
 95%|█████████▍| 1464/1545 [12:57<00:45,  1.77it/s]
 95%|█████████▍| 1465/1545 [12:58<00:45,  1.75it/s]
 95%|█████████▍| 1466/1545 [12:59<00:44,  1.76it/s]
 95%|█████████▍| 1467/1545 [12:59<00:41,  1.89it/s]
 95%|█████████▌| 1468/1545 [13:00<00:42,  1.81it/s]
 95%|█████████▌| 1469/1545 [13:00<00:41,  1.81it/s]
 95%|█████████▌| 1470/1545 [13:01<00:40,  1.83it/s]
                                                   
{'loss': 0.0, 'grad_norm': 1.4921397450962104e-13, 'learning_rate': 4.854368932038835e-07, 'rewards/chosen': -14.240861892700195, 'rewards/rejected': -46.95703887939453, 'rewards/accuracies': 1.0, 'rewards/margins': 32.7161750793457, 'logps/chosen': -306.51873779296875, 'logps/rejected': -581.2691040039062, 'logits/chosen': -1.777173638343811, 'logits/rejected': -3.262760877609253, 'epoch': 0.95}

 95%|█████████▌| 1471/1545 [13:01<00:35,  2.08it/s]
 95%|█████████▌| 1472/1545 [13:02<00:37,  1.96it/s]
 95%|█████████▌| 1473/1545 [13:02<00:37,  1.91it/s]
 95%|█████████▌| 1474/1545 [13:03<00:36,  1.94it/s]
 95%|█████████▌| 1475/1545 [13:03<00:36,  1.91it/s]
 96%|█████████▌| 1476/1545 [13:04<00:37,  1.86it/s]
 96%|█████████▌| 1477/1545 [13:04<00:37,  1.81it/s]
 96%|█████████▌| 1478/1545 [13:05<00:34,  1.93it/s]
 96%|█████████▌| 1479/1545 [13:05<00:35,  1.86it/s]
 96%|█████████▌| 1480/1545 [13:06<00:35,  1.84it/s]
                                                   
{'loss': 0.001, 'grad_norm': 5.186961971048731e-13, 'learning_rate': 4.207119741100324e-07, 'rewards/chosen': -15.928776741027832, 'rewards/rejected': -49.93461608886719, 'rewards/accuracies': 1.0, 'rewards/margins': 34.00584030151367, 'logps/chosen': -290.5811767578125, 'logps/rejected': -614.055419921875, 'logits/chosen': -2.113398551940918, 'logits/rejected': -3.1592159271240234, 'epoch': 0.96}

 96%|█████████▌| 1481/1545 [13:06<00:34,  1.83it/s]
 96%|█████████▌| 1482/1545 [13:07<00:34,  1.85it/s]
 96%|█████████▌| 1483/1545 [13:08<00:34,  1.82it/s]
 96%|█████████▌| 1484/1545 [13:08<00:33,  1.83it/s]
 96%|█████████▌| 1485/1545 [13:09<00:31,  1.93it/s]
 96%|█████████▌| 1486/1545 [13:09<00:31,  1.88it/s]
 96%|█████████▌| 1487/1545 [13:10<00:31,  1.84it/s]
 96%|█████████▋| 1488/1545 [13:10<00:31,  1.83it/s]
 96%|█████████▋| 1489/1545 [13:11<00:29,  1.87it/s]
 96%|█████████▋| 1490/1545 [13:11<00:30,  1.80it/s]
                                                   
{'loss': 0.3793, 'grad_norm': 8.836247705756861e-18, 'learning_rate': 3.5598705501618125e-07, 'rewards/chosen': -12.482555389404297, 'rewards/rejected': -39.87023162841797, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 27.387676239013672, 'logps/chosen': -264.73980712890625, 'logps/rejected': -496.36474609375, 'logits/chosen': -1.77366042137146, 'logits/rejected': -3.205986499786377, 'epoch': 0.96}

 97%|█████████▋| 1491/1545 [13:12<00:30,  1.77it/s]
 97%|█████████▋| 1492/1545 [13:12<00:28,  1.89it/s]
 97%|█████████▋| 1493/1545 [13:13<00:28,  1.83it/s]
 97%|█████████▋| 1494/1545 [13:14<00:28,  1.82it/s]
 97%|█████████▋| 1495/1545 [13:14<00:27,  1.82it/s]
 97%|█████████▋| 1496/1545 [13:15<00:26,  1.84it/s]
 97%|█████████▋| 1497/1545 [13:15<00:26,  1.81it/s]
 97%|█████████▋| 1498/1545 [13:16<00:25,  1.81it/s]
 97%|█████████▋| 1499/1545 [13:16<00:23,  1.95it/s]
 97%|█████████▋| 1500/1545 [13:17<00:23,  1.89it/s]
                                                   
{'loss': 0.0, 'grad_norm': 4.729550084903167e-14, 'learning_rate': 2.9126213592233014e-07, 'rewards/chosen': -16.625436782836914, 'rewards/rejected': -45.96239471435547, 'rewards/accuracies': 1.0, 'rewards/margins': 29.336956024169922, 'logps/chosen': -287.1039733886719, 'logps/rejected': -579.9110107421875, 'logits/chosen': -2.09769868850708, 'logits/rejected': -3.411703109741211, 'epoch': 0.97}

 97%|█████████▋| 1501/1545 [13:17<00:23,  1.85it/s]
 97%|█████████▋| 1502/1545 [13:18<00:21,  2.04it/s]
 97%|█████████▋| 1503/1545 [13:18<00:19,  2.12it/s]
 97%|█████████▋| 1504/1545 [13:19<00:20,  2.00it/s]
 97%|█████████▋| 1505/1545 [13:19<00:20,  1.94it/s]
 97%|█████████▋| 1506/1545 [13:20<00:20,  1.91it/s]
 98%|█████████▊| 1507/1545 [13:20<00:18,  2.04it/s]
 98%|█████████▊| 1508/1545 [13:21<00:19,  1.93it/s]
 98%|█████████▊| 1509/1545 [13:21<00:17,  2.11it/s]
 98%|█████████▊| 1510/1545 [13:22<00:17,  2.02it/s]
                                                   
{'loss': 0.0693, 'grad_norm': 4.041939973831177e-07, 'learning_rate': 2.26537216828479e-07, 'rewards/chosen': -13.022150039672852, 'rewards/rejected': -43.39824676513672, 'rewards/accuracies': 0.8999999761581421, 'rewards/margins': 30.3760929107666, 'logps/chosen': -266.9024963378906, 'logps/rejected': -551.7422485351562, 'logits/chosen': -1.764814019203186, 'logits/rejected': -3.2696170806884766, 'epoch': 0.98}

 98%|█████████▊| 1511/1545 [13:22<00:16,  2.11it/s]
 98%|█████████▊| 1512/1545 [13:23<00:16,  1.94it/s]
 98%|█████████▊| 1513/1545 [13:23<00:16,  1.91it/s]
 98%|█████████▊| 1514/1545 [13:24<00:16,  1.92it/s]
 98%|█████████▊| 1515/1545 [13:24<00:15,  1.90it/s]
 98%|█████████▊| 1516/1545 [13:25<00:15,  1.86it/s]
 98%|█████████▊| 1517/1545 [13:25<00:15,  1.84it/s]
 98%|█████████▊| 1518/1545 [13:26<00:13,  2.02it/s]
 98%|█████████▊| 1519/1545 [13:26<00:11,  2.20it/s]
 98%|█████████▊| 1520/1545 [13:27<00:12,  2.02it/s]
                                                   
{'loss': 0.0007, 'grad_norm': 6.28125, 'learning_rate': 1.6181229773462782e-07, 'rewards/chosen': -14.050992965698242, 'rewards/rejected': -44.175750732421875, 'rewards/accuracies': 1.0, 'rewards/margins': 30.124755859375, 'logps/chosen': -316.6773376464844, 'logps/rejected': -552.1802978515625, 'logits/chosen': -1.1631816625595093, 'logits/rejected': -3.36537504196167, 'epoch': 0.98}

 98%|█████████▊| 1521/1545 [13:27<00:12,  1.94it/s]
 99%|█████████▊| 1522/1545 [13:28<00:11,  1.93it/s]
 99%|█████████▊| 1523/1545 [13:28<00:11,  1.91it/s]
 99%|█████████▊| 1524/1545 [13:29<00:11,  1.85it/s]
 99%|█████████▊| 1525/1545 [13:29<00:09,  2.05it/s]
 99%|█████████▉| 1526/1545 [13:30<00:09,  2.09it/s]
 99%|█████████▉| 1527/1545 [13:30<00:09,  1.93it/s]
 99%|█████████▉| 1528/1545 [13:31<00:09,  1.87it/s]
 99%|█████████▉| 1529/1545 [13:32<00:08,  1.85it/s]
 99%|█████████▉| 1530/1545 [13:32<00:07,  1.97it/s]
                                                   
{'loss': 0.0, 'grad_norm': 6.714628852932947e-13, 'learning_rate': 9.70873786407767e-08, 'rewards/chosen': -18.606670379638672, 'rewards/rejected': -48.73549270629883, 'rewards/accuracies': 1.0, 'rewards/margins': 30.12882423400879, 'logps/chosen': -354.6220703125, 'logps/rejected': -609.2911376953125, 'logits/chosen': -1.5294870138168335, 'logits/rejected': -3.3688862323760986, 'epoch': 0.99}

 99%|█████████▉| 1531/1545 [13:33<00:07,  1.88it/s]
 99%|█████████▉| 1532/1545 [13:33<00:06,  1.86it/s]
 99%|█████████▉| 1533/1545 [13:33<00:05,  2.06it/s]
 99%|█████████▉| 1534/1545 [13:34<00:05,  2.16it/s]
 99%|█████████▉| 1535/1545 [13:34<00:05,  1.99it/s]
 99%|█████████▉| 1536/1545 [13:35<00:04,  1.90it/s]
 99%|█████████▉| 1537/1545 [13:36<00:04,  1.93it/s]
100%|█████████▉| 1538/1545 [13:36<00:03,  1.81it/s]
100%|█████████▉| 1539/1545 [13:37<00:03,  1.73it/s]
100%|█████████▉| 1540/1545 [13:37<00:02,  1.77it/s]
                                                   
{'loss': 1.0373, 'grad_norm': 9.86623976961809e-18, 'learning_rate': 3.2362459546925574e-08, 'rewards/chosen': -16.71319007873535, 'rewards/rejected': -44.19614028930664, 'rewards/accuracies': 0.800000011920929, 'rewards/margins': 27.482952117919922, 'logps/chosen': -343.7029724121094, 'logps/rejected': -564.3067016601562, 'logits/chosen': -1.7413575649261475, 'logits/rejected': -2.7366671562194824, 'epoch': 1.0}

100%|█████████▉| 1541/1545 [13:38<00:02,  1.77it/s]
100%|█████████▉| 1542/1545 [13:39<00:01,  1.73it/s]
100%|█████████▉| 1543/1545 [13:39<00:01,  1.72it/s]
100%|█████████▉| 1544/1545 [13:40<00:00,  1.82it/s]
100%|██████████| 1545/1545 [13:40<00:00,  1.79it/s]
                                                   
{'train_runtime': 832.5695, 'train_samples_per_second': 1.856, 'train_steps_per_second': 1.856, 'train_loss': 0.6777341416610262, 'epoch': 1.0}

100%|██████████| 1545/1545 [13:52<00:00,  1.79it/s]
100%|██████████| 1545/1545 [13:52<00:00,  1.86it/s]