gx-ai-architect
committed on
Update README.md
README.md
CHANGED
@@ -72,7 +72,7 @@ The prompts space for preference tuning were uniformly sampled by source from th
 
 The preference tuned version of Merlinite-7B-pt shows overall all performance enhancement across the board, with no alignment tax observed, as shown in our evaluation. Surprisingly, we find improvements in mathematical ability measured by GSM8K and MT-Bench, which differs from studies observing decreased math/reasoning after RLHF alignment.
 
-We also observe a clear correlation between the Mixtral DPO reward scores and MT-Bench scores, as
+We also observe a clear correlation between the Mixtral DPO reward scores and MT-Bench scores, as shown in chart above. The reward score of Best-of-N sampled batch keeps improving til Rejection Sampling Round-2. Model saturates at Rejection sampling round 3, no longer giving improvements on either MT-Bench or Mixtral-DPO rewards.
 
 The final Merlinite-7B-pt is the peak checkpoint measured by both Batch-Reward and MT-Bench.
 