liang.zhao committed
Commit 1177f07
Parent(s): b0a400b
update model and config
README.md CHANGED
@@ -19,18 +19,13 @@ pipeline_tag: text-generation
# Training Details


-Skywork-Critic-Llama3.1-8B is built on Meta [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and fine-tuned on a variety of high-quality datasets, including:
-
-1. [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2)
-2. [OffsetBias](https://huggingface.co/datasets/NCSOFT/offsetbias)
-3. [WildGuard (adversarial)](https://huggingface.co/allenai/wildguard)
-4. Magpie DPO series:
-   - [Ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1)
-   - [Pro (Llama-3.1)](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1)
-   - [Pro](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-DPO-100K-v0.1)
-   - [Air](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-DPO-100K-v0.1)
-(We use a high-quality subset of this data collection. For more details, please refer to our [Skywork-Reward-Preference-80K-v0.1 dataset](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1).)
-Additionally, the model is trained on in-house human annotation data, synthetic data similar to the [**self-taught**](https://arxiv.org/abs/2408.02666) approach, and critic-related chat data. The training employs instruction-tuning methodology, focusing on pairwise preference evaluation and general chat tasks. We have conducted a thorough verification process to ensure our training dataset does not contain any test set information from RewardBench, maintaining the integrity of our evaluation results.
+Skywork-Critic-Llama3.1-8B is built on Meta [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and has undergone fine-tuning using a diverse array of high-quality datasets, including:
+- **Cleaned open-source data**: We use a high-quality subset of [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2), [OffsetBias](https://huggingface.co/datasets/NCSOFT/offsetbias), [WildGuard (adversarial)](https://huggingface.co/allenai/wildguard), and the Magpie DPO series ([Ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1), [Pro (Llama-3.1)](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1), [Pro](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-DPO-100K-v0.1), [Air](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-DPO-100K-v0.1)). For more details, please refer to our [Skywork-Reward-Preference-80K-v0.1 dataset](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1).
+- **In-house human annotation data**: This includes both pointwise scoring across many dimensions for a single response and pairwise comparisons between two responses. Each dimension incorporates a rationale for the assigned score.
+- **Synthetic critic data**: We use an approach similar to [**self-taught**](https://arxiv.org/abs/2408.02666). Specifically, we employ two methods to generate an inferior response for a given instruction: 1) creating a similar instruction and then generating a response to it, and 2) introducing subtle errors into a high-quality response.
+- **Critic-related chat data**: We incorporate critic-related chat data to maintain the model's conversational capabilities.
+
+The training employs instruction-tuning methodology, focusing on pairwise preference evaluation and general chat tasks. We have conducted a thorough verification process to ensure our training dataset does not contain any test set information from RewardBench, maintaining the integrity of our evaluation results.


# RewardBench Leaderboard for Generative Models
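As an illustration of the "synthetic critic data" described in the hunk above, here is a minimal sketch of the two inferior-response generation methods. The `generate` helper is hypothetical, a stand-in for any instruction-following LLM call; nothing below is Skywork's actual training code.

```python
# Minimal sketch of the two synthetic-data methods described above.
# `generate` is a hypothetical stand-in for an instruction-following
# LLM call; it is not part of the Skywork release.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")


def inferior_via_similar_instruction(instruction: str) -> str:
    """Method 1: answer a *similar* instruction, yielding a response
    that is plausible but subtly off-target for the original one."""
    similar = generate(
        "Rewrite this instruction so it asks for something slightly "
        f"different:\n{instruction}"
    )
    return generate(similar)


def inferior_via_subtle_errors(instruction: str, good_response: str) -> str:
    """Method 2: corrupt a high-quality response with small mistakes."""
    return generate(
        "Introduce a few subtle factual or logical errors into the "
        "response below, keeping its style intact.\n\n"
        f"Instruction: {instruction}\nResponse: {good_response}"
    )


# Each (instruction, good_response, inferior_response) triple then yields a
# pairwise preference example in which the original response is preferred.
```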
@@ -138,7 +133,7 @@ If you find our work helpful, please feel free to cite us using the following Bi
```bibtex
@misc{skyworkcritic2024,
    title={Skywork Critic Model Series},
-    author={Shiwen, Tu and Liang, Zhao and Liu, Chris Yuhao and Zeng, Liang and Liu Yang},
+    author={Shiwen, Tu and Liang, Zhao and Liu, Chris Yuhao and Zeng, Liang and Liu, Yang},
    year={2024},
    month={September},
    howpublished={\url{https://huggingface.co/Skywork}},
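For readers who want to inspect the cleaned open-source preference data referenced in the Training Details above, here is a quick sketch using the `datasets` library. The `chosen`/`rejected` column names are an assumption; verify the schema on the dataset page before relying on them.

```python
# Peek at the preference data referenced above. The chosen/rejected
# field names are assumed here, not confirmed; check the dataset card.
from datasets import load_dataset

ds = load_dataset("Skywork/Skywork-Reward-Preference-80K-v0.1", split="train")
print(ds.column_names)          # confirm the actual schema first
example = ds[0]
print(example.get("chosen"))    # preferred conversation (assumed field)
print(example.get("rejected"))  # dispreferred conversation (assumed field)
```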
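Finally, since training focuses on pairwise preference evaluation, here is a minimal sketch of querying the model for an A/B judgment with the standard `transformers` chat API. The judging prompt is illustrative only and is not the official template from this model card.

```python
# Pairwise preference evaluation sketch using the standard transformers
# chat API. The judging prompt is illustrative, not the official template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Skywork/Skywork-Critic-Llama3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Given an instruction and two responses, state which response is "
    "better and briefly explain why.\n\n"
    "Instruction: Explain photosynthesis in one sentence.\n"
    "Response A: Plants use sunlight, water, and CO2 to make glucose and oxygen.\n"
    "Response B: Photosynthesis is when plants eat sunlight for energy somehow."
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```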