Update README.md
README.md
CHANGED
@@ -21,7 +21,7 @@ pipeline_tag: text-generation
Skywork-Critic-Llama3.1-70B and Skywork-Critic-Llama3.1-8B are built on Meta [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) and [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), respectively. Both models were fine-tuned on a diverse set of high-quality datasets, including:
- **Cleaned open-source data**: We use a high-quality subset of [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2), [OffsetBias](https://huggingface.co/datasets/NCSOFT/offsetbias), [WildGuard (adversarial)](https://huggingface.co/allenai/wildguard), and the Magpie DPO series ([Ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1), [Pro (Llama-3.1)](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1), [Pro](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-DPO-100K-v0.1), [Air](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-DPO-100K-v0.1)). For more details, please refer to our [Skywork-Reward-Preference-80K-v0.1 dataset](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1). Additionally, we integrate several open-source, high-quality critic datasets such as [Open-Critic-GPT](https://huggingface.co/datasets/Vezora/Open-Critic-GPT) into our training process.
-
- **In-house human annotation data**: This includes both pointwise scoring across many dimensions for a single response and pairwise comparisons between two responses. Each dimension incorporates a rationale for the assigned score.
- **Synthetic critic data**: We use an approach similar to [**self-taught**](https://arxiv.org/abs/2408.02666). Specifically, we employ two methods to generate inferior responses for a given instruction: 1) creating a similar instruction and then generating a response for this new instruction; 2) introducing subtle errors into high-quality responses.
- **Critic-related chat data**: We incorporate critic-related chat data to maintain the model's conversational capabilities.
@@ -34,11 +34,13 @@ We evaluate our models on [RewardBench](https://huggingface.co/spaces/allenai/re
As of September 2024, Skywork-Critic-Llama3.1-70B **ranks first** on RewardBench for generative models across all sizes, while Skywork-Critic-Llama3.1-8B tops the list for generative models under 10B parameters. (Note: An asterisk (*) indicates an open-source model.)

| Model | Chat | Chat Hard | Safety | Reasoning | Overall Score |
| ------------------------------- | :---: | :-------: | :----: | :-------: | :---: |
| **Skywork-Critic-Llama3.1-70B** * | **96.6** | **87.9** | **93.1** | **95.5** | **93.3** |
| Salesforce/SFR-LLaMa-3.1-70B-Judge-r | 96.9 | 84.8 | 91.6 | 97.6 | 92.7 |
| Salesforce/SFR-nemo-12B-Judge-r | 97.2 | 82.2 | 86.5 | 95.1 | 90.3 |
| **Skywork-Critic-Llama3.1-8B** * | **93.6** | **81.4** | **91.1** | **89.8** | **89.0** |
| Salesforce/SFR-LLaMa-3.1-8B-Judge-r | 95.5 | 77.7 | 86.2 | 95.1 | 88.7 |
| facebook/Self-taught-Llama-3-70B | 96.9 | 84.0 | 91.1 | 82.5 | 88.6 |
@@ -50,6 +52,8 @@ As of September 2024, Skywork-Critic-Llama3.1-70B **ranks first** on RewardBench
| meta-llama/Meta-Llama-3.1-70B-Instruct * | 97.2 | 70.2 | 82.8 | 86.0 | 84.0 |
| NCSOFT/Llama-3-OffsetBias-8B * | 92.5 | 80.3 | 86.8 | 76.4 | 84.0 |

# Demo Code
Below is an example of obtaining the model's critique of two conversations.
Skywork-Critic-Llama3.1-70B and Skywork-Critic-Llama3.1-8B are built on Meta [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) and [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), respectively. Both models were fine-tuned on a diverse set of high-quality datasets, including:
- **Cleaned open-source data**: We use a high-quality subset of [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2), [OffsetBias](https://huggingface.co/datasets/NCSOFT/offsetbias), [WildGuard (adversarial)](https://huggingface.co/allenai/wildguard), and the Magpie DPO series ([Ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1), [Pro (Llama-3.1)](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1), [Pro](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-DPO-100K-v0.1), [Air](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-DPO-100K-v0.1)). For more details, please refer to our [Skywork-Reward-Preference-80K-v0.1 dataset](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1). Additionally, we integrate several open-source, high-quality critic datasets such as [Open-Critic-GPT](https://huggingface.co/datasets/Vezora/Open-Critic-GPT) into our training process.
+
- **In-house human annotation data**: This includes both pointwise scoring across many dimensions for a single response and pairwise comparisons between two responses. Each dimension incorporates a rationale for the assigned score. Note that manually labeled data is very expensive to obtain: we have only a few hundred manually labeled data points, all of them in Chinese, so the model's ability to rate a single response may not be particularly strong.
- **Synthetic critic data**: We use an approach similar to [**self-taught**](https://arxiv.org/abs/2408.02666). Specifically, we employ two methods to generate inferior responses for a given instruction: 1) creating a similar instruction and then generating a response for this new instruction; 2) introducing subtle errors into high-quality responses.
- **Critic-related chat data**: We incorporate critic-related chat data to maintain the model's conversational capabilities.
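
The two methods for generating inferior responses, described under synthetic critic data, can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for training: the `generate` callable, the `PreferencePair` type, the prompt wording, and the stub model are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to generated text.
Generate = Callable[[str], str]


@dataclass
class PreferencePair:
    instruction: str
    chosen: str    # the high-quality response
    rejected: str  # the synthetic inferior response
    method: str    # which of the two methods produced `rejected`


def pair_from_similar_instruction(
    instruction: str, good_response: str, generate: Generate
) -> PreferencePair:
    """Method 1: answer a similar but different instruction and use that
    answer as the inferior response to the original instruction."""
    similar = generate(
        f"Write an instruction similar to, but different from:\n{instruction}"
    )
    rejected = generate(similar)
    return PreferencePair(instruction, good_response, rejected, "similar_instruction")


def pair_from_subtle_errors(
    instruction: str, good_response: str, generate: Generate
) -> PreferencePair:
    """Method 2: rewrite a high-quality response so it contains subtle errors."""
    rejected = generate(
        f"Rewrite the response below, introducing subtle errors:\n{good_response}"
    )
    return PreferencePair(instruction, good_response, rejected, "subtle_errors")


def stub_model(prompt: str) -> str:
    """Placeholder standing in for a real language model call."""
    return f"<generated for: {prompt[:40]}>"


pairs = [
    pair_from_similar_instruction("Name the capital of France.", "Paris.", stub_model),
    pair_from_subtle_errors("Name the capital of France.", "Paris.", stub_model),
]
```

In practice `stub_model` would be replaced by a call to an actual generator, and the resulting pairs would likely need filtering, since a response to a similar instruction is not guaranteed to be worse.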
As of September 2024, Skywork-Critic-Llama3.1-70B **ranks first** on RewardBench for generative models across all sizes, while Skywork-Critic-Llama3.1-8B tops the list for generative models under 10B parameters. (Note: An asterisk (*) indicates an open-source model.)
+
| Model | Chat | Chat Hard | Safety | Reasoning | Overall Score |
| ------------------------------- | :---: | :-------: | :----: | :-------: | :---: |
| **Skywork-Critic-Llama3.1-70B** * | **96.6** | **87.9** | **93.1** | **95.5** | **93.3** |
| Salesforce/SFR-LLaMa-3.1-70B-Judge-r | 96.9 | 84.8 | 91.6 | 97.6 | 92.7 |
| Salesforce/SFR-nemo-12B-Judge-r | 97.2 | 82.2 | 86.5 | 95.1 | 90.3 |
+
| **Skywork-Critic-Llama3.1-70B** # | **94.4** | **82.9** | **89.7** | **90.2** | **89.3** |
| **Skywork-Critic-Llama3.1-8B** * | **93.6** | **81.4** | **91.1** | **89.8** | **89.0** |
| Salesforce/SFR-LLaMa-3.1-8B-Judge-r | 95.5 | 77.7 | 86.2 | 95.1 | 88.7 |
| facebook/Self-taught-Llama-3-70B | 96.9 | 84.0 | 91.1 | 82.5 | 88.6 |

| meta-llama/Meta-Llama-3.1-70B-Instruct * | 97.2 | 70.2 | 82.8 | 86.0 | 84.0 |
| NCSOFT/Llama-3-OffsetBias-8B * | 92.5 | 80.3 | 86.8 | 76.4 | 84.0 |
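
The Overall Score column appears to be the unweighted mean of the four category scores. That is an assumption rather than something stated here, but the Skywork rows in the table are consistent with it up to rounding, as this short check suggests (values copied from the table):

```python
# (Chat, Chat Hard, Safety, Reasoning) scores and the reported overall score,
# copied from the Skywork rows of the table.
rows = {
    "Skywork-Critic-Llama3.1-70B *": ([96.6, 87.9, 93.1, 95.5], 93.3),
    "Skywork-Critic-Llama3.1-70B #": ([94.4, 82.9, 89.7, 90.2], 89.3),
    "Skywork-Critic-Llama3.1-8B *": ([93.6, 81.4, 91.1, 89.8], 89.0),
}


def mean_score(scores):
    """Unweighted mean of the category scores (assumed aggregation)."""
    return sum(scores) / len(scores)


for name, (scores, reported) in rows.items():
    print(f"{name}: mean={mean_score(scores):.3f}, reported={reported}")
```

Small differences between the computed mean and the reported value are expected, because the category scores shown are themselves rounded to one decimal place.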
+
For the Skywork-Critic-Llama3.1-70B model, we tested two types of prompts. The first simply asks the model to decide whether the response from model A or model B is better. The second, marked with # in the table, requires the model not only to choose the better response but also to provide specific reasoning. Surprisingly, the first approach yielded higher accuracy (an overall score of 93.3 versus 89.3). Generating accurate critique explanations remains a challenge for the critic model and will be a key focus of our future research.
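
To make the difference between the two prompt types concrete, here is a minimal sketch. The template wording, the `[[A]]`/`[[B]]` verdict markers, and the helper names `build_prompt` and `parse_verdict` are assumptions made for illustration; they are not the prompts used in the evaluation.

```python
import re
from typing import Optional

# Hypothetical template 1: ask only for a verdict.
CHOICE_ONLY_TEMPLATE = (
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}\n\n"
    "Which response is better? Answer with [[A]] or [[B]] only."
)

# Hypothetical template 2: ask for reasoning first, then a verdict.
CHOICE_WITH_REASONING_TEMPLATE = (
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}\n\n"
    "Explain which response is better and why, "
    "then end with your verdict as [[A]] or [[B]]."
)


def build_prompt(template: str, instruction: str, response_a: str, response_b: str) -> str:
    """Fill one of the templates with an instruction and two candidate responses."""
    return template.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )


def parse_verdict(output: str) -> Optional[str]:
    """Return 'A' or 'B' from the last [[A]]/[[B]] marker in the output, or None."""
    matches = re.findall(r"\[\[([AB])\]\]", output)
    return matches[-1] if matches else None


prompt = build_prompt(
    CHOICE_WITH_REASONING_TEMPLATE,
    "Name the capital of France.",
    "Paris.",
    "Lyon.",
)
verdict = parse_verdict("Response A is correct; Lyon is not the capital. [[A]]")
```

Taking the last marker in the output is a deliberate choice for the reasoning-style prompt, where the explanation may mention both labels before the final verdict.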
+
# Demo Code
Below is an example of obtaining the model's critique of two conversations.