bleysg committed
Commit ffa5716
1 Parent(s): 9840951

Update README.md

Files changed (1)
  1. README.md +42 -18
README.md CHANGED
@@ -19,13 +19,13 @@ datasets:
 We have used our own [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca) to fine-tune Llama2-13B using [OpenChat](https://huggingface.co/openchat) packing and conditional behavior cloning.
 This dataset is our attempt to reproduce the dataset generated for Microsoft Research's [Orca Paper](https://arxiv.org/abs/2306.02707).

- This second preview release is trained on a curated filtered subset of most of our GPT4 augmented data.
+ This second preview release is trained on a curated filtered subset of most of our GPT-4 augmented data.

 This release highlights that our dataset and training methods have surpassed performance parity with the Orca paper.
- We measured this with BigBench-Hard and AGIEval results with the same methods as used in the Orca paper, finding ~103% of original Orca's performance on average.
- As well, this is done with ~1/10th the compute requirement and using <20% of the dataset size from the original Orca paper.
+ We measured this with BigBench-Hard and AGIEval results with the same methods as used in the Orca paper, finding **~103%** of original Orca's performance on average.
+ As well, this is done with <1/10th the compute requirement and using <20% of the dataset size from the original Orca paper.

- We have run extensive evaluations internally and expect this model to place number 1 on both the HuggingFaceH4 Open LLM Leaderboard and the GPT4ALL Leaderboard for 13B models.
+ We have run extensive evaluations internally and expect this model to **place number 1** on both the HuggingFaceH4 Open LLM Leaderboard and the GPT4ALL Leaderboard for 13B models.

 "One" of [OpenChat](https://huggingface.co/openchat) has joined our team, and we'd like to provide special thanks for their training of this model!
 We have utilized OpenChat conditional behavior cloning and [MultiPack algorithm](https://github.com/imoneoi/multipack_sampler) which achieves 99.85% bin-packing efficiency on our dataset.
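For intuition on what that packing-efficiency figure measures, here is a deliberately simplified first-fit-decreasing sketch of sequence bin-packing. It is not the MultiPack sampler itself, and the token counts are invented for the example:

```python
# Deliberately simplified first-fit-decreasing bin packing, only to illustrate
# what "bin-packing efficiency" means for packed training sequences.
# This is NOT the MultiPack sampler; the token counts below are invented.

from typing import List

def pack_sequences(lengths: List[int], max_len: int = 4096) -> List[List[int]]:
    """Greedily place sequence lengths into bins of capacity max_len."""
    bins: List[List[int]] = []
    for length in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + length <= max_len:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

example_lengths = [3800, 2100, 1900, 1200, 950, 600, 410, 230]
bins = pack_sequences(example_lengths)
efficiency = sum(example_lengths) / (len(bins) * 4096)
print(f"{len(bins)} bins, packing efficiency {efficiency:.2%}")  # ~91% for this toy example
```

Higher packing efficiency means fewer padding tokens per batch, which is where most of the training-throughput gain comes from.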
@@ -54,46 +54,58 @@ We have evaluated **OpenOrcaxOpenChat-Preview2-13B** on hard reasoning tasks fro

 Our average performance for BigBench-Hard: 0.488

- Average for AGIEval: 0.441
+ Average for AGIEval: 0.447

 In the Orca paper, they measured their score relative to Vicuna on these evals.
- We've done the same and have found our score averages to >103% of the total improvement that was shown in the Orca paper, using the same evaluation methods as outlined in the paper.
+ We have done the same and have found our score averages to **~103%** of the total improvement that was shown in the Orca paper, using the same evaluation methods as outlined in the paper.

- So we are surpassing Orca performance with <20% of the dataset size and ~1/10th the training budget!
+ So we are surpassing Orca performance with <20% of the dataset size and <1/10th the training budget!

- ## BigBench-Hard Performance
-
- ![OpenOrca Preview2 BigBench-Hard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_BigBenchHard.png "BigBench-Hard Performance")
+ As well, we have evaluated using the methodology and tools for the HuggingFace Leaderboard and GPT4ALL Leaderboard, and find that we place #1 on both for all 13B models at release time!

 ## AGIEval Performance

- ![OpenOrca Preview2 AGIEval Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_AGIEval.png "AGIEval Performance")
+ We present our results in two columns.
+ The column for "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
+ The column for "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with settings outlined by HuggingFace. These results are not comparable to the other columns, as the methods are different.
+
+ ![OpenOrca Preview2 AGIEval Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2AGIEval.png "AGIEval Performance")
+
+ ## BigBench-Hard Performance
+
+ We present our results in two columns.
+ The column for "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
+ The column for "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with settings outlined by HuggingFace. These results are not comparable to the other columns, as the methods are different.
+
+ ![OpenOrca Preview2 BigBench-Hard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2BigBenchHardEval.png "BigBench-Hard Performance")

 ## HuggingFaceH4 Open LLM Leaderboard Performance

 We have run our own tests using parameters matching the [HuggingFaceH4 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) evals.
- We find
-
- ![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_HFLeaderboard.png "GPT4ALL Performance")
+
+ We place #1 for all 13B models at release time!
+
+ ![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2HuggingFaceLeaderboard.png "HuggingFace Leaderboard Performance")

 ## GPT4ALL Leaderboard Performance

 We have tested using parameters matching the GPT4ALL Benchmark Suite and report our results and placement vs their official reporting below.
- We place #1 for all open models and come within comparison of text-davinci-003, a proprietary model an order of magnitude larger.
-
- ![OpenOrca Preview2 GPT4ALL Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_AGIEval.png "GPT4ALL Performance")
+
+ We place #1 for all open models and are competitive with `text-davinci-003`, a proprietary OpenAI model an order of magnitude larger.
+
+ ![OpenOrca Preview2 GPT4ALL Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2GPT4ALL_Leaderboard.png "GPT4ALL Performance")


 # Dataset

 We used a curated, filtered selection of most of the GPT-4 augmented data from our OpenOrca dataset, which aims to reproduce the Orca Research Paper dataset.
- Further details of our curation practices will be forthcoming with our full model release.
+ Further details of our curation practices will be forthcoming with our full model releases.


 # Training

- We trained with 8x A100-80G GPUs for 46 hours, completing 5 epochs of full fine tuning on our dataset.
- This contrasts with the 20x A100-80G GPUs for 200 hours used in the Orca paper, for only 3 epochs.
+ We trained with 8x A100-80G GPUs for 46 hours, completing 5 epochs of full fine-tuning on our dataset in one training run.
+ This contrasts with the 20x A100-80G GPUs for 200 hours used in the Orca paper, for only 3 epochs, and requiring stacked training (which is known to suffer catastrophic forgetting).
 Our compute requirement was <1/10th that of the original Orca.
 Commodity cost was ~$600.

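As a quick sanity check on the compute comparison above, counting raw GPU-hours (and ignoring differences in throughput and training setup):

```python
# Back-of-the-envelope GPU-hour comparison for the two training runs above.
# This ignores hardware and throughput differences; it only sanity-checks
# the "<1/10th the compute" claim.

preview2_gpu_hours = 8 * 46      # 8x A100-80G for 46 hours  -> 368 GPU-hours
orca_paper_gpu_hours = 20 * 200  # 20x A100-80G for 200 hours -> 4000 GPU-hours

ratio = preview2_gpu_hours / orca_paper_gpu_hours
print(f"{preview2_gpu_hours} vs {orca_paper_gpu_hours} GPU-hours -> {ratio:.1%}")  # 9.2%, i.e. under 1/10th
```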
@@ -116,6 +128,18 @@ tokenize("User: Hello<|end_of_turn|>Assistant: Hi<|end_of_turn|>User: How are yo
 # Result: [1, 4911, 29901, 15043, 32000, 4007, 22137, 29901, 6324, 32000, 4911, 29901, 1128, 526, 366, 9826, 29973, 32000, 4007, 22137, 29901]
 ```

+ For UIs with Prefix and Suffix fields, these will likely work:
+
+ Prefix (include a space after colon):
+ ```
+ User: 
+ ```
+
+ Suffix (space after colon):
+ ```
+ <|end_of_turn|>\nAssistant: 
+ ```
+

 # Serving

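For serving or quick testing, here is a minimal sketch of assembling the conversation format shown above. It assumes the tokenizer shipped with this model repo registers `<|end_of_turn|>` as a special token, and the helper function is illustrative rather than an official API of the repository:

```python
# Minimal sketch of building the OpenChat-style prompt this model expects.
# Assumes the tokenizer shipped with the model repo; build_prompt is an
# illustrative helper, not part of the repository itself.

from transformers import AutoTokenizer

MODEL_ID = "Open-Orca/OpenOrcaxOpenChat-Preview2-13B"

def build_prompt(turns):
    """turns: list of (speaker, text) pairs, e.g. [("User", "Hello")]."""
    prompt = "".join(f"{speaker}: {text}<|end_of_turn|>" for speaker, text in turns)
    # The trailing "Assistant:" cues the model to generate the next reply.
    return prompt + "Assistant:"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
prompt = build_prompt([("User", "Hello"), ("Assistant", "Hi"), ("User", "How are you today?")])
print(prompt)
print(tokenizer(prompt).input_ids)
```

The printed prompt matches the tokenize example above, so the resulting input ids should line up with the `# Result:` line shown there when the repo's tokenizer is used.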
145