bleysg committed
Commit ffa5716
1 Parent(s): 9840951

Update README.md

Files changed (1)
  1. README.md +42 -18
README.md CHANGED
@@ -19,13 +19,13 @@ datasets:
 We have used our own [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca) to fine-tune Llama2-13B using [OpenChat](https://huggingface.co/openchat) packing and conditional behavior cloning.
 This dataset is our attempt to reproduce the dataset generated for Microsoft Research's [Orca Paper](https://arxiv.org/abs/2306.02707).

- This second preview release is trained on a curated filtered subset of most of our GPT4 augmented data.
+ This second preview release is trained on a curated filtered subset of most of our GPT-4 augmented data.

 This release highlights that our dataset and training methods have surpassed performance parity with the Orca paper.
- We measured this with BigBench-Hard and AGIEval results with the same methods as used in the Orca paper, finding ~103% of original Orca's performance on average.
- As well, this is done with ~1/10th the compute requirement and using <20% of the dataset size from the original Orca paper.
+ We measured this with BigBench-Hard and AGIEval results with the same methods as used in the Orca paper, finding **~103%** of original Orca's performance on average.
+ As well, this is done with <1/10th the compute requirement and using <20% of the dataset size from the original Orca paper.

- We have run extensive evaluations internally and expect this model to place number 1 on both the HuggingFaceH4 Open LLM Leaderboard and the GPT4ALL Leaderboard for 13B models.
+ We have run extensive evaluations internally and expect this model to **place number 1** on both the HuggingFaceH4 Open LLM Leaderboard and the GPT4ALL Leaderboard for 13B models.

 "One" of [OpenChat](https://huggingface.co/openchat) has joined our team, and we'd like to provide special thanks for their training of this model!
 We have utilized OpenChat conditional behavior cloning and [MultiPack algorithm](https://github.com/imoneoi/multipack_sampler) which achieves 99.85% bin-packing efficiency on our dataset.
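For intuition on what that packing-efficiency figure measures, here is a deliberately simplified first-fit-decreasing sketch of sequence bin-packing. It is not the MultiPack sampler itself, and the token counts are invented for the example:

```python
# Deliberately simplified first-fit-decreasing bin packing, only to illustrate
# what "bin-packing efficiency" means for packed training sequences.
# This is NOT the MultiPack sampler; the token counts below are invented.

from typing import List

def pack_sequences(lengths: List[int], max_len: int = 4096) -> List[List[int]]:
    """Greedily place sequence lengths into bins of capacity max_len."""
    bins: List[List[int]] = []
    for length in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + length <= max_len:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

example_lengths = [3800, 2100, 1900, 1200, 950, 600, 410, 230]
bins = pack_sequences(example_lengths)
efficiency = sum(example_lengths) / (len(bins) * 4096)
print(f"{len(bins)} bins, packing efficiency {efficiency:.2%}")  # ~91% for this toy example
```

Higher packing efficiency means fewer padding tokens per batch, which is where most of the training-throughput gain comes from.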
@@ -54,46 +54,58 @@ We have evaluated **OpenOrcaxOpenChat-Preview2-13B** on hard reasoning tasks fro

 Our average performance for BigBench-Hard: 0.488

- Average for AGIEval: 0.441
+ Average for AGIEval: 0.447

 In the Orca paper, they measured their score relative to Vicuna on these evals.
- We've done the same and have found our score averages to >103% of the total improvement that was shown in the Orca paper, using the same evaluation methods as outlined in the paper.
+ We have done the same and have found our score averages to **~103%** of the total improvement that was shown in the Orca paper, using the same evaluation methods as outlined in the paper.

- So we are surpassing Orca performance with <20% of the dataset size and ~1/10th the training budget!
+ So we are surpassing Orca performance with <20% of the dataset size and <1/10th the training budget!

- ## BigBench-Hard Performance
-
- ![OpenOrca Preview2 BigBench-Hard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_BigBenchHard.png "BigBench-Hard Performance")
+ As well, we have evaluated using the methodology and tools for the HuggingFace Leaderboard and GPT4ALL Leaderboard, and find that we place #1 on both for all 13B models at release time!

 ## AGIEval Performance

- ![OpenOrca Preview2 AGIEval Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_AGIEval.png "AGIEval Performance")
+ We present our results in two columns.
+ The column for "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
+ The column for "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with settings outlined by HuggingFace. These results are not comparable to the other columns, as the methods are different.
+
+ ![OpenOrca Preview2 AGIEval Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2AGIEval.png "AGIEval Performance")
+
+ ## BigBench-Hard Performance
+
+ We present our results in two columns.
+ The column for "`(Orca Paper eval)`" uses the methods outlined in the Orca paper, so as to be a direct apples-to-apples comparison with the results from the paper.
+ The column for "`(HF Leaderboard eval)`" uses EleutherAI's LM Evaluation Harness with settings outlined by HuggingFace. These results are not comparable to the other columns, as the methods are different.
+
+ ![OpenOrca Preview2 BigBench-Hard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2BigBenchHardEval.png "BigBench-Hard Performance")

 ## HuggingFaceH4 Open LLM Leaderboard Performance

 We have run our own tests using parameters matching the [HuggingFaceH4 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) evals.
- We find
-
- ![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_HFLeaderboard.png "GPT4ALL Performance")
+
+ We place #1 for all 13B models at release time!
+
+ ![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2HuggingFaceLeaderboard.png "HuggingFace Leaderboard Performance")

 ## GPT4ALL Leaderboard Performance

 We have tested using parameters matching the GPT4ALL Benchmark Suite and report our results and placement vs their official reporting below.
- We place #1 for all open models and come within comparison of text-davinci-003, a proprietary model an order of magnitude larger.
-
- ![OpenOrca Preview2 GPT4ALL Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/OO_Preview2_AGIEval.png "GPT4ALL Performance")
+
+ We place #1 for all open models and are competitive with `text-davinci-003`, a proprietary OpenAI model an order of magnitude larger.
+
+ ![OpenOrca Preview2 GPT4ALL Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2GPT4ALL_Leaderboard.png "GPT4ALL Performance")


 # Dataset

 We used a curated, filtered selection of most of the GPT-4 augmented data from our OpenOrca dataset, which aims to reproduce the Orca Research Paper dataset.
- Further details of our curation practices will be forthcoming with our full model release.
+ Further details of our curation practices will be forthcoming with our full model releases.


 # Training

- We trained with 8x A100-80G GPUs for 46 hours, completing 5 epochs of full fine tuning on our dataset.
- This contrasts with the 20x A100-80G GPUs for 200 hours used in the Orca paper, for only 3 epochs.
+ We trained with 8x A100-80G GPUs for 46 hours, completing 5 epochs of full fine-tuning on our dataset in one training run.
+ This contrasts with the 20x A100-80G GPUs for 200 hours used in the Orca paper, for only 3 epochs, and requiring stacked training (which is known to suffer catastrophic forgetting).
 Our compute requirement was <1/10th that of the original Orca.
 Commodity cost was ~$600.

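As a quick sanity check on the compute comparison above, counting raw GPU-hours (and ignoring differences in throughput and training setup):

```python
# Back-of-the-envelope GPU-hour comparison for the two training runs above.
# This ignores hardware and throughput differences; it only sanity-checks
# the "<1/10th the compute" claim.

preview2_gpu_hours = 8 * 46      # 8x A100-80G for 46 hours  -> 368 GPU-hours
orca_paper_gpu_hours = 20 * 200  # 20x A100-80G for 200 hours -> 4000 GPU-hours

ratio = preview2_gpu_hours / orca_paper_gpu_hours
print(f"{preview2_gpu_hours} vs {orca_paper_gpu_hours} GPU-hours -> {ratio:.1%}")  # 9.2%, i.e. under 1/10th
```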
@@ -116,6 +128,18 @@ tokenize("User: Hello<|end_of_turn|>Assistant: Hi<|end_of_turn|>User: How are yo
 # Result: [1, 4911, 29901, 15043, 32000, 4007, 22137, 29901, 6324, 32000, 4911, 29901, 1128, 526, 366, 9826, 29973, 32000, 4007, 22137, 29901]
 ```

+ For UIs with Prefix and Suffix fields, these will likely work:
+
+ Prefix (include a space after colon):
+ ```
+ User: 
+ ```
+
+ Suffix (space after colon):
+ ```
+ <|end_of_turn|>\nAssistant: 
+ ```
+

 # Serving

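For serving or quick testing, here is a minimal sketch of assembling the conversation format shown above. It assumes the tokenizer shipped with this model repo registers `<|end_of_turn|>` as a special token, and the helper function is illustrative rather than an official API of the repository:

```python
# Minimal sketch of building the OpenChat-style prompt this model expects.
# Assumes the tokenizer shipped with the model repo; build_prompt is an
# illustrative helper, not part of the repository itself.

from transformers import AutoTokenizer

MODEL_ID = "Open-Orca/OpenOrcaxOpenChat-Preview2-13B"

def build_prompt(turns):
    """turns: list of (speaker, text) pairs, e.g. [("User", "Hello")]."""
    prompt = "".join(f"{speaker}: {text}<|end_of_turn|>" for speaker, text in turns)
    # The trailing "Assistant:" cues the model to generate the next reply.
    return prompt + "Assistant:"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
prompt = build_prompt([("User", "Hello"), ("Assistant", "Hi"), ("User", "How are you today?")])
print(prompt)
print(tokenizer(prompt).input_ids)
```

The printed prompt matches the tokenize example above, so the resulting input ids should line up with the `# Result:` line shown there when the repo's tokenizer is used.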
145