leaderboard-pr-bot
committed
Commit 0cbd480
Parent(s): 9c8f78a

Adding Evaluation Results

This is an automated PR created with https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr
The purpose of this PR is to add evaluation results from the Open LLM Leaderboard to your model card.
If you encounter any issues, please report them to https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr/discussions
README.md
CHANGED
@@ -1,5 +1,108 @@
---
license: mit
+model-index:
+- name: ASTS-PFAF
+  results:
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: AI2 Reasoning Challenge (25-Shot)
+      type: ai2_arc
+      config: ARC-Challenge
+      split: test
+      args:
+        num_few_shot: 25
+    metrics:
+    - type: acc_norm
+      value: 61.26
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: HellaSwag (10-Shot)
+      type: hellaswag
+      split: validation
+      args:
+        num_few_shot: 10
+    metrics:
+    - type: acc_norm
+      value: 82.94
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MMLU (5-Shot)
+      type: cais/mmlu
+      config: all
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 58.96
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: TruthfulQA (0-shot)
+      type: truthful_qa
+      config: multiple_choice
+      split: validation
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: mc2
+      value: 43.74
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: Winogrande (5-shot)
+      type: winogrande
+      config: winogrande_xl
+      split: validation
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 76.87
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: GSM8k (5-shot)
+      type: gsm8k
+      config: main
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 23.81
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
+      name: Open LLM Leaderboard
---
Ok so this guy offers [this challenge](https://www.reddit.com/r/ArtificialInteligence/comments/1akestf/day_3_prove_i_am_full_of_bs_and_my_dataset_doesnt/) and I don't actually have a lot going on in my life right now. So I'm like, fine. Your idea looks interesting. I have no idea why you're spamming it. It does not appear you make any money from this. Why would you offer to pay for our fine-tuning if we don't like the results after fine-tuning on your data? Does this thing trojan-horse in some crazy thing that lets you control all robots later even though it improves performance now? I dunno. I don't even know if I'm doing this right. It says to fine-tune your model on it, but I don't know if that means make my model first and then fine-tune using his thing, or if I can just sprinkle it into mine and cross my fingers. I'm just going to sprinkle in his data and cross my fingers.

@@ -34,4 +137,17 @@ I talked to it. It's ok I guess. I'm a little suspicious of its ability to liter

I trained from my workstation. I have 2x 3090s and an AMD 5900x. Chicago power is 15¢/kWh. Each 3090 draws about 350 watts and the rest of the system probably draws maybe 200 watts or so. But then my room gets hot and I have to turn on the overhead fan and kick on the HVAC vent fan with the windows open, or else my place gets really hot even in the middle of winter. We'll call it a kilowatt even, since we're not billing wear and tear on the cards. I think you have to depreciate those by time anyway and not usage, at least for tax purposes. Anyway, dataset prep and training took about 3 hours in total. Looking at raw data sizes, the PFAF data was about 500kb and my data around 2.1mb. So if we calculate that out, we get 3 * 0.15 * (500/(2100+500)) = 0.0865 for the portion of the fine-tuning bill attributable to PFAF (someone check my math. I'm stoned.). I feel like this guy owes me 9 cents, but I'm not gonna be petty about it. You can't give fractions of a penny, so we'll call it 8 cents. If the scores don't improve.

-(We'll see probably tomorrow or so, if the leaderboard updates, whether this dataset does anything worth exploring just by dumping it in as the guy suggested. Compare it to TacoBeLLM and Palworld-SME-13b on the leaderboard for bots I made similar ways.)
+(We'll see probably tomorrow or so, if the leaderboard updates, whether this dataset does anything worth exploring just by dumping it in as the guy suggested. Compare it to TacoBeLLM and Palworld-SME-13b on the leaderboard for bots I made similar ways.)
+
+# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_ericpolewski__ASTS-PFAF).
+
+| Metric                            | Value |
+|-----------------------------------|------:|
+| Avg.                              | 57.93 |
+| AI2 Reasoning Challenge (25-Shot) | 61.26 |
+| HellaSwag (10-Shot)               | 82.94 |
+| MMLU (5-Shot)                     | 58.96 |
+| TruthfulQA (0-shot)               | 43.74 |
+| Winogrande (5-shot)               | 76.87 |
+| GSM8k (5-shot)                    | 23.81 |
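For anyone who wants to double-check the card's back-of-the-envelope numbers (the roughly 9-cent PFAF share of the power bill, and the leaderboard "Avg." as a plain mean of the six benchmark scores), here's a quick sketch in Python. The wattages and the 15¢/kWh rate are the card's own estimates, not measured values:

```python
# Sanity-check the training-cost attribution and the leaderboard average.

# Power-draw estimates from the card: two 3090s at ~350 W each plus
# ~200 W for the rest of the system, rounded up to "a kilowatt even".
system_kw = round((2 * 350 + 200) / 1000)  # 0.9 kW, rounded to 1

hours = 3                     # dataset prep + training, in total
rate_usd_per_kwh = 0.15       # Chicago power, per the card
pfaf_kb, own_kb = 500, 2100   # raw dataset sizes

# Portion of the fine-tuning bill attributable to the PFAF data.
pfaf_share = pfaf_kb / (pfaf_kb + own_kb)
pfaf_cost_usd = hours * rate_usd_per_kwh * system_kw * pfaf_share
print(f"PFAF share of the bill: ${pfaf_cost_usd:.4f}")  # -> $0.0865, i.e. 8-9 cents

# The leaderboard "Avg." is the mean of the six benchmark scores.
scores = [61.26, 82.94, 58.96, 43.74, 76.87, 23.81]
avg = sum(scores) / len(scores)
print(f"Average: {avg:.2f}")  # -> 57.93
```

So the math in the card checks out: $0.0865 rounds up to 9 cents (or down to 8), and the six scores do average to 57.93.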