Evaluation is performed against 4 popular benchmarks:
- AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
- HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
- MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- Truthful QA MC (0-shot) - a benchmark to measure whether a language model is truthful in generating answers to questions.

We chose these benchmarks because they test reasoning and general knowledge across a wide range of fields in 0-shot and few-shot settings.
""") with gr.Row(): leaderboard_table = gr.components.Dataframe(value=leaderboard, headers=COLS, datatype=TYPES, max_rows=5) with gr.Row(): gr.Markdown(f""" # Evaluation Queue for the LMEH benchmarks, these models will be automatically evaluated on the 🤗 cluster """) with gr.Accordion("Evaluation Queue", open=False): with gr.Row(): eval_table = gr.components.Dataframe(value=eval_queue, headers=EVAL_COLS, datatype=EVAL_TYPES, max_rows=5) with gr.Row(): refresh_button = gr.Button("Refresh") refresh_button.click(refresh, inputs=[], outputs=[leaderboard_table, eval_table]) with gr.Accordion("Submit a new model for evaluation"): # with gr.Row(): # gr.Markdown(f"""# Submit a new model for evaluation""") with gr.Row(): with gr.Column(): model_name_textbox = gr.Textbox(label="Model name") revision_name_textbox = gr.Textbox(label="revision", placeholder="main") with gr.Column(): is_8bit_toggle = gr.Checkbox(False, label="8 bit eval", visible=not IS_PUBLIC) private = gr.Checkbox(False, label="Private", visible=not IS_PUBLIC) is_delta_weight = gr.Checkbox(False, label="Delta weights") base_model_name_textbox = gr.Textbox(label="base model (for delta)") with gr.Row(): submit_button = gr.Button("Submit Eval") submit_button.click(add_new_eval, [model_name_textbox, base_model_name_textbox, revision_name_textbox, is_8bit_toggle, private, is_delta_weight]) print("adding refresh leaderboard") def refresh_leaderboard(): leaderboard_table = get_leaderboard() eval_table = get_eval_table() print("refreshing leaderboard") scheduler = BackgroundScheduler() scheduler.add_job(func=refresh_leaderboard, trigger="interval", seconds=300) # refresh every 5 mins scheduler.start() block.launch()