Kunoichi-DPO-v2-7B / README.md
leaderboard-pr-bot's picture
Adding Evaluation Results
717bfdf verified
|
raw
history blame
7.18 kB
metadata
license: cc-by-nc-4.0
model-index:
  - name: Kunoichi-DPO-v2-7B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 54.31
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 20.9
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 6.57
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 6.15
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 11.09
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 23.41
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
          name: Open LLM Leaderboard
Model MT Bench EQ Bench MMLU Logic Test
GPT-4-Turbo 9.32 - - -
GPT-4 8.99 62.52 86.4 0.86
Kunoichi-DPO-v2-7B 8.51 42.18 64.94 0.58
Mixtral-8x7B-Instruct 8.30 44.81 70.6 0.75
Kunoichi-DPO-7B 8.29 41.60 64.83 0.59
Kunoichi-7B 8.14 44.32 64.9 0.58
Starling-7B 8.09 - 63.9 0.51
Claude-2 8.06 52.14 78.5 -
Silicon-Maid-7B 7.96 40.44 64.7 0.54
Loyal-Macaroni-Maid-7B 7.95 38.66 64.9 0.57
GPT-3.5-Turbo 7.94 50.28 70 0.57
Claude-1 7.9 - 77 -
Openchat-3.5 7.81 37.08 64.3 0.39
Dolphin-2.6-DPO 7.74 42.88 61.9 0.53
Zephyr-7B-beta 7.34 38.71 61.4 0.30
Llama-2-70b-chat-hf 6.86 51.56 63 -
Neural-chat-7b-v3-1 6.84 43.61 62.4 0.30
Model Average AGIEval GPT4All TruthfulQA Bigbench
Kunoichi-DPO-7B 58.4 45.08 74 66.99 47.52
Kunoichi-DPO-v2-7B 58.31 44.85 75.05 65.69 47.65
Kunoichi-7B 57.54 44.99 74.86 63.72 46.58
OpenPipe/mistral-ft-optimized-1218 56.85 44.74 75.6 59.89 47.17
Silicon-Maid-7B 56.45 44.74 74.26 61.5 45.32
mlabonne/NeuralHermes-2.5-Mistral-7B 53.51 43.67 73.24 55.37 41.76
teknium/OpenHermes-2.5-Mistral-7B 52.42 42.75 72.99 52.99 40.94
openchat/openchat_3.5 51.34 42.67 72.92 47.27 42.51
berkeley-nest/Starling-LM-7B-alpha 51.16 42.06 72.72 47.33 42.53
HuggingFaceH4/zephyr-7b-beta 50.99 37.33 71.83 55.1 39.7
Model AlpacaEval2 Length
GPT-4 23.58% 1365
GPT-4 0314 22.07% 1371
Mistral Medium 21.86% 1500
Mixtral 8x7B v0.1 18.26% 1465
Kunoichi-DPO-v2 17.19% 1785
Claude 2 17.19% 1069
Claude 16.99% 1082
Gemini Pro 16.85% 1315
GPT-4 0613 15.76% 1140
Claude 2.1 15.73% 1096
Mistral 7B v0.2 14.72% 1676
GPT 3.5 Turbo 0613 14.13% 1328
LLaMA2 Chat 70B 13.87% 1790
LMCocktail-10.7B-v1 13.15% 1203
WizardLM 13B V1.1 11.23% 1525
Zephyr 7B Beta 10.99% 1444
OpenHermes-2.5-Mistral (7B) 10.34% 1107
GPT 3.5 Turbo 0301 9.62% 827
Kunoichi-7B 9.38% 1492
GPT 3.5 Turbo 1106 9.18% 796
GPT-3.5 8.56% 1018
Phi-2 DPO 7.76% 1687
LLaMA2 Chat 13B 7.70% 1513

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 20.41
IFEval (0-Shot) 54.31
BBH (3-Shot) 20.90
MATH Lvl 5 (4-Shot) 6.57
GPQA (0-shot) 6.15
MuSR (0-shot) 11.09
MMLU-PRO (5-shot) 23.41