language:
  - en
license: llama3.1
library_name: transformers
tags:
  - mergekit
  - merge
base_model:
  - meta-llama/Meta-Llama-3.1-70B-Instruct
  - NousResearch/Hermes-3-Llama-3.1-70B
  - abacusai/Dracarys-Llama-3.1-70B-Instruct
  - VAGOsolutions/Llama-3.1-SauerkrautLM-70b-Instruct
model-index:
  - name: Brinebreath-Llama-3.1-70B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 55.33
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 55.46
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 29.98
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 12.86
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 17.49
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 46.62
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard

Brinebreath-Llama-3.1-70B

I made this merge after running into some problems with Cathallama. It has behaved well over several days of testing.

Notable Performance

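  • 7 percentage point increase in overall MMLU-PRO success rate over Meta-Llama-3.1-70B-Instruct at Q4_0 (49.0% vs. 42.0%)
  • Higher score than Meta-Llama-3.1-70B-Instruct in all 14 MMLU-PRO categories tested
  • Passed 11 of 14 manual test cases, versus 7 of 14 for Meta-Llama-3.1-70B-Instruct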

Creation workflow

Models merged

  • meta-llama/Meta-Llama-3.1-70B-Instruct
  • NousResearch/Hermes-3-Llama-3.1-70B
  • abacusai/Dracarys-Llama-3.1-70B-Instruct
  • VAGOsolutions/Llama-3.1-SauerkrautLM-70b-Instruct
flowchart TD
    A[Hermes 3] -->|Merge with| B[Meta-Llama-3.1]
    C[Dracarys] -->|Merge with| D[Meta-Llama-3.1]
    B -->| | E[Merge]
    D -->| | E[Merge]
    G[SauerkrautLM] -->|Merge with| E[Merge]
    E[Merge] -->| | F[Brinebreath]

Testing

Hyperparameters

  • Temperature: 0.0 for automated, 0.9 for manual
  • Penalize repeat sequence: 1.05
  • Consider N tokens for penalize: 256
  • Penalize repetition of newlines
  • Top-K sampling: 40
  • Top-P sampling: 0.95
  • Min-P sampling: 0.05
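
To make the interaction of the three truncation settings above concrete, here is a minimal pure-Python sketch of top-k, top-p, and min-p filtering applied to a toy distribution. The function name `filter_candidates` and the order of the filters are assumptions for illustration; llama.cpp's own sampler chain may order or implement these steps differently, and the repetition penalty is not modeled here.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_candidates(logits, top_k=40, top_p=0.95, min_p=0.05):
    """Hypothetical helper: return {token_id: prob} for the tokens that
    survive top-k, then top-p, then min-p, renormalized to sum to 1.
    The filter order is an assumption, not llama.cpp's confirmed order."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)

    # Top-K: keep only the k most likely tokens.
    ranked = ranked[:top_k]

    # Top-P: keep the shortest prefix whose cumulative probability
    # (renormalized over the top-k survivors) reaches top_p.
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p / total
        if cumulative >= top_p:
            break

    # Min-P: drop tokens less likely than min_p times the top token.
    threshold = min_p * kept[0][1]
    kept = [(tok, p) for tok, p in kept if p >= threshold]

    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


# Toy 5-token vocabulary with a flat-ish head and a dominant head.
flat = filter_candidates([math.log(p) for p in (0.6, 0.3, 0.07, 0.02, 0.01)])
peaked = filter_candidates([math.log(p) for p in (0.9, 0.04, 0.03, 0.02, 0.01)])
print(sorted(flat), sorted(peaked))
```

With a temperature of 0.0 decoding is normally greedy, so these filters mainly affect the manual runs at temperature 0.9.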

llama.cpp Version

  • Build: b3600-1-g2339a0be
  • Flags: -fa -ngl -1 -ctk f16 --no-mmap

Tested Files

  • Brinebreath-Llama-3.1-70B.Q4_0.gguf
  • Meta-Llama-3.1-70B-Instruct.Q4_0.gguf

Manual testing

| Category | Test Case | Brinebreath-Llama-3.1-70B.Q4_0.gguf | Meta-Llama-3.1-70B-Instruct.Q4_0.gguf |
|---|---|---|---|
| Common Sense | Ball on cup | OK | OK |
| | Big duck small horse | OK | OK |
| | Killers | OK | OK |
| | Strawberry r's | KO | KO |
| | 9.11 or 9.9 bigger | KO | KO |
| | Dragon or lens | KO | KO |
| | Shirts | OK | KO |
| | Sisters | OK | KO |
| | Jane faster | OK | OK |
| Programming | JSON | OK | OK |
| | Python snake game | OK | KO |
| Math | Door window combination | OK | KO |
| Smoke | Poem | OK | OK |
| | Story | OK | OK |

Note: See sample_generations.txt on the main folder of the repo for the raw generations.
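
For a quick summary of the manual results, the short sketch below tallies the OK/KO outcomes per model. The data is transcribed from the table above; the variable names are only illustrative.

```python
# (category, test case, Brinebreath result, Meta-Llama result),
# transcribed from the manual testing table.
results = [
    ("Common Sense", "Ball on cup", "OK", "OK"),
    ("Common Sense", "Big duck small horse", "OK", "OK"),
    ("Common Sense", "Killers", "OK", "OK"),
    ("Common Sense", "Strawberry r's", "KO", "KO"),
    ("Common Sense", "9.11 or 9.9 bigger", "KO", "KO"),
    ("Common Sense", "Dragon or lens", "KO", "KO"),
    ("Common Sense", "Shirts", "OK", "KO"),
    ("Common Sense", "Sisters", "OK", "KO"),
    ("Common Sense", "Jane faster", "OK", "OK"),
    ("Programming", "JSON", "OK", "OK"),
    ("Programming", "Python snake game", "OK", "KO"),
    ("Math", "Door window combination", "OK", "KO"),
    ("Smoke", "Poem", "OK", "OK"),
    ("Smoke", "Story", "OK", "OK"),
]

brinebreath_ok = sum(1 for row in results if row[2] == "OK")
meta_llama_ok = sum(1 for row in results if row[3] == "OK")

# Test cases where only one of the two models passed.
only_brinebreath = [row[1] for row in results if row[2] == "OK" and row[3] == "KO"]
only_meta_llama = [row[1] for row in results if row[2] == "KO" and row[3] == "OK"]

print(f"Brinebreath: {brinebreath_ok}/{len(results)} OK")
print(f"Meta-Llama:  {meta_llama_ok}/{len(results)} OK")
print("Passed only by Brinebreath:", only_brinebreath)
```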

MMLU-PRO

| Model | Success % |
|---|---|
| Brinebreath-3.1-70B.Q4_0.gguf | 49.0% |
| Meta-Llama-3.1-70B-Instruct.Q4_0.gguf | 42.0% |

| MMLU-PRO category | Brinebreath-3.1-70B.Q4_0.gguf | Meta-Llama-3.1-70B-Instruct.Q4_0.gguf |
|---|---|---|
| Business | 45.0% | 40.0% |
| Law | 40.0% | 35.0% |
| Psychology | 85.0% | 80.0% |
| Biology | 80.0% | 75.0% |
| Chemistry | 50.0% | 45.0% |
| History | 65.0% | 60.0% |
| Other | 55.0% | 50.0% |
| Health | 70.0% | 65.0% |
| Economics | 80.0% | 75.0% |
| Math | 35.0% | 30.0% |
| Physics | 45.0% | 40.0% |
| Computer Science | 60.0% | 55.0% |
| Philosophy | 50.0% | 45.0% |
| Engineering | 45.0% | 40.0% |

Note: MMLU-PRO overall was tested with 100 questions. Categories were tested with 20 questions each.
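
As a rough guide to how much precision a 100-question sample gives, the sketch below computes a 95% Wilson score interval for each overall MMLU-PRO success rate. The choice of the Wilson interval and the helper name `wilson_interval` are my assumptions for illustration; this was not part of the original testing.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.
    z=1.96 corresponds to roughly 95% confidence."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Overall MMLU-PRO results from the table above: 49 and 42 correct out of 100.
brinebreath_ci = wilson_interval(49, 100)
meta_llama_ci = wilson_interval(42, 100)

for name, (low, high) in [("Brinebreath", brinebreath_ci), ("Meta-Llama", meta_llama_ci)]:
    print(f"{name}: {low:.1%} to {high:.1%}")
```

The two intervals come out at roughly 39% to 59% and 33% to 52%, so they overlap. With 20 questions per category, each 5.0% step in the category table corresponds to a single question.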

PubmedQA

| Model Name | Success % |
|---|---|
| Brinebreath-3.1-70B.Q4_0.gguf | 71.00% |
| Meta-Llama-3.1-70B-Instruct.Q4_0.gguf | 68.00% |

Note: PubmedQA tested with 100 questions.

Request

If you are hiring in the EU or can sponsor a visa, PM me :D

PS. Thank you mradermacher for the GGUFs!

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 36.29 |
| IFEval (0-Shot) | 55.33 |
| BBH (3-Shot) | 55.46 |
| MATH Lvl 5 (4-Shot) | 29.98 |
| GPQA (0-shot) | 12.86 |
| MuSR (0-shot) | 17.49 |
| MMLU-PRO (5-shot) | 46.62 |
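
The Avg. value is consistent with an unweighted mean of the six benchmark scores. A quick check, assuming that is how it was computed:

```python
# Leaderboard scores as listed in the table above.
scores = {
    "IFEval (0-Shot)": 55.33,
    "BBH (3-Shot)": 55.46,
    "MATH Lvl 5 (4-Shot)": 29.98,
    "GPQA (0-shot)": 12.86,
    "MuSR (0-shot)": 17.49,
    "MMLU-PRO (5-shot)": 46.62,
}

# Unweighted mean, rounded to two decimals like the reported value.
average = round(sum(scores.values()) / len(scores), 2)
print(average)
```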