metadata

language:
  - en
license: llama3.1
library_name: transformers
tags:
  - mergekit
  - merge
base_model:
  - meta-llama/Meta-Llama-3.1-70B-Instruct
  - NousResearch/Hermes-3-Llama-3.1-70B
  - abacusai/Dracarys-Llama-3.1-70B-Instruct
  - VAGOsolutions/Llama-3.1-SauerkrautLM-70b-Instruct
model-index:
  - name: Brinebreath-Llama-3.1-70B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 55.33
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 55.46
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 29.98
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 12.86
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 17.49
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 46.62
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
          name: Open LLM Leaderboard

Brinebreath-Llama-3.1-70B

I made this merge after running into some problems with Cathallama. It has behaved well over several days of testing.

Notable Performance

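7 percentage point increase in overall MMLU-PRO success rate over Llama 3.1 70B Instruct at Q4_0 (49.0% vs. 42.0% on 100 questions)
5 percentage points higher than Llama 3.1 70B Instruct in every tested MMLU-PRO category (20 questions per category)
Passed 11 of 14 manual test cases, compared with 7 of 14 for Llama 3.1 70B Instruct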

Creation workflow

Models merged

meta-llama/Meta-Llama-3.1-70B-Instruct
NousResearch/Hermes-3-Llama-3.1-70B
abacusai/Dracarys-Llama-3.1-70B-Instruct
VAGOsolutions/Llama-3.1-SauerkrautLM-70b-Instruct

flowchart TD
    A[Hermes 3] -->|Merge with| B[Meta-Llama-3.1]
    C[Dracarys] -->|Merge with| D[Meta-Llama-3.1]
    B -->| | E[Merge]
    D -->| | E[Merge]
    G[SauerkrautLM] -->|Merge with| E[Merge]
    E[Merge] -->| | F[Brinebreath]
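
The flowchart can also be read as a dependency graph. The sketch below only encodes the edges shown in the chart and derives one valid order of steps from them. It says nothing about merge methods or weights, which are not given here, and the `predecessors` dictionary and its node labels are a hypothetical representation chosen for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the nodes in its set. Edges are copied from the flowchart;
# the labels in parentheses only distinguish the two Meta-Llama-3.1 nodes.
predecessors = {
    "Meta-Llama-3.1 (with Hermes 3)": {"Hermes 3"},
    "Meta-Llama-3.1 (with Dracarys)": {"Dracarys"},
    "Merge": {
        "Meta-Llama-3.1 (with Hermes 3)",
        "Meta-Llama-3.1 (with Dracarys)",
        "SauerkrautLM",
    },
    "Brinebreath": {"Merge"},
}

# static_order() yields every node after all of its predecessors.
order = list(TopologicalSorter(predecessors).static_order())
for step, node in enumerate(order, start=1):
    print(step, node)
```

The exact order of the independent source models may vary between runs, but Brinebreath always comes last.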

Testing

Hyperparameters

Temperature: 0.0 for automated, 0.9 for manual
Penalize repeat sequence: 1.05
Consider N tokens for penalize: 256
Penalize repetition of newlines
Top-K sampling: 40
Top-P sampling: 0.95
Min-P sampling: 0.05
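
For scripted runs against a llama.cpp server, the settings above could be written as a request payload. This is a sketch under assumptions: the field names are assumed to match the sampling parameters of the llama.cpp server `/completion` endpoint for builds of this period, the mapping from the labels above to those fields is a best guess, and `build_payload` is a hypothetical helper, not part of any library.

```python
import json

def build_payload(prompt, automated=True):
    """Build a completion request body with the sampling settings listed above.

    Hypothetical helper; field names are assumed llama.cpp server parameters.
    """
    return {
        "prompt": prompt,
        "temperature": 0.0 if automated else 0.9,  # 0.0 automated, 0.9 manual
        "repeat_penalty": 1.05,   # "Penalize repeat sequence"
        "repeat_last_n": 256,     # "Consider N tokens for penalize"
        "penalize_nl": True,      # "Penalize repetition of newlines"
        "top_k": 40,
        "top_p": 0.95,
        "min_p": 0.05,
    }

payload = build_payload("Why is the sky blue?")
print(json.dumps(payload, indent=2))
```

The dictionary can be sent as JSON to whichever endpoint is in use; newer llama.cpp builds may have renamed or dropped some of these fields.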

llama.cpp Version

b3600-1-g2339a0be
-fa -ngl -1 -ctk f16 --no-mmap

Tested Files

Brinebreath-Llama-3.1-70B.Q4_0.gguf
Meta-Llama-3.1-70B-Instruct.Q4_0.gguf

Manual testing

Category	Test Case	Brinebreath-Llama-3.1-70B.Q4_0.gguf	Meta-Llama-3.1-70B-Instruct.Q4_0.gguf
Common Sense	Ball on cup	OK	OK
	Big duck small horse	OK	OK
	Killers	OK	OK
	Strawberry r's	KO	KO
	9.11 or 9.9 bigger	KO	KO
	Dragon or lens	KO	KO
	Shirts	OK	KO
	Sisters	OK	KO
	Jane faster	OK	OK
Programming	JSON	OK	OK
	Python snake game	OK	KO
Math	Door window combination	OK	KO
Smoke	Poem	OK	OK
	Story	OK	OK

Note: See sample_generations.txt in the root folder of the repo for the raw generations.

MMLU-PRO

Model	Success %
Brinebreath-Llama-3.1-70B.Q4_0.gguf	49.0%
Meta-Llama-3.1-70B-Instruct.Q4_0.gguf	42.0%

MMLU-PRO category	Brinebreath-Llama-3.1-70B.Q4_0.gguf	Meta-Llama-3.1-70B-Instruct.Q4_0.gguf
Business	45.0%	40.0%
Law	40.0%	35.0%
Psychology	85.0%	80.0%
Biology	80.0%	75.0%
Chemistry	50.0%	45.0%
History	65.0%	60.0%
Other	55.0%	50.0%
Health	70.0%	65.0%
Economics	80.0%	75.0%
Math	35.0%	30.0%
Physics	45.0%	40.0%
Computer Science	60.0%	55.0%
Philosophy	50.0%	45.0%
Engineering	45.0%	40.0%

Note: The overall MMLU-PRO score was tested with 100 questions. Each category was tested with 20 questions.
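
Because the question counts are small, it helps to translate the percentages back into questions. The sketch below does that arithmetic for the overall run and for one category, using only numbers from the tables above; `compare` is a hypothetical helper name.

```python
def compare(new_pct, base_pct, n_questions):
    """Return (percentage-point gain, relative gain in %, extra correct answers)."""
    points = new_pct - base_pct
    relative = points / base_pct * 100
    extra_questions = round(points / 100 * n_questions)
    return points, relative, extra_questions

overall = compare(49.0, 42.0, 100)   # overall run, 100 questions
business = compare(45.0, 40.0, 20)   # Business category, 20 questions

print(f"Overall:  +{overall[0]:.1f} points, +{overall[1]:.1f}% relative, "
      f"{overall[2]} more correct")
print(f"Business: +{business[0]:.1f} points, +{business[1]:.1f}% relative, "
      f"{business[2]} more correct")
```

With 20 questions per category, each 5-point gap in the category table corresponds to a single question, so the per-category differences should be read as indicative rather than conclusive.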

PubmedQA

Model	Success %
Brinebreath-Llama-3.1-70B.Q4_0.gguf	71.00%
Meta-Llama-3.1-70B-Instruct.Q4_0.gguf	68.00%

Note: PubmedQA tested with 100 questions.

Request

If you are hiring in the EU or can sponsor a visa, PM me :D

PS. Thank you mradermacher for the GGUFs!

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	36.29
IFEval (0-Shot)	55.33
BBH (3-Shot)	55.46
MATH Lvl 5 (4-Shot)	29.98
GPQA (0-shot)	12.86
MuSR (0-shot)	17.49
MMLU-PRO (5-shot)	46.62
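
The Avg. row appears to be the unweighted mean of the six benchmark scores. That reading is an assumption, but the arithmetic agrees with it, as this short check shows.

```python
# Scores copied from the table above.
scores = {
    "IFEval (0-Shot)": 55.33,
    "BBH (3-Shot)": 55.46,
    "MATH Lvl 5 (4-Shot)": 29.98,
    "GPQA (0-shot)": 12.86,
    "MuSR (0-shot)": 17.49,
    "MMLU-PRO (5-shot)": 46.62,
}

# Unweighted mean across the six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"Avg. {average:.2f}")
```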