license: llama2
tags:
  - merge
  - mergekit

Llama 2 13b is a pretty decent language model. You know what's probably better? Two Llama 2 13b models. In a trenchcoat.

Produced by bakllama.py with this config file:

layer_slices:
  - model: TheBloke/Llama-2-13B-fp16
    start: 0
    end: 40
  - model: TheBloke/Llama-2-13B-fp16
    start: 0
    end: 40
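
As a rough illustration of what this config describes (not bakllama.py's actual implementation), the sketch below expands the two slices into the layer order of the merged model. It assumes `end` is exclusive, like Python's `range()`, which fits Llama 2 13B having 40 decoder layers; the `stacked_layers` helper is a hypothetical name used only here.

```python
# Illustrative only: this mirrors the config above as a plain dict and
# does not call bakllama.py or mergekit.
config = {
    "layer_slices": [
        {"model": "TheBloke/Llama-2-13B-fp16", "start": 0, "end": 40},
        {"model": "TheBloke/Llama-2-13B-fp16", "start": 0, "end": 40},
    ]
}


def stacked_layers(cfg):
    """Return (model, source_layer_index) pairs in output order."""
    plan = []
    for layer_slice in cfg["layer_slices"]:
        # Assumption: `end` is exclusive, so 0..40 covers layers 0 through 39.
        for index in range(layer_slice["start"], layer_slice["end"]):
            plan.append((layer_slice["model"], index))
    return plan


plan = stacked_layers(config)
print(len(plan))          # total number of layers in the stacked model
print(plan[39], plan[40])  # last layer of the first copy, first layer of the second
```

Under that assumption the second copy starts right where the first one ends, so the final layer's output of the first copy is fed straight into layer 0 of the second.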

No fine-tuning was done on this model. Yes, it's still coherent somehow.

Benchmark results:

| Benchmark | Llama2-13b | Llama2-26b-tcs | Percent Change |
| --- | --- | --- | --- |
| ARC | 59.3 | 55.03 | -7.2% |
| HellaSwag | 82.15 | 79.9 | -2.74% |
| MMLU | 55.67 | 53.73 | -3.48% |
| TruthfulQA | 37.39 | 40.48 | +8.26% |
| Average | 58.63 | 57.29 | -2.29% |
| Average Minus TQA | 65.70 | 62.85 | -4.34% |
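
The Percent Change column reads as the relative change from the 13b score to the 26b score. A minimal sketch for recomputing it from the per-benchmark scores listed above; the `scores` dict and the `percent_change` helper are names made up for this example.

```python
# (Llama2-13b score, Llama2-26b-tcs score) per benchmark, copied from the table.
scores = {
    "ARC": (59.3, 55.03),
    "HellaSwag": (82.15, 79.9),
    "MMLU": (55.67, 53.73),
    "TruthfulQA": (37.39, 40.48),
}


def percent_change(base, new):
    """Relative change from `base` to `new`, in percent."""
    return (new - base) / base * 100


changes = {
    name: round(percent_change(base, new), 2)
    for name, (base, new) in scores.items()
}

for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```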

This tells us two very important things:

  1. TruthfulQA is a perfect benchmark in every way.
  2. Llama models are amazingly robust to being fed their own output.