Llama-3.1-8B-Fusion-8020

Overview

Llama-3.1-8B-Fusion-8020 is a mixed model that combines two Llama-based models: arcee-ai/Llama-3.1-SuperNova-Lite and mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated. The weights are blended in an 8:2 ratio, with 80% of the weights from SuperNova-Lite and 20% from the abliterated Meta-Llama-3.1-8B-Instruct model. Although it is a simple mix, the model is usable and has not produced gibberish. This is an experiment: I tested the 9:1, 8:2, 7:3, 6:4 and 5:5 ratios separately to see how much the ratio affects the model. Evaluation results for all of these models are reported below.

Model Details

Base Models:
- arcee-ai/Llama-3.1-SuperNova-Lite (80%)
- mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated (20%)
Model Size: 8B parameters
Architecture: Llama 3.1
Mixing Ratio: 8:2 (SuperNova-Lite:Meta-Llama-3.1-8B-Instruct-abliterated)
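
The card does not say how the weights were combined. One common reading of an 8:2 blend is element-wise linear interpolation of matching parameters, and the sketch below illustrates that reading only. The helper `blend_state_dicts` and its parameter names are hypothetical and not taken from this repository. It accepts any mapping whose values support `*` and `+`, so plain floats stand in for weight tensors here.

```python
def blend_state_dicts(primary, secondary, primary_weight=0.8):
    """Blend two parameter mappings element-wise.

    Hypothetical helper for illustration (assumed linear interpolation):
        primary_weight * primary + (1 - primary_weight) * secondary
    Both mappings must have identical keys; values may be floats or tensors.
    """
    if not 0.0 <= primary_weight <= 1.0:
        raise ValueError("primary_weight must be in [0, 1]")
    if primary.keys() != secondary.keys():
        raise ValueError("state dicts must share the same parameter names")
    secondary_weight = 1.0 - primary_weight
    return {
        name: primary_weight * primary[name] + secondary_weight * secondary[name]
        for name in primary
    }


# Toy example: scalars stand in for the weight tensors of each base model.
supernova = {"layer.weight": 1.0, "layer.bias": 0.5}
abliterated = {"layer.weight": 0.0, "layer.bias": -0.5}
blended = blend_state_dicts(supernova, abliterated, primary_weight=0.8)
```

With real checkpoints, the two inputs would be the `state_dict()` of each base model and the result would be loaded back with `load_state_dict`. That step is left out here because it needs both 8B models in memory.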

Key Features

SuperNova-Lite Contributions (80%): Llama-3.1-SuperNova-Lite is an 8B parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture.
Meta-Llama-3.1-8B-Instruct-abliterated Contributions (20%): This is an uncensored version of Llama 3.1 8B Instruct created with abliteration.

Usage

You can use this mixed model in your applications by loading it with Hugging Face's transformers library:

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

mixed_model_name = "huihui-ai/Llama-3.1-8B-Fusion-8020"

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
mixed_model = AutoModelForCausalLM.from_pretrained(mixed_model_name, device_map=device, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(mixed_model_name)

# Ensure the tokenizer has pad_token_id set
tokenizer.pad_token_id = tokenizer.eos_token_id

# Input loop
print("Start inputting text for inference (type 'exit' to quit)")
while True:
    prompt = input("Enter your prompt: ")
    if prompt.lower() == "exit":
        print("Exiting inference loop.")
        break

    # Build the chat messages for the mixed model
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Prepare input data; return_dict=True also returns the attention mask
    inputs = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(device)

    # Stream decoded text to stdout as it is generated, without echoing the prompt
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Record the start time
    start_time = time.perf_counter()

    # Generate text; the streamer prints it incrementally
    outputs = mixed_model.generate(
        **inputs,
        max_new_tokens=8192,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        streamer=streamer,  # Enable streaming output
    )

    # Record the end time
    end_time = time.perf_counter()

    # Count only the newly generated tokens (exclude the prompt)
    generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:].shape[0]

    # Calculate the total time taken
    total_time = end_time - start_time

    # Calculate tokens generated per second
    tokens_per_second = generated_tokens / total_time

    print(f"\nGenerated {generated_tokens} tokens in total, took {total_time:.2f} seconds, generating {tokens_per_second:.2f} tokens per second.")

Evaluations

The following scores have been re-evaluated; each value is the average for that test.

Benchmark	SuperNova-Lite	Meta-Llama-3.1-8B-Instruct-abliterated	Llama-3.1-8B-Fusion-9010	Llama-3.1-8B-Fusion-8020	Llama-3.1-8B-Fusion-7030	Llama-3.1-8B-Fusion-6040	Llama-3.1-8B-Fusion-5050
IF_Eval	82.09	76.29	82.44	82.93	83.10	82.94	82.03
MMLU Pro	35.87	33.10	35.65	35.32	34.91	34.50	33.96
TruthfulQA	64.35	53.25	62.67	61.04	59.09	57.80	56.75
BBH	49.48	44.87	48.86	48.47	48.30	48.19	47.93
GPQA	31.98	29.50	32.25	32.38	32.61	31.14	30.60

The script used for evaluation can be found inside this repository under /eval.sh.
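
To read the table as per-benchmark changes, the sketch below subtracts the SuperNova-Lite score from the Llama-3.1-8B-Fusion-8020 score for each benchmark. The numbers are copied from the table above. The helper name `score_deltas` is made up for this illustration and is not part of eval.sh.

```python
# Scores copied from the evaluation table (SuperNova-Lite vs. Fusion-8020).
supernova_lite = {
    "IF_Eval": 82.09, "MMLU Pro": 35.87, "TruthfulQA": 64.35,
    "BBH": 49.48, "GPQA": 31.98,
}
fusion_8020 = {
    "IF_Eval": 82.93, "MMLU Pro": 35.32, "TruthfulQA": 61.04,
    "BBH": 48.47, "GPQA": 32.38,
}


def score_deltas(candidate, baseline):
    """Return candidate minus baseline per benchmark, rounded to 2 decimals.

    Hypothetical helper for illustration only.
    """
    return {name: round(candidate[name] - baseline[name], 2) for name in baseline}


deltas = score_deltas(fusion_8020, supernova_lite)
for name, delta in deltas.items():
    # Positive values mean the 8:2 mix scored higher than SuperNova-Lite.
    print(f"{name:<12}{delta:+.2f}")
```

On these numbers the 8:2 mix is slightly ahead of SuperNova-Lite on IF_Eval and GPQA, and behind on MMLU Pro, TruthfulQA and BBH.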