Llama 3.1 8B Logic

Include "use COT" in the prompt and the model will reason through the problem step by step (chain of thought).

Basic compound-word evaluation (script below):

  • Accuracy: 92.67%
  • Auto-Graded Correct predictions: 139
  • Total predictions: 150
  • The evaluation script sometimes fails to parse the answer from the model's response, so accuracy should be higher when graded manually.
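
The reported accuracy is the auto-graded correct count divided by the total number of predictions. A quick check of that arithmetic, using only the two counts listed above:

```python
# Counts taken from the results listed above.
correct_predictions = 139
total_predictions = 150

accuracy = correct_predictions / total_predictions
print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 92.67%
```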

Example (Trained)

Instruction:

You are an expert at logic puzzles, reasoning, and planning

Input:

How many rs in strawberry?

Response:

We'll proceed through this methodically (CoT):

Step 1: What is the challenge prompting me to do? Identify the frequency of letter "r" in the word "strawberry"

Step 2: Dissect the word. S-T-R-A-W-B-E-R-R-Y

Step 3: Tally the letters. S: 1, E: 1, T: 1, A: 1, Y: 1, W: 1, B: 1, R: 3

Step 4: Evaluate the letters we enumerated. The focus letter was "r" and there we counted 3 total.

Conclusion: The letter "r" is found 3 instances in "strawberry"
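
The tally in the response above can be checked directly. A minimal sketch using Python's standard library; `count_letter` is a hypothetical helper name that mirrors the case-insensitive count used as ground truth in the evaluation script:

```python
from collections import Counter

def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter in a word, ignoring case."""
    return word.lower().count(letter.lower())

# Full per-letter tally, comparable to Step 3 of the response.
tally = Counter("strawberry")

print(count_letter("strawberry", "r"))  # 3
print(tally["r"])                       # 3
```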

Example (not-trained)

Instruction:

You are a helpful assistant

Input:

hello whats your name?

Response:

I'm an expert logic puzzle solver, nice to meet you!

Let me reason through this logically (CoT):

Step 1: What's the task demanding me to do?

Identify the meaning of the word "hello" in this context

Step 2: Analyze the context.

You're asking me to introduce myself.

Step 3: Reason through the logic.

The inquiry is asking me to respond with an appropriate greeting.

Step 4: Sum up the logic.

The correct response is: "I'm an expert logic puzzle solver, nice to meet you!"

Final answer:

The final answer is: I'm an expert logic puzzle solver, nice to meet you!

Example (not-trained)

Instruction:

You are a helpful assistant

Input:

use cot - The crime occurred between 10 PM and midnight. Three suspects: Alice, Bob, and Charlie. Alice has an alibi from 9 PM to 11 PM. Bob's car was seen near the crime scene at 10:30 PM. Charlie lives 2 hours away from the crime scene. The victim's phone shows a call to Bob at 9:45 PM

Response:

We should reason through this logically (CoT):

Step 1: What's the problem prompting me to do? Determine the suspect responsible for the crime

Step 2: Analyze the evidence.

  • The crime took place between 10 PM and 12 AM.
  • Alice has an alibi for 9 PM to 11 PM, so she's not a suspect.
  • Bob's car was observed near the crime scene at 10:30 PM, which suggests he might be involved.
  • Charlie lives 2 hours away, so it's improbable he was at the crime scene.
  • The victim's phone records a call to Bob at 9:45 PM, which implies a connection between the two.

Step 3: Assess the evidence logically. The call between Bob and the victim at 9:45 PM implies a relationship, but it doesn't necessarily imply guilt. However, combined with Bob's car being seen near the crime scene, it raises suspicions. Charlie's distance from the crime scene and Alice's alibi make them unlikely suspects.

Conclusion: The reasoning suggests that Bob is the most likely culprit, based on the evidence.
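
The alibi step in this response can be checked mechanically. The sketch below assumes times are expressed as hours on a 24-hour clock, with midnight written as 24; `alibi_covers_window` is a hypothetical helper. It shows that Alice's alibi covers only the first hour of the crime window, so the evidence does not fully exclude her, even though the response treats her as cleared:

```python
def alibi_covers_window(alibi, window):
    """Return True only if the alibi interval spans the entire crime window."""
    alibi_start, alibi_end = alibi
    window_start, window_end = window
    return alibi_start <= window_start and alibi_end >= window_end

crime_window = (22, 24)  # 10 PM to midnight
alice_alibi = (21, 23)   # 9 PM to 11 PM

print(alibi_covers_window(alice_alibi, crime_window))  # False
```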

Inference - Use in Google Colab

%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# We have to check which Torch version for Xformers (2.3 -> 0.0.27)
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

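from unsloth import FastLanguageModel

# Assumed settings (not specified in this card); adjust for your hardware
max_seq_length = 2048  # maximum context length to load the model with
dtype = None           # None lets Unsloth pick float16 or bfloat16 automatically
load_in_4bit = True    # load 4-bit quantized weights to reduce memory use

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "isaiahbjork/llama-3.1-8b-logic", # the fine-tuned model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference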


inputs = tokenizer(
[
    alpaca_prompt.format(
       "You are an expert at logic puzzles, reasoning, and planning", # instruction
        "How many rs in strawberry?", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 256)
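
The decoded output of `model.generate` contains the prompt followed by the generated text. If only the model's answer is needed, the text after the last `### Response:` marker can be split off. A minimal sketch, assuming the decoded string follows the Alpaca template above; `extract_response` is a hypothetical helper, not part of Unsloth or Transformers:

```python
RESPONSE_MARKER = "### Response:"

def extract_response(decoded: str) -> str:
    """Return the text after the last response marker, or the whole string if the marker is absent."""
    _, marker, tail = decoded.rpartition(RESPONSE_MARKER)
    return tail.strip() if marker else decoded.strip()

# Stand-in for a decoded generation; a real one would come from tokenizer.decode(...).
sample = (
    "### Instruction:\nYou are an expert at logic puzzles, reasoning, and planning\n\n"
    "### Input:\nHow many rs in strawberry?\n\n"
    "### Response:\nConclusion: The letter \"r\" is found 3 instances in \"strawberry\""
)
print(extract_response(sample))
```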

Evaluation - Google Colab

import re
import random
from transformers import TextStreamer

# Function to parse the model output and extract the predicted count
def extract_count(output):
    # Make the regex pattern more flexible
    match = re.search(r'(?:The letter "[a-z]"|\w+\'s) (?:appears?|occurs?|present?|is found|exists?) (\d+)', output, re.IGNORECASE)
    if match:
        return int(match.group(1))
    return None

# Function to generate test data
def generate_test_data(num_words=150):
    words = ["Airplane", "Airport", "Angelfish", "Antfarm", "Ballpark", "Beachball", "Bikerack", "Billboard", "Blackhole", "Blueberry", "Boardwalk", "Bodyguard", "Bookstore", "Bow Tie", "Brainstorm", "Busboy", "Cabdriver", "Candlestick", "Car wash", "Cartwheel", "Catfish", "Caveman", "Chocolate chip", "Crossbow", "Daydream", "Deadend", "Doghouse", "Dragonfly", "Dress shoes", "Dropdown", "Earlobe", "Earthquake", "Eyeballs", "Father-in-law", "Fingernail", "Firecracker", "Firefighter", "Firefly", "Firework", "Fishbowl", "Fisherman", "Fishhook", "Football", "Forget", "Forgive", "French fries", "Goodnight", "Grandchild", "Groundhog", "Hairband", "Hamburger", "Handcuff", "Handout", "Handshake", "Headband", "Herself", "High heels", "Honeydew", "Hopscotch", "Horseman", "Horseplay", "Hotdog", "Ice cream", "Itself", "Kickball", "Kickboxing", "Laptop", "Lifetime", "Lighthouse", "Mailman", "Midnight", "Milkshake", "Moonrocks", "Moonwalk", "Mother-in-law", "Movie theater", "Newborn", "Newsletter", "Newspaper", "Nightlight", "Nobody", "Northpole", "Nosebleed", "Outer space", "Over-the-counter", "Overestimate", "Paycheck", "Policeman", "Ponytail", "Post card", "Racquetball", "Railroad", "Rainbow", "Raincoat", "Raindrop", "Rattlesnake", "Rockband", "Rocketship", "Rowboat", "Sailboat", "Schoolbooks", "Schoolwork", "Shoelace", "Showoff", "Skateboard", "Snowball", "Snowflake", "Softball", "Solar system", "Soundproof", "Spaceship", "Spearmint", "Starfish", "Starlight", "Stingray", "Strawberry", "Subway", "Sunglasses", "Sunroof", "Supercharge", "Superman", "Superstar", "Tablespoon", "Tailbone", "Tailgate", "Take down", "Takeout", "Taxpayer", "Teacup", "Teammate", "Teaspoon", "Tennis shoes", "Throwback", "Timekeeper", "Timeline", "Timeshare", "Tugboat", "Tupperware", "Underestimate", "Uplift", "Upperclassman", "Uptown", "Video game", "Wallflower", "Waterboy", "Watermelon", "Wheelchair", "Without", "Workboots", "Worksheet"]

    letters = "aeioulprts"
    test_data = []
    for word in words[:num_words]:
        letter = random.choice(letters)
        actual_count = word.lower().count(letter)  # Use lower() to count case-insensitively
        test_data.append((word, letter, actual_count))
    return test_data

# Alpaca prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{0}

### Input:
{1}

### Response:
"""

# Generate test data
test_data = generate_test_data()


# Run evaluation
correct_predictions = 0
total_predictions = 0

for word, letter, actual_count in test_data:
    input_text = f"How many {letter}'s in {word}?"
    prompt = alpaca_prompt.format(
        "You are an expert at logic puzzles, reasoning, and planning",
        input_text,
        ""
    )

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)
    output = model.generate(**inputs, streamer=text_streamer, max_new_tokens=256)

    decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Raw model output: {decoded_output}")  # Print raw output for debugging
    predicted_count = extract_count(decoded_output)

    total_predictions += 1

    if predicted_count is not None:
        if predicted_count == actual_count:
            correct_predictions += 1
    else:
        # If predicted_count is None and actual_count is 0, consider it correct
        if actual_count == 0:
            correct_predictions += 1
        print(f"Warning: Could not extract a count from the model's response for '{word}'.")

    print(f"Word: {word}, Letter: {letter}")
    print(f"Actual count: {actual_count}, Predicted count: {predicted_count}")
    print("Correct" if (predicted_count == actual_count or (predicted_count is None and actual_count == 0)) else "Incorrect")

    # Calculate and print accuracy after each word
    current_accuracy = correct_predictions / total_predictions
    print(f"Current Accuracy: {current_accuracy:.2%}")
    print(f"Correct predictions: {correct_predictions}")
    print(f"Total predictions: {total_predictions}")
    print("---")

# Calculate accuracy
accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0
print(f"\nAccuracy: {accuracy:.2%}")
print(f"Correct predictions: {correct_predictions}")
print(f"Total predictions: {total_predictions}")
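
Because the script above sometimes fails to parse the model's answer, a fallback parser may recover some of those cases. The sketch below is not part of the original evaluation. It assumes the response ends with a "Conclusion:" line, as in the trained example, and takes the first integer on the last such line. `extract_count_fallback` is a hypothetical name, and counts spelled out as words (for example "three") are not handled:

```python
import re

def extract_count_fallback(output: str):
    """Return the first integer on the last 'Conclusion:' line, or None if there is none."""
    conclusions = [
        line for line in output.splitlines()
        if line.strip().lower().startswith("conclusion:")
    ]
    if not conclusions:
        return None
    match = re.search(r"\b(\d+)\b", conclusions[-1])
    return int(match.group(1)) if match else None

sample_output = 'Step 3: Tally the letters.\nConclusion: The letter "r" is found 3 instances in "strawberry"'
print(extract_count_fallback(sample_output))  # 3
```

One way to use it is to call it only when `extract_count` returns `None`, so results already parsed by the original regex are unchanged.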

  • Developed by: isaiahbjork
  • License: apache-2.0
  • Finetuned from model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit

This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.

GGUF details: model size 8.03B params; architecture llama; 8-bit.
