Model Card for "InstructMix Llama 3B"

Model Name: InstructMix Llama 3B

Description:

InstructMix Llama 3B is a language model fine-tuned on the InstructMix dataset using parameter-efficient fine-tuning (PEFT), using the base model "openlm-research/open_llama_3b_v2," which can be found at https://huggingface.co/openlm-research/open_llama_3b_v2.

An easy way to use InstructMix Llama 3B is via the API: https://replicate.com/ritabratamaiti/instructmix-llama-3b

Usage:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
from peft import PeftModel, PeftConfig

# Hugging Face model_path
model_path = 'openlm-research/open_llama_3b_v2'
peft_model_id = 'Xilabs/instructmix-llama-3b'
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)
def generate_prompt(instruction, input=None):
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""
def evaluate(
    instruction,
    input=None,
    temperature=0.5,
    top_p=0.75,
    top_k=40,
    num_beams=5,
    max_new_tokens=128,
    **kwargs,
):
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to("cuda")
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        early_stopping=True,
        repetition_penalty=1.1,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s, skip_special_tokens = True)
    #print(output)
    return output.split("### Response:")[1]

instruction = "What is the meaning of life?"
print(evaluate(instruction, num_beams=3, temperature=0.1, max_new_tokens=256))

Training procedure

The following bitsandbytes quantization config was used during training:

  • load_in_8bit: False
  • load_in_4bit: True
  • llm_int8_threshold: 6.0
  • llm_int8_skip_modules: None
  • llm_int8_enable_fp32_cpu_offload: False
  • llm_int8_has_fp16_weight: False
  • bnb_4bit_quant_type: nf4
  • bnb_4bit_use_double_quant: True
  • bnb_4bit_compute_dtype: bfloat16

Framework versions

  • PEFT 0.4.0
Downloads last month
2
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train Xilabs/instructmix-llama-3b