Model Card for "InstructMix Llama 3B"
Model Name: InstructMix Llama 3B
Description:
InstructMix Llama 3B is a language model fine-tuned on the InstructMix dataset with parameter-efficient fine-tuning (PEFT), starting from the base model "openlm-research/open_llama_3b_v2", available at https://huggingface.co/openlm-research/open_llama_3b_v2.
An easy way to use InstructMix Llama 3B is via the API: https://replicate.com/ritabratamaiti/instructmix-llama-3b
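If you prefer the hosted endpoint, the Replicate Python client can call the model directly. The sketch below is illustrative only: the input keys (`instruction`, `temperature`, `max_new_tokens`) are assumptions about the deployment's schema, so check the page linked above before relying on them.

```python
# Minimal sketch of calling the hosted Replicate deployment (field names assumed;
# verify the schema at https://replicate.com/ritabratamaiti/instructmix-llama-3b).
import replicate

output = replicate.run(
    "ritabratamaiti/instructmix-llama-3b",  # a specific version hash may need to be appended
    input={
        "instruction": "What is the meaning of life?",
        "temperature": 0.1,
        "max_new_tokens": 256,
    },
)
print(output)
```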
Usage:
```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, GenerationConfig
from peft import PeftModel

# Base model on the Hugging Face Hub and the PEFT adapter for InstructMix Llama 3B
model_path = 'openlm-research/open_llama_3b_v2'
peft_model_id = 'Xilabs/instructmix-llama-3b'

# Load the tokenizer and base model, then attach the fine-tuned adapter
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path, device_map="auto")
model = PeftModel.from_pretrained(model, peft_model_id)

def generate_prompt(instruction, input=None):
    """Build an Alpaca-style prompt, with or without an additional input field."""
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:"""

def evaluate(
    instruction,
    input=None,
    temperature=0.5,
    top_p=0.75,
    top_k=40,
    num_beams=5,
    max_new_tokens=128,
    **kwargs,
):
    # Format the prompt and move the token ids to the GPU
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to("cuda")
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        early_stopping=True,
        repetition_penalty=1.1,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s, skip_special_tokens=True)
    # Return only the model's answer, i.e. the text after "### Response:"
    return output.split("### Response:")[1]

instruction = "What is the meaning of life?"
print(evaluate(instruction, num_beams=3, temperature=0.1, max_new_tokens=256))
```
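For tasks that need extra context, the same `evaluate` helper accepts an `input` argument, which fills the `### Input:` block of the prompt. The passage below is a placeholder chosen for illustration:

```python
# Example of supplying additional context via the "input" field of the prompt.
instruction = "Summarize the following passage in one sentence."
context = "Parameter-efficient fine-tuning updates only a small set of adapter weights while keeping the base model frozen."
print(evaluate(instruction, input=context, max_new_tokens=128))
```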
Training procedure
The following bitsandbytes quantization config was used during training:
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: True
- bnb_4bit_compute_dtype: bfloat16
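For reference, these values map onto a `transformers` `BitsAndBytesConfig` roughly as sketched below; this is an illustration of the training-time quantization setup, not a reproduction of the actual training script:

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch of the 4-bit (NF4, double-quantized) quantization config listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_threshold=6.0,
    llm_int8_enable_fp32_cpu_offload=False,
    llm_int8_has_fp16_weight=False,
)
```

Passing this config as `quantization_config` to `LlamaForCausalLM.from_pretrained` loads the base model in 4-bit before LoRA adapters are attached.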
Framework versions
- PEFT 0.4.0