metadata

language: en
license: apache-2.0

Shears Model Card: Shears-llama-13b-50-math-heuristic

A LLaMA-13B model sparsified to 50% and fine-tuned on math reasoning datasets using Shears, which combines unstructured sparsity with neural low-rank adapter search.

Model Details

Information

Model name: Shears-llama-13b-50-math-heuristic
Base model: LLaMA-13b
Sparsity: 50%
Domain: Math
Subnetwork version: Heuristic

Adapter Configuration

LoRA rank: 32 (24 in the heuristic subnetwork)
LoRA alpha: 64
LoRA target modules: q_proj, k_proj, v_proj, up_proj, down_proj
LoRA rank search space: [32, 24, 16]
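
To give a sense of what the rank choices in the search space mean in adapter size, the sketch below counts LoRA parameters at each rank. It assumes the publicly documented LLaMA-13B dimensions (hidden size 5120, MLP intermediate size 13824, 40 layers), which this card does not state, and it assumes a single uniform rank across all target modules. The function name `lora_param_count` is illustrative, not part of Shears or PEFT.

```python
# Assumed LLaMA-13B dimensions (from the public model configuration, not from this card).
HIDDEN = 5120
INTERMEDIATE = 13824
NUM_LAYERS = 40

# (input features, output features) of each targeted linear layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Count parameters in all LoRA A (rank x in) and B (out x rank) matrices at a uniform rank."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


# Ranks taken from the search space listed above.
for rank in (32, 24, 16):
    print(f"rank {rank}: {lora_param_count(rank):,} adapter parameters")
```

Under these assumptions the adapter size scales linearly with rank, so the rank-24 heuristic subnetwork carries three quarters of the adapter parameters of the full rank-32 configuration.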

Training Hyperparameters

Batch size: 16
Learning rate: 3e-4
Epochs: 3

Training Data

Unified math reasoning dataset: math_10k.json (compiled from the training sets of GSM8K, MAWPS, and AQuA).

Evaluation Data

GSM8K, AQuA, MAWPS, SVAMP

How to use

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

def generate_prompt(instruction):
    return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request. 

                    ### Instruction:
                    {instruction}

                    ### Response:
                    """

base_model_path = "shears-llama-13b-50-math-heuristic/base_model"
adapter_model_path = "shears-llama-13b-50-math-heuristic/adapter_model"
base_model = AutoModelForCausalLM.from_pretrained(base_model_path)
model = PeftModel.from_pretrained(base_model, adapter_model_path)
model.eval()

non_zero_params = sum([(param.data != 0).sum().item() for _, param in model.named_parameters()])
print(f"Number of all non-zero parameters: {non_zero_params}")

# Assumes the tokenizer files are stored alongside the base model.
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
tokenizer.pad_token_id = 0

instruction = "Edgar eats 18 pretzels a day. If his brother eats 1/2 as many, how many does his brother eat in a week?"
prompt = generate_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256,
        use_cache=True,
        num_beams=4,
    )
# Decode the highest-scoring beam back to text.
s = generation_output.sequences[0]
output = tokenizer.decode(s)
print(output)
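
The decoded sequence contains the prompt followed by the generated answer, so it is often convenient to keep only the text after the response marker. The helper below is a minimal sketch of one way to do that. `extract_response` is a hypothetical name, and the `sample` string is hand-written for illustration, not actual model output. It also assumes the end-of-sequence token appears as the literal `</s>` in the decoded text.

```python
def extract_response(decoded: str, marker: str = "### Response:") -> str:
    """Return the text after the last response marker, or the whole string if the marker is absent."""
    tail = decoded.rpartition(marker)[2]
    # Drop the end-of-sequence token if the tokenizer left it in the decoded text.
    return tail.replace("</s>", "").strip()


# Hand-written stand-in for a decoded sequence (not real model output).
sample = (
    "<s> Below is an instruction that describes a task.\n\n"
    "### Instruction:\nHow many days are in a week?\n\n"
    "### Response:\n    There are 7 days in a week.</s>"
)
print(extract_response(sample))
```

Passing `skip_special_tokens=True` to `tokenizer.decode` is an alternative way to remove the special tokens before splitting.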

Evaluation Results

Model	Sparsity	GSM8K	AQuA	MAWPS	SVAMP	Average
LLaMA-7B-LoRA	-	37.5	18.9	79.0	52.1	46.9
LLaMA-7B-Shears	50%	36.1	22.0	78.6	44.5	45.3
LLaMA-13B-LoRA	-	47.5	18.5	83.6	54.6	51.1
LLaMA-13B-Shears	50%	45.1	22.0	83.2	53.3	50.9
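
The Average column appears to be the unweighted mean of the four benchmark scores. The short check below recomputes it from the numbers in the table above. The 0.06 tolerance is an assumption that allows for rounding to one decimal place.

```python
# Scores copied from the table above: [GSM8K, AQuA, MAWPS, SVAMP], reported average.
RESULTS = {
    "LLaMA-7B-LoRA": ([37.5, 18.9, 79.0, 52.1], 46.9),
    "LLaMA-7B-Shears": ([36.1, 22.0, 78.6, 44.5], 45.3),
    "LLaMA-13B-LoRA": ([47.5, 18.5, 83.6, 54.6], 51.1),
    "LLaMA-13B-Shears": ([45.1, 22.0, 83.2, 53.3], 50.9),
}


def mean(scores):
    """Unweighted arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


for name, (scores, reported) in RESULTS.items():
    computed = mean(scores)
    # Allow a small tolerance for rounding to one decimal place.
    status = "matches" if abs(computed - reported) < 0.06 else "differs from"
    print(f"{name}: computed {computed:.2f} {status} reported {reported}")
```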

Model Sources

Repository: https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears
Paper: Shears: Unstructured Sparsity with Neural Low-rank Adapter Search

Citation

@article{munoz2024shears,
  title={Shears: Unstructured Sparsity with Neural Low-rank Adapter Search},
  author={J. Pablo Munoz and Jinjie Yuan and Nilesh Jain},
  journal={},
  year={2024}
}

License

Apache-2.0