---
license: apache-2.0
language:
- en
library_name: transformers
---
# Compressed Meta Llama-3-8B-Instruct with Palu
## Overview
This repository contains a compressed version of the Meta Llama-3-8B-Instruct model, produced with the Palu framework for KV-Cache compression. Palu shrinks the hidden dimension of the KV-Cache through low-rank decomposition of the key/value projection layers, substantially reducing the cache's memory footprint at a small cost in accuracy (see the results below).
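As a concrete illustration of the idea, the toy sketch below factors a single projection matrix with a plain truncated SVD and caches the low-rank latent instead of the full state. Everything here (the sizes, the plain SVD, the variable names) is an assumption for exposition, not Palu's actual decomposition or rank-search procedure:
```python
import torch

# Toy sketch of low-rank KV-Cache compression (illustrative, not Palu's code).
# Factor a key/value projection W (d_in x d_out) into A @ B with rank r < d_out,
# then cache the r-dimensional latent h @ A instead of the full state h @ W.
d_in, d_out, r = 4096, 1024, 512   # hypothetical sizes

W = torch.randn(d_in, d_out)                    # stand-in projection weight
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]                            # (d_in, r)  down-projection
B = Vh[:r, :]                                   # (r, d_out) up-projection

h = torch.randn(1, d_in)                        # one hidden state
latent = h @ A                                  # cached: r values per token, not d_out
full = latent @ B                               # reconstructed at attention time
print((h @ W - full).norm() / (h @ W).norm())   # reconstruction error; real weight
                                                # matrices are far closer to low-rank
                                                # than this random W
```
With `r = 512` against `d_out = 1024`, each cached token takes half the memory before any quantization is applied.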
## Compression Results
### Perplexity (PPL)
| Model | PPL |
|----------------------------------------|-----------------|
| **meta-llama-3-8b-instruct-palu** | **8.8309** |
| **meta-llama-3-8b-instruct (Base)** | **8.2845** |
### Zero-shot Evaluation
#### meta-llama-3-8b-instruct-palu
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------------|---------|--------|--------|---------|--------|---------|
| winogrande | 1 | none | 0 | acc | 0.7277 | ±0.0125 |
| arc_challenge | 1 | none | 0 | acc | 0.4949 | ±0.0146 |
| | | | 0 | acc_norm| 0.5427 | ±0.0146 |
| arc_easy | 1 | none | 0 | acc | 0.7942 | ±0.0083 |
| | | | 0 | acc_norm| 0.7551 | ±0.0088 |
| piqa | 1 | none | 0 | acc | 0.7655 | ±0.0099 |
| | | | 0 | acc_norm| 0.7644 | ±0.0099 |
| hellaswag | 1 | none | 0 | acc | 0.5664 | ±0.0049 |
| | | | 0 | acc_norm| 0.7511 | ±0.0043 |
| openbookqa | 1 | none | 0 | acc | 0.3360 | ±0.0211 |
| | | | 0 | acc_norm| 0.4380 | ±0.0222 |
#### meta-llama-3-8b-instruct (Base)
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------------|---------|--------|--------|---------|--------|---------|
| winogrande | 1 | none | 0 | acc | 0.7206 | ±0.0126 |
| arc_challenge | 1 | none | 0 | acc | 0.5299 | ±0.0146 |
| | | | 0 | acc_norm| 0.5683 | ±0.0145 |
| arc_easy | 1 | none | 0 | acc | 0.8161 | ±0.0079 |
| | | | 0 | acc_norm| 0.7976 | ±0.0082 |
| piqa | 1 | none | 0 | acc | 0.7867 | ±0.0096 |
| | | | 0 | acc_norm| 0.7856 | ±0.0096 |
| hellaswag | 1 | none | 0 | acc | 0.5769 | ±0.0049 |
| | | | 0 | acc_norm| 0.7581 | ±0.0043 |
| openbookqa | 1 | none | 0 | acc | 0.3420 | ±0.0212 |
| | | | 0 | acc_norm| 0.4320 | ±0.0222 |
### Long-Bench Evaluation
#### triviaqa
| Model | Score |
|----------------------------------------|--------|
| **meta-llama-3-8b-instruct-palu** | 89.45 |
| **meta-llama-3-8b-instruct (Base)** | 90.56 |
#### qasper
| Model | Score |
|----------------------------------------|--------|
| **meta-llama-3-8b-instruct-palu** | 34.92 |
| **meta-llama-3-8b-instruct (Base)** | 31.74 |
---
## Key Features
- **Model**: Meta Llama-3-8B-Instruct
- **Compression Framework**: Palu
- **Compression Rate**: Up to 91.25% KV-Cache memory reduction
- **Accuracy**: Modest quality cost; perplexity rises from 8.28 to 8.83, and zero-shot and Long-Bench scores stay within a few points of the base model (see the tables above)
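To try the model directly, here is a minimal loading sketch. Two assumptions are baked in: the repo id below is a placeholder for this repository's actual id, and `trust_remote_code=True` is presumed necessary for Palu's custom low-rank attention modules.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Syed-Hasan-8503/Meta-Llama-3-8B-Instruct-Palu"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # assumed: Palu ships custom modeling code
)

inputs = tokenizer("Low-rank KV-Cache compression works by", return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```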
## Installation
### Clone the Repository
Ensure you have Git and Conda installed on your system.
```bash
git clone --recurse-submodules https://github.com/shadowpa0327/Palu.git
cd Palu
```
### Set Up the Environment
Create and activate a Conda environment.
```bash
conda create -n Palu python=3.10
conda activate Palu
pip install -r requirements.txt
```
### Install Third-Party Libraries
```bash
pip install -e 3rdparty/lm-evaluation-harness
pip install -e 3rdparty/fast-hadamard-transform
```
## Usage
### Compress the Model
To compress Meta Llama-3-8B-Instruct using Palu's low-rank decomposition, use the following command:
```bash
python compress.py \
--model_id="meta-llama/Meta-Llama-3-8B-Instruct" \
--calib_dataset wikitext2 \
--param_ratio_target 0.7 \
--search_method fisher_uniform \
--head_group_size 4 \
--dump_huggingface_model \
--use_cache
```
The compressed model will be saved in Hugging Face format in the `Meta-Llama-3-8B-Instruct_ratio-0.7_gs-4-fisher_uniform` directory.
### Evaluate the Compressed Model
#### Perplexity
To evaluate the perplexity on the `wikitext2` dataset with sequence length 2048, run:
```bash
python run_ppl_eval.py \
--model_name_or_path /Path/To/Palu/Model \
--datasets wikitext2 \
--seqlen 2048
```
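For reference, this is the standard fixed-window perplexity metric that such scripts report; the sketch below is an assumption about what `run_ppl_eval.py` computes, not its actual code:
```python
import math
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor, seqlen: int = 2048) -> float:
    """Fixed-window PPL: exp of the mean per-token negative log-likelihood."""
    nll_sum, n_tokens = 0.0, 0
    for i in range(0, input_ids.shape[1] - seqlen + 1, seqlen):
        chunk = input_ids[:, i : i + seqlen]
        # Passing labels=chunk makes the model return the mean cross-entropy loss.
        nll_sum += model(chunk, labels=chunk).loss.item() * seqlen
        n_tokens += seqlen
    return math.exp(nll_sum / n_tokens)
```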
To evaluate with 3-bit low-rank-aware quantization, use:
```bash
python run_ppl_eval.py \
--model_name_or_path /Path/To/Palu/Model \
--datasets wikitext2 \
--seqlen 4096 \
--lt_bits 3 \
--lt_hadamard
```
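The `--lt_hadamard` flag applies a Hadamard rotation before quantizing the low-rank latents. The toy sketch below shows why such a rotation helps at 3 bits (illustrative only; the helper functions are hand-rolled assumptions, not the fused `fast-hadamard-transform` kernel): rotating spreads outlier energy across all channels, so a single per-tensor scale wastes fewer quantization levels.
```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5

def quantize(x: torch.Tensor, bits: int = 3) -> torch.Tensor:
    """Symmetric uniform quantization with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

x = torch.randn(1024)
x[0] = 20.0                        # one outlier blows up the quantization scale
H = hadamard(1024)

err_plain = (quantize(x) - x).norm()
err_rotated = (H.T @ quantize(H @ x) - x).norm()  # rotate, quantize, rotate back
print(f"plain: {err_plain:.2f}  rotated: {err_rotated:.2f}")  # rotated is lower
```
Because `H` is orthogonal, rotating back after dequantization recovers the signal; the rotation only redistributes where the quantization error lands.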
#### Zero-shot Evaluation
For zero-shot evaluations, use the following command:
```bash
CUDA_VISIBLE_DEVICES=0 python run_lm_eval.py \
--model_name_or_path "/Path/To/Palu/Model" \
--tasks "openbookqa,hellaswag,piqa,arc_easy,arc_challenge,winogrande"
```
#### Long-Bench Evaluation
Evaluate the compressed model on Long-Bench tasks:
```bash
CUDA_VISIBLE_DEVICES=0 python run_long_bench.py \
--model_name_or_path /Path/To/Palu/Model
```
## Latency Evaluation
### Attention Module
Evaluate the latency of the Palu-compressed attention module (here with key rank 1024, value rank 3072, a head group size of 4, and a 65,536-token prompt):
```bash
CUDA_VISIBLE_DEVICES=0 python run_latency_attention.py \
--rank_k 1024 --rank_v 3072 --group_size 4 \
--prompt_len 65536 --palu
```
### Reconstruction Kernel
Evaluate the latency of the kernel that reconstructs full key/value states from the cached low-rank latents:
```bash
CUDA_VISIBLE_DEVICES=0 python run_latency_kernel.py \
--total_rank 1024 --group_size 4
```
## Conclusion
This compressed version of Meta Llama-3-8B-Instruct, powered by Palu, trades a small amount of accuracy for a substantially smaller KV-Cache. It is a good fit for long-context workloads and memory-constrained deployments; consult the tables above to judge whether the trade-off suits your use case.