
Compressed Meta Llama-3-8B-Instruct with Palu

Overview

This repository contains a compressed version of the Meta Llama-3-8B-Instruct model, produced with the Palu framework for KV-Cache compression. Palu reduces the hidden dimensions of the KV-Cache through low-rank decomposition, significantly shrinking the inference-time memory footprint while keeping accuracy close to the base model (see the results below).
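
To make the idea concrete, the sketch below factorizes a single value-projection matrix with a truncated SVD and caches the small latent states instead of the full value states, reconstructing them on demand. This is an illustration of the principle only, not the Palu implementation; the shapes, rank, and variable names are assumptions.

import torch

# Toy illustration of low-rank KV-Cache compression (not the Palu code).
# A value projection W_v is factorized as W_v ≈ A @ B, so the cache stores
# h @ A (rank floats per token) instead of h @ W_v (head_dim floats per token).
d_model, head_dim, rank = 4096, 128, 64
W_v = torch.randn(d_model, head_dim)

U, S, Vh = torch.linalg.svd(W_v, full_matrices=False)
A = U[:, :rank] * S[:rank]      # (d_model, rank): applied once per token at decode time
B = Vh[:rank, :]                # (rank, head_dim): applied when values are reconstructed

h = torch.randn(10, d_model)    # hidden states for 10 tokens
latent_cache = h @ A            # what the compressed cache keeps: 64 floats per token
v_reconstructed = latent_cache @ B
v_full = h @ W_v                # what an uncompressed cache would keep: 128 floats per token
print((v_full - v_reconstructed).norm() / v_full.norm())  # relative reconstruction error

In the actual framework the ranks are not fixed by hand but searched per group of heads; that is what the --param_ratio_target, --search_method, and --head_group_size options in the Usage section control.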

Meta Llama-3-8B-Instruct: Palu Compression Results

Perplexity (PPL)

| Model | PPL |
|---|---|
| meta-llama-3-8b-instruct-palu | 8.8309 |
| meta-llama-3-8b-instruct (Base) | 8.2845 |

Zero-shot Evaluation

meta-llama-3-8b-instruct-palu

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| winogrande | 1 | none | 0 | acc | 0.7277 | ±0.0125 |
| arc_challenge | 1 | none | 0 | acc | 0.4949 | ±0.0146 |
| | | | 0 | acc_norm | 0.5427 | ±0.0146 |
| arc_easy | 1 | none | 0 | acc | 0.7942 | ±0.0083 |
| | | | 0 | acc_norm | 0.7551 | ±0.0088 |
| piqa | 1 | none | 0 | acc | 0.7655 | ±0.0099 |
| | | | 0 | acc_norm | 0.7644 | ±0.0099 |
| hellaswag | 1 | none | 0 | acc | 0.5664 | ±0.0049 |
| | | | 0 | acc_norm | 0.7511 | ±0.0043 |
| openbookqa | 1 | none | 0 | acc | 0.3360 | ±0.0211 |
| | | | 0 | acc_norm | 0.4380 | ±0.0222 |

meta-llama-3-8b-instruct (Base)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| winogrande | 1 | none | 0 | acc | 0.7206 | ±0.0126 |
| arc_challenge | 1 | none | 0 | acc | 0.5299 | ±0.0146 |
| | | | 0 | acc_norm | 0.5683 | ±0.0145 |
| arc_easy | 1 | none | 0 | acc | 0.8161 | ±0.0079 |
| | | | 0 | acc_norm | 0.7976 | ±0.0082 |
| piqa | 1 | none | 0 | acc | 0.7867 | ±0.0096 |
| | | | 0 | acc_norm | 0.7856 | ±0.0096 |
| hellaswag | 1 | none | 0 | acc | 0.5769 | ±0.0049 |
| | | | 0 | acc_norm | 0.7581 | ±0.0043 |
| openbookqa | 1 | none | 0 | acc | 0.3420 | ±0.0212 |
| | | | 0 | acc_norm | 0.4320 | ±0.0222 |

Long-Bench Evaluation

triviaqa

| Model | Score |
|---|---|
| meta-llama-3-8b-instruct-palu | 89.45 |
| meta-llama-3-8b-instruct (Base) | 90.56 |

qasper

| Model | Score |
|---|---|
| meta-llama-3-8b-instruct-palu | 34.92 |
| meta-llama-3-8b-instruct (Base) | 31.74 |

Key Features

  • Model: Meta Llama-3-8B-Instruct
  • Compression Framework: Palu
  • Compression Rate: Up to 91.25% KV-Cache memory reduction (as reported by the Palu authors; see the rough size estimate after this list)
  • Accuracy: Perplexity and zero-shot scores remain close to the base model (see the results above)
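
For a rough sense of what is being compressed, the back-of-the-envelope estimate below computes the uncompressed FP16 KV-Cache footprint of Llama-3-8B from its published architecture (32 layers, 8 grouped-query KV heads with head dimension 128). The 91.25% figure above comes from the Palu authors and is not re-derived here.

# Rough FP16 KV-Cache size for Llama-3-8B (architecture values: 32 layers,
# 8 KV heads under grouped-query attention, head dimension 128).
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # keys + values
print(per_token)                 # 131072 bytes = 128 KiB per token
print(per_token * 8192 / 2**30)  # 1.0 GiB for an 8K-token context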

Installation

Clone the Repository

Ensure you have Git and Conda installed on your system.

git clone --recurse-submodules https://github.com/shadowpa0327/Palu.git
cd Palu

Set Up the Environment

Create and activate a Conda environment.

conda create -n Palu python=3.10
conda activate Palu
pip install -r requirements.txt

Install Third-Party Libraries

pip install -e 3rdparty/lm-evaluation-harness
pip install -e 3rdparty/fast-hadamard-transform

Usage

Compress the Model

To compress Meta Llama-3-8B-Instruct using Palu's low-rank decomposition, use the following command:

python compress.py \
--model_id="meta-llama/Meta-Llama-3-8B-Instruct" \
--calib_dataset wikitext2 \
--param_ratio_target 0.7 \
--search_method fisher_uniform \
--head_group_size 4 \
--dump_huggingface_model \
--use_cache 

The compressed model will be saved in the Meta-Llama-3-8b-instruct_ratio-0.7_gs-4-fisher_uniform directory in Hugging Face format.
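
Once the dump finishes, the checkpoint should be loadable with the standard transformers API. The snippet below is a minimal sketch; it assumes the dumped directory includes Palu's custom modeling code (hence trust_remote_code=True), and the path is simply the example directory name from above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "Meta-Llama-3-8b-instruct_ratio-0.7_gs-4-fisher_uniform"  # adjust to your output directory

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # assumed: the dump ships Palu's custom attention modules
)

inputs = tokenizer("The key idea behind KV-Cache compression is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))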

Evaluate the Compressed Model

Perplexity

To evaluate the perplexity on the wikitext2 dataset with sequence length 2048, run:

python run_ppl_eval.py \
--model_name_or_path /Path/To/Palu/Model \
--datasets wikitext2 \
--seqlen 2048
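
Conceptually, the script concatenates the wikitext-2 test split, slices it into non-overlapping windows of the given sequence length, and reports the exponentiated average token negative log-likelihood. The sketch below reproduces that recipe with plain transformers/datasets code; it is a conceptual stand-in, not the repository's run_ppl_eval.py.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

path, seqlen = "/Path/To/Palu/Model", 2048
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

nlls = []
for i in range(ids.shape[1] // seqlen):
    chunk = ids[:, i * seqlen:(i + 1) * seqlen].to(model.device)
    with torch.no_grad():
        nlls.append(model(chunk, labels=chunk).loss.float())  # mean NLL over this window
print("PPL:", torch.exp(torch.stack(nlls).mean()).item())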

To evaluate with 3-bit low-rank-aware quantization (here with sequence length 4096), use:

python run_ppl_eval.py \
--model_name_or_path /Path/To/Palu/Model \
--datasets wikitext2 \
--seqlen 4096 \
--lt_bits 3 \
--lt_hadamard 

Zero-shot Evaluation

For zero-shot evaluations, use the following command:

CUDA_VISIBLE_DEVICES=0 python run_lm_eval.py \
--model_name_or_path "/Path/To/Palu/Model" \
--tasks "openbookqa,hellaswag,piqa,arc_easy,arc_challenge,winogrande"
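
run_lm_eval.py wraps the bundled lm-evaluation-harness. If you prefer to drive the harness from Python, recent versions expose a simple_evaluate entry point; the sketch below is an assumed equivalent of the command above, and the repository wrapper may additionally register Palu-specific model code.

import lm_eval

# Sketch: calling the harness directly (may differ from run_lm_eval.py's internals).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/Path/To/Palu/Model,dtype=float16,trust_remote_code=True",
    tasks=["openbookqa", "hellaswag", "piqa", "arc_easy", "arc_challenge", "winogrande"],
    num_fewshot=0,
)
for task, metrics in results["results"].items():
    print(task, metrics)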

Long-Bench Evaluation

Evaluate the compressed model on long-bench tasks:

CUDA_VISIBLE_DEVICES=0 python run_long_bench.py \
--model_name_or_path /Path/To/Palu/Model

Latency Evaluation

Attention Module

Evaluate the latency of the Palu-compressed attention module:

CUDA_VISIBLE_DEVICES=0 python run_latency_attention.py \
--rank_k 1024 --rank_v 3072 --group_size 4 \
--prompt_len 65536 --palu

Reconstruction Kernel

Evaluate the latency of the reconstruction kernel:

CUDA_VISIBLE_DEVICES=0 python run_latency_kernel.py \
--total_rank 1024  --group_size 4
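
If you want an independent sanity check, GPU latency is usually measured with CUDA events after a few warm-up iterations. The sketch below shows that pattern on a stand-in matmul with a 1024-dimensional latent; it is a generic timing template, not the Palu reconstruction kernel.

import torch

# Generic CUDA-event timing pattern on a stand-in workload (not Palu's kernel).
latent = torch.randn(4096, 1024, dtype=torch.float16, device="cuda")   # total_rank = 1024
up_proj = torch.randn(1024, 4096, dtype=torch.float16, device="cuda")

for _ in range(10):  # warm-up so one-time setup costs do not skew the timing
    _ = latent @ up_proj

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    _ = latent @ up_proj
end.record()
torch.cuda.synchronize()
print("avg latency (ms):", start.elapsed_time(end) / 100)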

Conclusion

This compressed version of Meta Llama-3-8B-Instruct, powered by Palu, is optimized for memory efficiency with only a small impact on accuracy (see the evaluations above). Whether you're working with long contexts or deploying models in memory-constrained environments, this setup is designed to provide robust results with a much smaller KV-Cache footprint.
