---
license: apache-2.0
---

# ReplaceMe: Training-Free Transformer Pruning via Layer Removal & Linear Transformations

[arXiv:2505.02819](https://arxiv.org/abs/2505.02819)

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

## Model Description

ReplaceMe is a method for transformer model compression that enables **training-free** block/layer pruning while maintaining model performance through linear transformations (LTs). The approach:

- Identifies and removes a block of layers
- Applies mathematically derived linear transformations to preserve information flow
- Requires no fine-tuning or retraining
- Works with standard transformer architectures (the LTs are merged into the original model weights)
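
The steps above can be illustrated on a toy stack of linear layers. This is a minimal NumPy sketch under stated assumptions, not the `replaceme` package's API: the helper names (`select_block`, `fit_linear_transform`), the cosine-distance selection rule, and the preceding projection `w_prev` are hypothetical and chosen only for illustration. Because the toy layers are linear, the fitted transformation reproduces the removed span exactly; in a real transformer the pruned block is nonlinear, so the fit is only an approximation.

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean cosine distance between matching rows of a and b (tokens x hidden)."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(1.0 - num / den))


def select_block(hidden_states: list[np.ndarray], n_prune: int) -> int:
    """Start index of the n_prune-layer span whose input and output are most similar.

    hidden_states[i] is the activation entering layer i, so the span
    i .. i + n_prune - 1 maps hidden_states[i] to hidden_states[i + n_prune].
    """
    distances = [
        cosine_distance(hidden_states[i], hidden_states[i + n_prune])
        for i in range(len(hidden_states) - n_prune)
    ]
    return int(np.argmin(distances))


def fit_linear_transform(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares solution T of x @ T ~= y."""
    t, *_ = np.linalg.lstsq(x, y, rcond=None)
    return t


# Toy "model": six linear layers; layers 2 and 3 barely change their input.
rng = np.random.default_rng(0)
tokens, hidden = 256, 16
scales = [0.5, 0.5, 0.01, 0.01, 0.5, 0.5]
layers = [
    np.eye(hidden) + s * rng.standard_normal((hidden, hidden)) / np.sqrt(hidden)
    for s in scales
]

# Calibration activations: hidden_states[i] enters layer i.
hidden_states = [rng.standard_normal((tokens, hidden))]
for layer in layers:
    hidden_states.append(hidden_states[-1] @ layer)

n_prune = 2
start = select_block(hidden_states, n_prune)
x, y = hidden_states[start], hidden_states[start + n_prune]
t = fit_linear_transform(x, y)

# Merge T into a (hypothetical) preceding projection so no new layer is added.
w_prev = rng.standard_normal((hidden, hidden))
w_merged = w_prev @ t
z = rng.standard_normal((tokens, hidden))

print("pruned span starts at layer", start)
print("relative fit error:", np.linalg.norm(x @ t - y) / np.linalg.norm(y))
print("merge matches:", np.allclose((z @ w_prev) @ t, z @ w_merged))
```

The last check shows why merging keeps the architecture standard: applying `w_prev` and then `t` gives the same result as applying the single matrix `w_prev @ t`, so the transformation can be folded into an existing weight instead of being stored as an extra layer.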

## Key Features

- **Zero-Training Pruning**: Remove layers without any fine-tuning
- **Performance Preservation**: <8% accuracy drop in most cases
- **Instant Speedup**: Fewer blocks -> faster inference and lower memory use
- **Plug-and-Play**: Works with existing Hugging Face models

## Performance Comparison of Pruning Methods (Llama 3.1, 25% Compression)

| Method | num_pruned_layers | Dataset | State | race (acc) | winogrande (acc) | piqa (acc_norm) | boolq (acc) | openbookqa (acc_norm) | sciq (acc_norm) | lambada_openai (acc) | ppl | Avg-acc |
|--------|-------------------|---------|-------|------------|------------------|-----------------|-------------|-----------------------|-----------------|----------------------|-----|---------|
| **Llama 3.1** (baseline) | - | - | - | 0.450 | 0.779 | 0.810 | 0.842 | 0.430 | 0.961 | 0.732 | 3.404 | **0.712** |
| **UIDL**\* | 8 | slim_orca | no training | 0.341 | 0.719 | 0.690 | 0.773 | 0.310 | 0.719 | 0.087 | 932.000 | 0.592 |
| **ReplaceMe** (Ours) ✅ | 8 | slim_orca | no training | 0.406 | **0.742** 🏆 | 0.706 | 0.830 | 0.338 | 0.901 | 0.471 | 16.760 | 0.654 |
| **ReplaceMe** (Ours) ❌ | 8 | slim_orca | SFT | **0.431** 🏆 | 0.716 | **0.728** 🏆 | **0.849** 🏆 | **0.378** 🏆 | **0.912** 🏆 | **0.697** 🏆 | 4.04 🏆 | **0.669** 🏆 |

**Key:**

- 🏆 Best performance in column
- ✅ Training-free (our methods)
- ❌ Requires training

**Metrics Explained:**

- **Bold**: Best result among the pruned models in each column
- All numbers are accuracy scores, except `ppl` (perplexity, lower is better)

> **Our healed model can achieve 94.0% of baseline performance after healing on 1B tokens!**
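
The 94.0% figure can be checked against the Avg-acc column of the table above. The short sketch below assumes the figure is the ratio of a pruned model's average accuracy to the baseline's; the `retained` helper is a hypothetical name used only for this illustration.

```python
# Avg-acc values copied from the comparison table.
baseline_avg = 0.712
pruned_avg = {
    "ReplaceMe (no training)": 0.654,
    "ReplaceMe (SFT, healed)": 0.669,
}


def retained(avg: float, baseline: float) -> float:
    """Share of the baseline average accuracy kept by a pruned model, in percent."""
    return 100.0 * avg / baseline


for name, avg in pruned_avg.items():
    print(f"{name}: {retained(avg, baseline_avg):.1f}% of baseline")
# ReplaceMe (no training): 91.9% of baseline
# ReplaceMe (SFT, healed): 94.0% of baseline
```

Under the same assumption, the training-free variant loses roughly 8% of the baseline average, and healing recovers about two more points.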

## Installation

```bash
pip install replaceme
# or
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .
```

## Basic Usage

```bash
# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
```

There are many more parameters you can play with; visit our repo to discover them.

## Load Model

Because the LTs are merged into the original transformer weights, the model loads like any other Hugging Face model:

```python
# Example: load the pruned model and generate a chat response
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama3.1-6B-ReplaceMe-Healed"

# The LTs are already merged into the weights, so no custom code is needed
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the ReplaceMe pruning method?"
messages = [
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Keep only the newly generated tokens, dropping the echoed prompt
generated = output[:, model_inputs.input_ids.shape[1]:]
response = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```

## Citation

If you use ReplaceMe in your research, please cite our paper:

```bibtex
@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}
```