---
license: apache-2.0
---
# ReplaceMe: Training-Free Transformer Pruning via Layer Removal & Linear Transformations
## Model Description
ReplaceMe is a novel method for transformer model compression that enables training-free block/layer pruning while maintaining model performance through linear transformations. The approach:
- Identifies and removes a contiguous block of layers
- Applies mathematically derived linear transformations (LTs) to preserve information flow
- Requires no fine-tuning or retraining
- Works with standard transformer architectures (the LTs are merged into the original model weights; see the sketch below)
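To make this concrete, here is a minimal illustrative sketch of the core idea (not the library's actual API): estimate a linear transformation `T` between the activations entering and leaving the pruned block via least squares, then fold `T` into an adjacent weight matrix. All tensors below are hypothetical stand-ins.

```python
# Illustrative sketch only: hypothetical tensors, not the ReplaceMe API.
import torch

hidden_dim, num_tokens = 1024, 4096

# Calibration activations collected on a small dataset:
# X = hidden states entering the pruned block, Y = hidden states leaving it.
X = torch.randn(num_tokens, hidden_dim)
Y = torch.randn(num_tokens, hidden_dim)

# Closed-form least-squares estimate: T = argmin_T ||X @ T - Y||_F
T = torch.linalg.lstsq(X, Y).solution  # shape: [hidden_dim, hidden_dim]

# Because T is linear, it can be folded into an adjacent weight matrix, so the
# pruned model keeps the standard transformer architecture: if the preceding
# layer computes X = H @ W, then X @ T = H @ (W @ T).
W = torch.randn(hidden_dim, hidden_dim)  # stand-in for an existing weight
W_merged = W @ T
```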
## Key Features
- 🚀 Zero-Training Pruning: remove layers without any fine-tuning
- 🧠 Performance Preservation: <8% accuracy drop in most cases
- ⚡ Instant Speedup: fewer blocks → faster inference and lower memory use
- 🔌 Plug-and-Play: works with existing HuggingFace models
## 🔥 Performance Comparison of Pruning Methods (Llama 3.1 8B, 25% Compression)
| Method | # Pruned Layers | Dataset | State | race (acc) | winogrande (acc) | piqa (acc_norm) | boolq (acc) | openbookqa (acc_norm) | sciq (acc_norm) | lambada_openai (acc) | ppl ↓ | Avg acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 (baseline) | - | - | - | 0.450 | 0.779 | 0.810 | 0.842 | 0.430 | 0.961 | 0.732 | 3.404 | 0.712 |
| UIDL* | 8 | slim_orca | no training | 0.341 | 0.719 | 0.690 | 0.773 | 0.310 | 0.719 | 0.087 | 932.000 | 0.592 |
| ReplaceMe (ours) ✅ | 8 | slim_orca | no training | **0.406** | **0.742** 🏆 | **0.706** | **0.830** | **0.338** | **0.901** | **0.471** | **16.760** | **0.654** |
| ReplaceMe (ours) ❌ | 8 | slim_orca | SFT | 0.431 🏆 | 0.716 | 0.728 🏆 | 0.849 🏆 | 0.378 🏆 | 0.912 🏆 | 0.697 🏆 | 4.04 🏆 | 0.669 🏆 |
Key:
- 🏆 Best performance in column
- ✅ Training-free (our method)
- ❌ Requires training

Metrics Explained:
- **Bold**: best training-free results
- All benchmark numbers are accuracy scores; ppl is perplexity (lower is better)
🔥 Our healed model achieves 94.0% of baseline performance (0.669 vs. 0.712 average accuracy) after healing on 1B tokens!
## Installation

```bash
pip install replaceme
# or install from source
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .
```
## Basic Usage

```bash
# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
```
There are many parameters you can play with; visit our repo to discover them 🔥🔥
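For the cosine variant, a reasonable mental model (our sketch, not the repo's implementation) is that `T` is estimated by numerical optimization of a cosine-distance objective instead of the closed-form solve; `X` and `Y` are the same hypothetical calibration activations as in the sketch above:

```python
# Rough sketch of a cosine-distance objective, not the repo's implementation.
import torch
import torch.nn.functional as F

hidden_dim, num_tokens = 1024, 4096
X = torch.randn(num_tokens, hidden_dim)  # hypothetical calibration inputs
Y = torch.randn(num_tokens, hidden_dim)  # hypothetical calibration targets

T = torch.eye(hidden_dim, requires_grad=True)  # start from the identity
optimizer = torch.optim.Adam([T], lr=1e-3)

for step in range(200):
    optimizer.zero_grad()
    # Minimize 1 - cosine similarity between X @ T and Y, averaged over tokens.
    loss = 1 - F.cosine_similarity(X @ T, Y, dim=-1).mean()
    loss.backward()
    optimizer.step()
```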
## Load Model

Since the LTs are merged into the original model weights, the pruned model loads like any other HuggingFace model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama3.1-6B-ReplaceMe-Healed"

# Load the pruned model exactly like any standard HF checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the ReplaceMe pruning method?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512
)
response = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(response)
```
## Citation

If you use ReplaceMe in your research, please cite our paper:

```bibtex
@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}
```