---
license: apache-2.0
---
# ReplaceMe: Training-Free Transformer Pruning via Layer Removal & Linear Transformations
[![arXiv](https://img.shields.io/badge/arXiv-2505.02819-b31b1b.svg)](https://arxiv.org/abs/2505.02819)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
![ReplaceMe Logo](./figs/logo2.jpg)
## Model Description
ReplaceMe is a novel method for transformer model compression that enables **training-free** block/layer pruning while maintaining model performance through linear transformations. The approach:
- Identifies and removes contiguous blocks of layers
- Applies mathematically derived linear transformations to preserve information flow (see the sketch below)
- Requires no fine-tuning or retraining
- Works with standard transformer architectures (the linear transformations, LTs, are merged into the original model weights)
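Conceptually, the core step is an ordinary least-squares fit: collect hidden states entering and leaving the block to be pruned on a small calibration set, solve for the linear map that best carries one onto the other, and fold that map into an adjacent weight matrix. A minimal sketch of this idea (the function name and toy data are illustrative, not the package's API):
```python
# Minimal sketch of the least-squares (LSTSQ) idea behind ReplaceMe.
# X: hidden states entering the pruned block, Y: hidden states leaving it,
# both gathered on a small calibration set. Illustrative only, not the repo's API.
import torch

def estimate_linear_transform(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Solve min_T ||X @ T - Y||_F via ordinary least squares."""
    return torch.linalg.lstsq(X, Y).solution  # T has shape (hidden, hidden)

# Toy stand-ins for real calibration activations
hidden = 64
X = torch.randn(4096, hidden)
Y = X @ torch.randn(hidden, hidden) + 0.01 * torch.randn(4096, hidden)

T = estimate_linear_transform(X, Y)
# In ReplaceMe, T is merged into an adjacent weight matrix, so the pruned
# model keeps the standard architecture and needs no extra modules.
print(T.shape)  # torch.Size([64, 64])
```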
## Key Features
- πŸš€ **Zero-Training Pruning**: Remove layers without any fine-tuning
- 🧠 **Performance Preservation**: <8% accuracy drop in most cases
- ⚑ **Instant Speedup**: fewer blocks → faster inference and lower memory use
- πŸ”Œ **Plug-and-Play**: Works with existing HuggingFace models
## πŸ”₯ Performance Comparison of Pruning Methods (Llama 3.1 8B, 25% Compression)
| Method | Pruned layers | Calibration dataset | State | race 🏁 (acc) | winogrande 🎲 (acc) | piqa 🧠 (acc_norm) | boolq ❓ (acc) | openbookqa πŸ“– (acc_norm) | sciq πŸ”¬ (acc_norm) | lambada_openai πŸ¦™ (acc) | ppl ↓ | Avg-acc πŸ“Š |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Llama 3.1** (baseline) | – | – | – | 0.450 | 0.779 | 0.810 | 0.842 | 0.430 | 0.961 | 0.732 | 3.404 | **0.712** |
| **UIDL**\* | 8 | slim_orca | no training | 0.341 | 0.719 | 0.690 | 0.773 | 0.310 | 0.719 | 0.087 | 932.000 | 0.592 |
| **ReplaceMe** (ours) βœ… | 8 | slim_orca | no training | 0.406 | **0.742** πŸ† | 0.706 | 0.830 | 0.338 | 0.901 | 0.471 | 16.760 | 0.654 |
| **ReplaceMe** (ours) ❌ | 8 | slim_orca | SFT | **0.431** πŸ† | 0.716 | **0.728** πŸ† | **0.849** πŸ† | **0.378** πŸ† | **0.912** πŸ† | **0.697** πŸ† | 4.040 πŸ† | **0.669** πŸ† |
**Key:**
- πŸ† Best result in column among pruned models
- βœ… Training-free (our method)
- ❌ Requires training (SFT)

**Metrics Explained:**
- **Bold**: best results among pruned models
- All numbers except **ppl** (perplexity, lower is better) are accuracy scores
> πŸ”₯ **Our healed model achieves 94.0% of baseline performance after healing on 1B tokens!**
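The task names in the table match those used by EleutherAI's `lm-evaluation-harness`. Below is a minimal sketch for reproducing a subset of the scores, assuming that harness (`pip install lm-eval`) and the healed checkpoint from the Load Model section; exact numbers will depend on harness version and settings:
```python
# Hedged sketch: evaluate the healed checkpoint on a few of the
# benchmarks above with EleutherAI's lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # standard HuggingFace causal-LM backend
    model_args="pretrained=MTSAIR/Llama3.1-6B-ReplaceMe-Healed,dtype=auto",
    tasks=["piqa", "boolq", "winogrande", "lambada_openai"],
    batch_size=8,
)
# results["results"] maps each task to its metrics (acc, acc_norm, ppl, ...)
for task, metrics in results["results"].items():
    print(task, metrics)
```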
## Installation
```bash
pip install replaceme
# or install from source
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .
```
## Basic Usage
```bash
# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml
# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
```
There are many parameters you can play with; visit our repo to discover them πŸ”₯πŸ”₯
## Load Model
As noted above, the LTs are merged into the original model weights, so you load the pruned model just like any other HuggingFace model:
```python
# Example: load and chat with a healed ReplaceMe model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama3.1-6B-ReplaceMe-Healed"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the ReplaceMe pruning method?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Drop the prompt tokens so only the generated answer is decoded
generated = output[:, model_inputs.input_ids.shape[1]:]
response = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```
## Citation
If you use ReplaceMe in your research, please cite our paper:
```bibtex
@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}
```