---
license: apache-2.0
---

# ReplaceMe: Training-Free Transformer Pruning via Layer Removal & Linear Transformations
[![arXiv](https://img.shields.io/badge/arXiv-2505.02819-b31b1b.svg)](https://arxiv.org/abs/2505.02819)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)


![ReplaceMe Logo](./figs/logo2.jpg)

## Model Description
ReplaceMe is a novel method for transformer model compression that enables **training-free** block/layer pruning while preserving model performance through linear transformations. The approach:

- Identifies and removes a contiguous block of layers
- Applies a mathematically derived linear transformation to preserve information flow (see the sketch below)
- Requires no fine-tuning or retraining
- Works with standard transformer architectures (the linear transformations, LTs, are merged into the original model weights)
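
To make the idea concrete, here is a minimal, illustrative sketch (not the package's actual code): estimating the replacement transform by ordinary least squares, assuming calibration activations `H_in` (hidden states entering the pruned block) and `H_out` (hidden states leaving it) have already been collected. All names and shapes below are hypothetical.

```python
# Illustrative sketch only -- names and shapes are hypothetical, not the
# ReplaceMe package API. We estimate a single linear map T that mimics the
# pruned block's effect on hidden states.
import numpy as np

n_tokens, hidden = 8192, 4096  # calibration tokens x hidden size (example values)
H_in = np.random.randn(n_tokens, hidden).astype(np.float32)   # stand-in: block inputs
H_out = np.random.randn(n_tokens, hidden).astype(np.float32)  # stand-in: block outputs

# Solve min_T || H_in @ T - H_out ||_F via ordinary least squares.
T, residuals, rank, _ = np.linalg.lstsq(H_in, H_out, rcond=None)

# T (hidden x hidden) can then be folded into an adjacent weight matrix,
# so the pruned model keeps the stock transformer architecture.
print(T.shape)  # (4096, 4096)
```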

## Key Features
- πŸš€ **Zero-Training Pruning**: Remove layers without any fine-tuning
- 🧠 **Performance Preservation**: <8% accuracy drop in most cases
- ⚑ **Instant Speedup**: fewer blocks -> faster inference and lower memory use
- πŸ”Œ **Plug-and-Play**: Works with existing HuggingFace models

## πŸ”₯ Performance Comparison of Pruning Methods (Llama 3.1, 25% Compression)

| Method                | Pruned Layers | Calibration Dataset | State       | race 🏁 (acc) | winogrande 🎲 (acc) | piqa 🧠 (acc_norm) | boolq ❓ (acc) | openbookqa πŸ“– (acc_norm) | sciq πŸ”¬ (acc_norm) | lambada_openai πŸ¦™ (acc) | ppl       | Avg-acc πŸ“Š |
|-----------------------|---------------|---------------------|-------------|---------------|---------------------|--------------------|----------------|--------------------------|--------------------|--------------------------|-----------|------------|
| **Llama 3.1** (baseline) | -          | -                   | -           | 0.450         | 0.779               | 0.810              | 0.842          | 0.430                    | 0.961              | 0.732                    | 3.404     | **0.712**  |
| **UIDL***             | 8             | slim_orca           | no training | 0.341         | 0.719               | 0.690              | 0.773          | 0.310                    | 0.719              | 0.087                    | 932.000   | 0.592      |
| **ReplaceMe** (Ours) βœ… | 8           | slim_orca           | no training | 0.406         | **0.742** πŸ†        | 0.706              | 0.830          | 0.338                    | 0.901              | 0.471                    | 16.760    | 0.654      |
| **ReplaceMe** (Ours) ❌ | 8           | slim_orca           | SFT         | **0.431** πŸ†  | 0.716               | **0.728** πŸ†       | **0.849** πŸ†   | **0.378** πŸ†             | **0.912** πŸ†       | **0.697** πŸ†             | 4.040 πŸ†  | **0.669** πŸ† |

**Key:**
- πŸ† Best result among pruned models in that column
- βœ… Training-free (our method)
- ❌ Requires training

**Metrics Explained:**
- **Bold**: best results
- All task numbers are accuracy scores; `ppl` is perplexity (lower is better)

> πŸ”₯ **Our healed model achieves 94.0% of baseline performance after healing on 1B tokens!**

## Installation
```bash
pip install replaceme
# or
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .
```
## Basic Usage
```bash
# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
```
There are many parameters you can tune; visit our repo and discover them πŸ”₯πŸ”₯
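
For intuition, here is a toy sketch (hypothetical code, not the package internals) of the general idea used by ReplaceMe and prior layer-pruning work for choosing *which* contiguous block to drop: score each candidate block by how little it changes its input, using the cosine similarity between the hidden states entering and leaving the block. The activation array below is a random stand-in for real calibration activations.

```python
# Toy sketch (hypothetical, not the package internals): score each candidate
# contiguous block of n_prune layers by the mean cosine similarity between
# the hidden states entering and leaving the block; the highest-scoring
# block changes the representation least and is the pruning candidate.
import numpy as np

def block_score(layer_acts, start, n_prune):
    h_in = layer_acts[start]              # (tokens, hidden) entering the block
    h_out = layer_acts[start + n_prune]   # (tokens, hidden) leaving the block
    cos = (h_in * h_out).sum(-1) / (
        np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    )
    return cos.mean()

# Stand-in activations: (n_layers + 1) snapshots of shape (tokens, hidden)
layer_acts = [np.random.randn(1024, 4096) for _ in range(33)]
n_prune = 8
scores = [block_score(layer_acts, s, n_prune) for s in range(len(layer_acts) - n_prune)]
best_start = int(np.argmax(scores))
print(f"candidate: prune layers {best_start}..{best_start + n_prune - 1}")
```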
## Load Model
Because the LTs are merged into the original model weights, the pruned model keeps the standard architecture and loads like any other HuggingFace model:
```python
## EXAMPLE
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama3.1-6B-ReplaceMe-Healed"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the ReplaceMe pruning method?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Decode only the newly generated tokens, skipping the echoed prompt
generated = output[0][model_inputs.input_ids.shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
```
## Citation
If you use ReplaceMe in your research, please cite our paper:

```bibtex
@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}
```