---
license: apache-2.0
---

# ReplaceMe: Training-Free Transformer Pruning via Layer Removal & Linear Transformations
[![arXiv](https://img.shields.io/badge/arXiv-2505.02819-b31b1b.svg)](https://arxiv.org/abs/2505.02819)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)


![ReplaceMe Logo](./figs/logo2.jpg)

## Model Description
ReplaceMe is a novel method for transformer model compression that enables **training-free** block/layer pruning while preserving model performance through linear transformations. The approach:

- Identifies and removes a contiguous block of layers
- Applies a mathematically derived linear transformation to preserve information flow (see the sketch below)
- Requires no fine-tuning or retraining
- Works with standard transformer architectures (the linear transformations, LTs, are merged into the original model weights)
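
To make the idea concrete, here is a minimal, illustrative sketch (not the package's actual code): estimating the replacement transform by ordinary least squares, assuming calibration activations `H_in` (hidden states entering the pruned block) and `H_out` (hidden states leaving it) have already been collected. All names and shapes below are hypothetical.

```python
# Illustrative sketch only -- names and shapes are hypothetical, not the
# ReplaceMe package API. We estimate a single linear map T that mimics the
# pruned block's effect on hidden states.
import numpy as np

n_tokens, hidden = 8192, 4096  # calibration tokens x hidden size (example values)
H_in = np.random.randn(n_tokens, hidden).astype(np.float32)   # stand-in: block inputs
H_out = np.random.randn(n_tokens, hidden).astype(np.float32)  # stand-in: block outputs

# Solve min_T || H_in @ T - H_out ||_F via ordinary least squares.
T, residuals, rank, _ = np.linalg.lstsq(H_in, H_out, rcond=None)

# T (hidden x hidden) can then be folded into an adjacent weight matrix,
# so the pruned model keeps the stock transformer architecture.
print(T.shape)  # (4096, 4096)
```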

## Key Features
- πŸš€ **Zero-Training Pruning**: Remove layers without any fine-tuning
- 🧠 **Performance Preservation**: <8% accuracy drop in most cases
- ⚑ **Instant Speedup**: fewer blocks -> faster inference and lower memory use
- πŸ”Œ **Plug-and-Play**: Works with existing HuggingFace models

## πŸ”₯ Performance Comparison of Pruning Methods (Llama 3.1, 25% Compression)

| Method                | Pruned Layers | Calibration Dataset | State       | race 🏁 (acc) | winogrande 🎲 (acc) | piqa 🧠 (acc_norm) | boolq ❓ (acc) | openbookqa πŸ“– (acc_norm) | sciq πŸ”¬ (acc_norm) | lambada_openai πŸ¦™ (acc) | ppl       | Avg-acc πŸ“Š |
|-----------------------|---------------|---------------------|-------------|---------------|---------------------|--------------------|----------------|--------------------------|--------------------|--------------------------|-----------|------------|
| **Llama 3.1** (baseline) | -          | -                   | -           | 0.450         | 0.779               | 0.810              | 0.842          | 0.430                    | 0.961              | 0.732                    | 3.404     | **0.712**  |
| **UIDL***             | 8             | slim_orca           | no training | 0.341         | 0.719               | 0.690              | 0.773          | 0.310                    | 0.719              | 0.087                    | 932.000   | 0.592      |
| **ReplaceMe** (Ours) βœ… | 8           | slim_orca           | no training | 0.406         | **0.742** πŸ†        | 0.706              | 0.830          | 0.338                    | 0.901              | 0.471                    | 16.760    | 0.654      |
| **ReplaceMe** (Ours) ❌ | 8           | slim_orca           | SFT         | **0.431** πŸ†  | 0.716               | **0.728** πŸ†       | **0.849** πŸ†   | **0.378** πŸ†             | **0.912** πŸ†       | **0.697** πŸ†             | 4.040 πŸ†  | **0.669** πŸ† |

**Key:**
- πŸ† Best result among pruned models in that column
- βœ… Training-free (our method)
- ❌ Requires training

**Metrics Explained:**
- **Bold**: best results
- All task numbers are accuracy scores; `ppl` is perplexity (lower is better)

> πŸ”₯ **Our healed model achieves 94.0% of baseline performance after healing on 1B tokens!**

## Installation
```bash
pip install replaceme
# or
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .
```
## Basic Usage
```bash
# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
```
There are many parameters you can tune; visit our repo and discover them πŸ”₯πŸ”₯
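
For intuition, here is a toy sketch (hypothetical code, not the package internals) of the general idea used by ReplaceMe and prior layer-pruning work for choosing *which* contiguous block to drop: score each candidate block by how little it changes its input, using the cosine similarity between the hidden states entering and leaving the block. The activation array below is a random stand-in for real calibration activations.

```python
# Toy sketch (hypothetical, not the package internals): score each candidate
# contiguous block of n_prune layers by the mean cosine similarity between
# the hidden states entering and leaving the block; the highest-scoring
# block changes the representation least and is the pruning candidate.
import numpy as np

def block_score(layer_acts, start, n_prune):
    h_in = layer_acts[start]              # (tokens, hidden) entering the block
    h_out = layer_acts[start + n_prune]   # (tokens, hidden) leaving the block
    cos = (h_in * h_out).sum(-1) / (
        np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    )
    return cos.mean()

# Stand-in activations: (n_layers + 1) snapshots of shape (tokens, hidden)
layer_acts = [np.random.randn(1024, 4096) for _ in range(33)]
n_prune = 8
scores = [block_score(layer_acts, s, n_prune) for s in range(len(layer_acts) - n_prune)]
best_start = int(np.argmax(scores))
print(f"candidate: prune layers {best_start}..{best_start + n_prune - 1}")
```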
## Load Model
Because the LTs are merged into the original model weights, the pruned model keeps the standard architecture and loads like any other HuggingFace model:
```python
## EXAMPLE
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama3.1-6B-ReplaceMe-Healed"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the ReplaceMe pruning method?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Decode only the newly generated tokens, skipping the echoed prompt
generated = output[0][model_inputs.input_ids.shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
```
## Citation
If you use ReplaceMe in your research, please cite our paper:

```bibtex
@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}
```