
	---
	license: apache-2.0
	---

	# ReplaceMe: Training-Free Transformer Pruning via Layer Removal & Linear Transformations
	[![arXiv](https://img.shields.io/badge/arXiv-2505.02819-b31b1b.svg)](https://arxiv.org/abs/2505.02819)
	[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)


	![ReplaceMe Logo](./figs/logo2.jpg)

	## Model Description
	ReplaceMe is a novel method for transformer model compression that enables training-free block/layer pruning while maintaining model performance through linear transformations. The approach:

	- Identifies and removes a contiguous block of layers
	- Applies mathematically derived linear transformations (LTs) to preserve information flow
	- Requires no fine-tuning or retraining
	- Works with standard transformer architectures, because the LTs are merged into the original model weights
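
To make the "merged into the original model weights" point concrete, here is a toy NumPy sketch. It is not the authors' implementation: the shapes, the random calibration data, and the names `H`, `W_down`, `T_true`, and `W_merged` are all assumptions made for illustration. It fits a linear map by least squares and then folds it into the preceding weight matrix, so no extra layer is needed at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_ff, d_model = 512, 64, 16  # toy sizes, chosen only for illustration

# Hypothetical calibration data:
#   H      - activations entering the last linear layer before the removed block
#   W_down - that layer's weight, so X = H @ W_down is what it emits
H = rng.normal(size=(n_tokens, d_ff))
W_down = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
X = H @ W_down

# Pretend the removed block behaves like an unknown, nearly linear map plus noise.
T_true = np.eye(d_model) + 0.1 * rng.normal(size=(d_model, d_model))
Y = X @ T_true + 0.01 * rng.normal(size=(n_tokens, d_model))

# Least-squares estimate of the linear transformation: T = argmin ||X T - Y||_F
T, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Merge step: (H @ W_down) @ T == H @ (W_down @ T), so T folds into the weight.
W_merged = W_down @ T

rel_err = np.linalg.norm(H @ W_merged - Y) / np.linalg.norm(Y)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Because `W_merged` has the same shape as `W_down`, the pruned network keeps a standard architecture and only the weight values change.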

	## Key Features
	- 🚀 Zero-Training Pruning: Remove layers without any fine-tuning
	- 🧠 Performance Preservation: <8% accuracy drop in most cases
	- ⚡ Instant Speedup: fewer blocks -> faster inference + less memory
	- 🔌 Plug-and-Play: Works with existing HuggingFace models
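
As a rough illustration of the memory saving from removing blocks, the sketch below counts parameters before and after dropping 8 decoder layers. It assumes the base model is Llama 3.1 8B and uses that model's publicly documented configuration values (32 layers, hidden size 4096, and so on); this card itself only names "Llama 3.1" and "6B", so treat the numbers as an estimate rather than an official figure.

```python
# Assumed base model: Llama 3.1 8B (values from its public configuration).
hidden, intermediate = 4096, 14336
n_layers, n_heads, n_kv_heads = 32, 32, 8
vocab = 128256
head_dim = hidden // n_heads

# Attention: q and o projections are hidden x hidden; k and v use grouped-query heads.
attn = 2 * hidden * hidden + 2 * hidden * (n_kv_heads * head_dim)
mlp = 3 * hidden * intermediate      # gate, up and down projections
norms = 2 * hidden                   # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

embeddings = 2 * vocab * hidden      # input embedding plus untied LM head
total = n_layers * per_layer + embeddings + hidden  # + final norm

n_pruned = 8                         # layers removed, as in the table on this card
pruned_total = total - n_pruned * per_layer

print(f"layers removed: {n_pruned / n_layers:.0%}")
print(f"parameters: {total / 1e9:.2f}B -> {pruned_total / 1e9:.2f}B")
```

Under these assumptions, removing 25% of the layers takes the model from about 8.0B to about 6.3B parameters. The parameter reduction is smaller than 25% because the embedding and output matrices are untouched.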

	## 🔥 Performance Comparison of Pruning Methods (Llama 3.1, 25% Compression)

	| Method | num_pruned_layers | Dataset | State | race 🏁 | winogrande 🎲 | piqa 🧠 | boolq ❓ | openbookqa 📖 | sciq 🔬 | lambada_openai 🦙 | ppl | Avg-acc 📊 |
	|-----------------------|-------------------|------------|---------------|--------|--------------|--------|---------|--------------|--------|------------------|-----------|------------|
	| | | | | acc | acc | acc_norm | acc | acc_norm | acc_norm | acc | | |
	| Llama 3.1 (baseline) | - | - | - | 0.450 | 0.779 | 0.810 | 0.842 | 0.430 | 0.961 | 0.732 | 3.404 | 0.712 |
	| UIDL* | 8 | slim_orca | no training | 0.341 | 0.719 | 0.690 | 0.773 | 0.310 | 0.719 | 0.087 | 932.000 | 0.592 |
	| ReplaceMe (Ours) ✅ | 8 | slim_orca | no training | 0.406 | 0.742 🏆 | 0.706 | 0.830 | 0.338 | 0.901 | 0.471 | 16.760 | 0.654 |
	| ReplaceMe (Ours) ❌ | 8 | slim_orca | SFT | 0.431 🏆 | 0.716 | 0.728 🏆 | 0.849 🏆 | 0.378 🏆 | 0.912 🏆 | 0.697 🏆 | 4.04 🏆 | 0.669 🏆 |

	Key:
	- 🏆 Best performance in column
	- ✅ Training-free (our methods)
	- ❌ Requires training

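	Metrics Explained:
	- Bold: Best training-free results
	- All numbers are accuracy scores (`acc` or `acc_norm`, as labelled in the sub-header row), except `ppl`, which is perplexity (lower is better)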

	> 🔥 Our Healed model can achieve 94.0% of baseline performance after healing on 1B tokens!
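
The 94.0% figure can be reproduced from the Avg-acc column of the table above. The short sketch below does that arithmetic for every pruned row; it assumes the row marked `SFT` is the healed model, which the table does not state explicitly.

```python
# Avg-acc values copied from the table above.
baseline_avg = 0.712
avg_acc = {
    "UIDL (no training)": 0.592,
    "ReplaceMe (no training)": 0.654,
    "ReplaceMe (healed, SFT row)": 0.669,  # assumption: SFT row == healed model
}

# Share of baseline average accuracy retained by each pruned model, in percent.
retention = {name: 100 * acc / baseline_avg for name, acc in avg_acc.items()}

for name, pct in retention.items():
    print(f"{name}: {pct:.1f}% of baseline")
```

By the same measure, the training-free ReplaceMe row retains a little under 92% of the baseline average and UIDL a little over 83%.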

	## Installation
	```bash
	pip install replaceme
	# or
	git clone https://github.com/mts-ai/ReplaceMe
	cd ReplaceMe
	pip install -e .
	```
	## Basic Usage
	```bash
	# LSTSQ method (recommended)
	run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

	# Cosine similarity method
	run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
	```
	There are many more parameters to experiment with; visit our repo to discover them 🔥🔥
	## Load Model
	Because the linear transformations (LTs) are merged into the original transformer weights, the pruned model loads like any other Hugging Face checkpoint:
	```python
	# Example: load the pruned model and generate a reply
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "MTSAIR/Llama3.1-6B-ReplaceMe-Healed"

	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype="auto",
	    device_map="auto",
	)
	tokenizer = AutoTokenizer.from_pretrained(model_name)

	prompt = "What is ReplaceME pruning method?!"
	messages = [
	    {"role": "user", "content": prompt}
	]
	# Render the chat template as a string that ends with the assistant turn marker
	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)
	model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

	output = model.generate(
	    **model_inputs,
	    max_new_tokens=512,
	)
	# Decode the full sequence (prompt plus generated continuation)
	response = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
	print(response)
	```
	## Citation
	If you use ReplaceMe in your research, please cite our paper:

	```bibtex
	@article{shopkhoev2025replaceme0,
	title = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
	author = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
	year = {2025},
	journal = {arXiv preprint arXiv: 2505.02819}
	}
	```