Qwen2-7B-ReLU
Qwen2-7B-ReLU is a variant of Qwen2-7B that replaces the SiLU/Swish activation function with dReLU, achieving higher sparsity while maintaining the performance of the original model.
Key Features
- Replaces SiLU/Swish activation function with dReLU
- Maintains performance comparable to, or even better than, the original Qwen2-7B
- Significantly increases activation sparsity, enabling further optimization and compression
Benchmarks
The model has been evaluated on standard benchmarks to verify its performance:
- MMLU: 69.19% (5-shot)
- IFEval: 73.2% (Prompt Strict-Accuracy)
- Livebench:
  - Average: 32.1%
  - Coding: 39.8%
  - Data Analysis: 45.3%
  - Instruction Following: 58.1%
  - Language: 9.0%
  - Math: 22.0%
  - Reasoning: 18.7%
These results indicate that the dReLU modification maintains competitive performance while achieving higher activation sparsity than the original model.
Technical Details
The key modification in this version is the application of ReLU activation to both branches in the MLP block. The implementation modifies the original Qwen2MLP class as follows:
import torch.nn as nn
from transformers.activations import ACT2FN

class Qwen2MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        # Activation function looked up by name from config.hidden_act
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        # dReLU: the activation is applied to both the gate and the up branch
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.act_fn(self.up_proj(x)))
        return down_proj
The key change is in the forward pass, where the activation function is now applied to both the gate projection and up projection outputs before multiplication. This modification, combined with the use of ReLU, contributes to the increased sparsity of the model.
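To see why activating both branches increases sparsity, the following sketch compares the fraction of exactly-zero intermediate values for a SiLU-gated product, `SiLU(gate) * up` as in the original Qwen2 MLP, against the dReLU product, `ReLU(gate) * ReLU(up)`. It is an illustration only: the pre-activations are assumed to be independent standard normal samples, which real model activations are not, and the helper names (`zero_fraction`, `simulate`) are hypothetical. The numbers it prints say nothing about the measured sparsity of this model.

```python
import math
import random


def relu(v):
    return v if v > 0.0 else 0.0


def silu(v):
    # SiLU / Swish: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))


def zero_fraction(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


def simulate(n=10_000, seed=0):
    """Compare a SiLU-gated product with a dReLU product on toy inputs.

    Assumption: gate and up pre-activations are independent standard
    normal samples, standing in for gate_proj(x) and up_proj(x).
    """
    rng = random.Random(seed)
    gate = [rng.gauss(0.0, 1.0) for _ in range(n)]
    up = [rng.gauss(0.0, 1.0) for _ in range(n)]
    silu_gated = [silu(g) * u for g, u in zip(gate, up)]
    drelu = [relu(g) * relu(u) for g, u in zip(gate, up)]
    return zero_fraction(silu_gated), zero_fraction(drelu)


silu_sparsity, drelu_sparsity = simulate()
print(f"SiLU-gated zero fraction: {silu_sparsity:.3f}")
print(f"dReLU zero fraction:      {drelu_sparsity:.3f}")
```

Under this toy assumption the dReLU product is zero whenever either branch is non-positive, so roughly three quarters of the entries vanish, while the SiLU-gated product is almost never exactly zero.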
Intended Usage
This release primarily targets the research community for:
- Studying sparsity in large language models
- Model compression and optimization research
- Understanding the impact of activation functions on model behavior
Model Limitations
- The model may exhibit biases present in the training data
- May generate incorrect, inappropriate, or harmful content
- Performance may vary across different domains and tasks
- Not suitable for production deployment without proper evaluation
Quick Start
Before loading the model, first replace the FFN implementation in the original modeling_qwen code with the dReLU version shown under Technical Details.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and its tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("PowerInfer/SparseQwen2-7B")
tokenizer = AutoTokenizer.from_pretrained("PowerInfer/SparseQwen2-7B")

# Tokenize a prompt, generate a continuation, and decode it back to text
prompt = "Hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs)
response = tokenizer.decode(outputs[0])
print(response)
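The FFN replacement required for Quick Start could also be applied at runtime instead of by editing the installed library. The sketch below is one possible approach, not the authors' documented procedure. The helpers `drelu_forward`, `patch_mlp_class`, and `patch_transformers_qwen2` are hypothetical names; the sketch assumes that the installed transformers version exposes `Qwen2MLP` in `transformers.models.qwen2.modeling_qwen2` with `gate_proj`, `up_proj`, `down_proj`, and `act_fn` attributes, and that the checkpoint's config selects a ReLU activation through `hidden_act`.

```python
def drelu_forward(self, x):
    """dReLU feed-forward: activation on both the gate and the up branch."""
    gate = self.act_fn(self.gate_proj(x))
    up = self.act_fn(self.up_proj(x))
    return self.down_proj(gate * up)


def patch_mlp_class(mlp_cls):
    """Replace mlp_cls.forward with drelu_forward and return the original."""
    original = mlp_cls.forward
    mlp_cls.forward = drelu_forward
    return original


def patch_transformers_qwen2():
    """Patch Qwen2MLP in the installed transformers package.

    Assumption: Qwen2MLP lives at this import path in the installed version.
    The import is deferred so this file loads even without transformers.
    """
    from transformers.models.qwen2 import modeling_qwen2

    return patch_mlp_class(modeling_qwen2.Qwen2MLP)
```

Calling `patch_transformers_qwen2()` once before `model.generate(...)` would route every Qwen2 MLP block through the dReLU forward pass. Because the patch is applied to the class, it affects all Qwen2 models in the same process, so it is best kept to a dedicated script.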
Citation
If you use this model in your research, please cite:
@article{song2024turbo,
title={Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters},
author={Song, Yixin and Xie, Haotong and Zhang, Zhengyan and Wen, Bo and Ma, Li and Mi, Zeyu and Chen, Haibo},
journal={arXiv preprint arXiv:2406.05955},
year={2024}
}