Raincleared commited on
Commit
1c106ea
1 Parent(s): ac351f6

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +30 -30
README.md CHANGED
@@ -21,7 +21,7 @@ The utilization of activation sparsity, namely the existence of considerable wea
21
 
22
  Adopting ReLU as the activation function is a straightforward method to achieve activation sparsity. However, most recent mainstream LLMs adopt activation functions without intrinsic sparsity (e.g., GELU and Swish). Some efforts ([Zhang et al., 2022](https://aclanthology.org/2022.findings-acl.71.pdf); [Mirzadeh et al., 2023](https://arxiv.org/pdf/2310.04564.pdf); [Zhang et al., 2024](https://arxiv.org/pdf/2402.03804.pdf)) introduce ReLU or its variants as the substitutive activation function to help non-ReLU LLMs achieve activation sparsity and inference acceleration, but few can concurrently obtain high sparsity and comparable task-specific performance.
23
 
24
- In this work, we introduce an effective sparsification method named "ProSparse" to push for higher activation sparsity without performance degradation. By applying ProSparse to Swish-activated LLaMA2-7B and LLaMA2-13B, we obtain ReLU-activated LLaMA2 models with high sparsity of 89.32% and 88.80% respectively while their performance is comparable to the original version. Further inference acceleration experiments demonstrate the practical speedup effects of higher sparsity on both [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf) and our two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator).
25
 
26
  ### Training Dataset
27
 
@@ -52,7 +52,7 @@ Intuitively, training the model with even more tokens or with data of a wider co
52
 
53
  The training process of ProSparse consists of three steps (refer to Section 3.2 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details):
54
 
55
- 1. **Activation Function Substitution**: We substituting the activation function of FFNs with ReLU and applying continual training;
56
  2. **Progressive Sparsity Regularization**: We jointly optimize the model on the conventional next-token prediction loss and \\(L_1\\) regularization loss. The regularization is applied to the sparse intermediate outputs of FFNs with a regularization factor increasing progressively in multiple stages. Specifically, the regularization factor \\(\lambda\\) is set to a small constant for the warmup stage, and then increases along a smooth sine curve for each of the subsequent incremental stages. Each stage is accompanied by certain steps of training. In this way, the model can have more time to adapt to the increasing regularization without radical activation shifts, thus alleviating performance degradation.
57
  3. **Activation Threshold Shifting**: We finally replace ReLU with FATReLU ([Kurtz et al., 2020](https://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf)), a ReLU variant with a positive threshold. This can prune those non-zero weakly-contributed elements in activation outputs and further boost sparsity.
58
 
@@ -75,27 +75,27 @@ The evaluation results on the above benchmarks demonstrate the advantage of ProS
75
 
76
  - **Commonsense Reasoning**: We report the average 0-shot accuracies on PIQA, SIQA, HellaSwag, WinoGrande, and COPA.
77
 
78
- - **Reading Comprehension**: We compute the average 0-shot accuracies on BoolQ, 0-shot accuracy on LAMBADA and TyDi QA.
79
 
80
  - **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and AGI-Eval (0-shot).
81
 
82
  **Notes**: For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA, and AGI-Eval, we obtain the predicted answers based on maximized perplexity. For GSM8K, MMLU, and BBH, the predicted answers are directly generated.
83
 
84
- | Setting | Average<br>Sparsity | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval | Average |
85
- | :-------------------: | :-----------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: |
86
- | Original-7B | - | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 | 37.96 |
87
- | ReluLLaMA-7B | 66.98 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 | 37.62 |
88
- | Vanilla ReLU-7B | 66.04 | 21.31 | 70.73 | 73.22 | 11.22 | 49.22 | 36.11 | 28.01 | 41.40 |
89
- | Shifted ReLU-7B | 69.59 | 20.50 | 70.09 | 73.17 | 13.87 | 48.54 | 35.20 | 27.94 | 41.33 |
90
- | Fixed \\(L_1\\)-7B | 91.46 | 18.85 | 66.01 | 55.39 | 2.27 | 32.28 | 31.40 | 26.48 | 33.24 |
91
- | **ProSparse-7B**\* | 88.11 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 | 38.31 |
92
- | **ProSparse-7B** | 89.32 | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 | 38.46 |
93
- | Original-13B | - | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 | 44.06 |
94
- | ReluLLaMA-13B | 71.56 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 | 42.74 |
95
- | **ProSparse-13B**\* | 87.97 | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 | 45.07 |
96
- | **ProSparse-13B** | 88.80 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 | 44.90 |
97
-
98
- **Notes**: "Original" refers to the original Swish-activated LLaMA2 versions. ReluLLaMA-7B and ReluLLaMA-13B are available at [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B) respectively. "ProSparse-7B\*" and "ProSparse-13B\*" denote the ProSparse versions without activation threshold shifting.
99
 
100
  ### Evaluation Issues with LM-Eval
101
 
@@ -132,18 +132,18 @@ where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) de
132
 
133
  The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details.
134
 
135
- | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | `S2`<br>Time | `S2`<br>Speedup | `S3`<br/>Time | `S3`<br/>Speedup |
136
- | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
137
- | ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 67.12 | 1.35 | 63.00 | 1.32 |
138
- | Vanilla ReLU-7B | 66.04 | 87.72 | 72.57 | 12.04 | 67.85 | 1.33 | 63.28 | 1.31 |
139
- | Fixed \\(L_1\\)-7B | 91.46 | 94.51 | 82.85 | 19.62 | 40.99 | 2.21 | 54.19 | 1.53 |
140
- | **ProSparse-7B**\* | 88.11 | 93.46 | 75.24 | 16.30 | 46.66 | 1.94 | 55.56 | 1.49 |
141
- | **ProSparse-7B** | 89.32 | 92.34 | 78.75 | - | 45.38 | 2.00 | 55.05 | 1.51 |
142
- | ReluLLaMA-13B | 71.56 | 86.41 | 71.93 | 6.59 | 69.92 | 1.88 | 75.47 | 1.51 |
143
- | **ProSparse-13B**\* | 87.97 | 91.02 | 77.93 | 8.67 | 55.29 | 2.38 | 67.50 | 1.68 |
144
- | **ProSparse-13B** | 88.80 | 91.11 | 78.28 | - | 53.78 | 2.44 | 66.73 | 1.70 |
145
-
146
- **Notes**: Fixed \\(L_1\\) suffers from severe performance degradation. ProSparse with Activation Threshold Shifting is not supported by PowerInfer. "Time" means the average wall-clock time (us) cost by each step with our sparse GPU operators, and "Speedup" is the speedup ratio to the setting without operators. For reference, the average number of tokens generated by [llama.cpp](https://github.com/ggerganov/llama.cpp) per second is about **3.67 for 7B and 1.92 for 13B**. The average time for step (2) and (3) without sparse GPU operators is about **90.55 and 82.92 (us) for 7B, 131.36 and 113.68 (us) for 13B** respectively under all sparsity.
147
 
148
  ### License Disclaimer
149
 
 
21
 
22
  Adopting ReLU as the activation function is a straightforward method to achieve activation sparsity. However, most recent mainstream LLMs adopt activation functions without intrinsic sparsity (e.g., GELU and Swish). Some efforts ([Zhang et al., 2022](https://aclanthology.org/2022.findings-acl.71.pdf); [Mirzadeh et al., 2023](https://arxiv.org/pdf/2310.04564.pdf); [Zhang et al., 2024](https://arxiv.org/pdf/2402.03804.pdf)) introduce ReLU or its variants as the substitutive activation function to help non-ReLU LLMs achieve activation sparsity and inference acceleration, but few can concurrently obtain high sparsity and comparable task-specific performance.
23
 
24
+ In this work, we introduce a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity while maintaining comparable performance. By applying ProSparse to Swish-activated LLaMA2-7B, LLaMA2-13B, and MiniCPM-1B, we obtain ReLU-activated models with high sparsity of 89.32%, 88.80%, and 87.89%, respectively, while their performance is comparable to the original version. These present the most sparsely activated models among open-source LLaMA versions and competitive end-size models, considerably surpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Further inference acceleration experiments demonstrate the practical speedup effects of higher sparsity on both [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf) and our two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator).
25
 
26
  ### Training Dataset
27
 
 
52
 
53
  The training process of ProSparse consists of three steps (refer to Section 3.2 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details):
54
 
55
+ 1. **Activation Function Substitution**: We substitute the activation function of FFNs with ReLU and apply continual training;
56
  2. **Progressive Sparsity Regularization**: We jointly optimize the model on the conventional next-token prediction loss and \\(L_1\\) regularization loss. The regularization is applied to the sparse intermediate outputs of FFNs with a regularization factor increasing progressively in multiple stages. Specifically, the regularization factor \\(\lambda\\) is set to a small constant for the warmup stage, and then increases along a smooth sine curve for each of the subsequent incremental stages. Each stage is accompanied by certain steps of training. In this way, the model can have more time to adapt to the increasing regularization without radical activation shifts, thus alleviating performance degradation.
57
  3. **Activation Threshold Shifting**: We finally replace ReLU with FATReLU ([Kurtz et al., 2020](https://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf)), a ReLU variant with a positive threshold. This can prune those non-zero weakly-contributed elements in activation outputs and further boost sparsity.
58
 
 
75
 
76
  - **Commonsense Reasoning**: We report the average 0-shot accuracies on PIQA, SIQA, HellaSwag, WinoGrande, and COPA.
77
 
78
+ - **Reading Comprehension**: We compute the average 0-shot accuracies on BoolQ, LAMBADA, and TyDi QA.
79
 
80
  - **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and AGI-Eval (0-shot).
81
 
82
  **Notes**: For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA, and AGI-Eval, we obtain the predicted answers based on maximized perplexity. For GSM8K, MMLU, and BBH, the predicted answers are directly generated.
83
 
84
+ | Setting | Average<br>Sparsity | Average<br>Performance | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval |
85
+ | :-------------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: | :-----------------: |
86
+ | LLaMA2-7B | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
87
+ | ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 |
88
+ | **ProSparse-7B**\* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
89
+ | **ProSparse-7B** | **89.32** | **38.46** | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
90
+ | LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
91
+ | ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
92
+ | **ProSparse-13B**\* | 87.97 | **45.07** | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
93
+ | **ProSparse-13B** | **88.80** | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
94
+ | MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 |
95
+ | **ProSparse-1B**\* | 86.25 | **44.72** | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
96
+ | **ProSparse-1B** | **87.89** | **44.72** | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |
97
+
98
+ **Notes**: "Original" refers to the original Swish-activated LLaMA2 versions. ReluLLaMA-7B and ReluLLaMA-13B are available at [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B) respectively. MiniCPM-1B is available at [1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16). "ProSparse-7B\*", "ProSparse-13B\*", and "ProSparse-1B\*" denote the ProSparse versions without activation threshold shifting.
99
 
100
  ### Evaluation Issues with LM-Eval
101
 
 
132
 
133
  The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details.
134
 
135
+ | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | Speedup<br>to Dense | `S2`<br>Time | Speedup<br>to Dense | `S3`<br/>Time | Speedup<br/>to Dense |
136
+ | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
137
+ | Dense-7B | - | - | - | 3.67 | 1.00 | 90.55 | 1.00 | 82.92 | 1.00 |
138
+ | ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 3.10 | 67.12 | 1.35 | 63.00 | 1.32 |
139
+ | **ProSparse-7B**\* | 88.11 | **93.46** | 75.24 | **16.30** | **4.44** | 46.66 | 1.94 | 55.56 | 1.49 |
140
+ | **ProSparse-7B** | **89.32** | 92.34 | **78.75** | - | - | **45.38** | **2.00** | **55.05** | **1.51** |
141
+ | Dense-13B | - | - | - | 1.92 | 1.00 | 131.36 | 1.00 | 113.68 | 1.00 |
142
+ | ReluLLaMA-13B | 71.56 | 86.41 | 71.93 | 6.59 | 3.43 | 69.92 | 1.88 | 75.47 | 1.51 |
143
+ | **ProSparse-13B**\* | 87.97 | 91.02 | 77.93 | **8.67** | **4.52** | 55.29 | 2.38 | 67.50 | 1.68 |
144
+ | **ProSparse-13B** | **88.80** | **91.11** | **78.28** | - | - | **53.78** | **2.44** | **66.73** | **1.70** |
145
+
146
+ **Notes**: For "Dense" settings, the "Inference Speed" (token/sec) is obtained by [llama.cpp](https://github.com/ggerganov/llama.cpp), and the time (us) for steps (2) and (3) is measured without sparse GPU operators. For other sparse settings, the "Inference Speed" is obtained by [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), and sparse GPU operators are applied. ProSparse settings with activation threshold shifting and the MiniCPM architecture are not supported by PowerInfer at present.
147
 
148
  ### License Disclaimer
149