Shangming Cai
committed on
Commit 575c4e9
1 Parent(s): 08c8530
Update README of branch dev_triton.
- README.md +27 -0
- triton_kernels.py +10 -0
README.md
CHANGED
@@ -67,6 +67,14 @@ cd flash-attention && pip install .
 # pip install csrc/layer_norm
 # pip install csrc/rotary
 ```
+
+If you require higher inference performance but run into problems when installing the optional acceleration kernels (i.e., `layer_norm` and `rotary`), or if the GPU you are using does not meet the NVIDIA Ampere/Ada/Hopper architecture requirement of the `flash-attention` library, you can try the Triton-based inference acceleration implemented in this branch. It supports a wider range of GPU products and requires no extra installation. You can enable it by setting the `use_triton` option to true in the config.json file.
+
+**(In the dev_triton branch, `use_triton` is set to 'auto' by default. Since Triton is installed by default with PyTorch 2.0 and above, this acceleration can be enabled directly, without any additional installation or configuration. If you prefer not to activate this optimization, set `use_triton` to false in the config.json file.)**
 <br>
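As a concrete illustration of the paragraphs above (not part of this commit), here is one way to flip the flag programmatically; the local checkpoint directory is an assumed path:

```python
# Hypothetical helper: toggle the `use_triton` flag in a downloaded
# checkpoint's config.json before loading the model.
import json

config_path = "Qwen-7B-Chat-Int4/config.json"  # assumed local checkpoint path

with open(config_path) as f:
    config = json.load(f)

config["use_triton"] = True  # the dev_triton branch defaults this to "auto"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```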
@@ -140,6 +148,25 @@ In detail, the setting of profiling is generating 8192 new tokens with 1 context
 
 Note: The generation speed of the Int4/Int8 models mentioned above is provided by the autogptq library. The current speed of the model loaded using "AutoModelForCausalLM.from_pretrained" will be approximately 20% slower. We have reported this issue to the HuggingFace team and will update it promptly if a solution is available.
 
+In addition, we measured the average inference speed (tokens/s) of the Qwen-7B-Chat-Int4 model when generating 2048 and 8192 tokens on different GPU devices and with different acceleration methods. All results were obtained with PyTorch 2.1.0 and CUDA 11.8.
+
+| GPU Device | Method       | Speed (2048 tokens) | Speed (8192 tokens) |
+| :--------: | :----------: | :-----------------: | :-----------------: |
+| A10        | FlashAttn v2 | 41.28               | 30.78               |
+| A10        | Triton       | 49.04               | 29.17               |
+| A10        | Disabled     | 39.26               | 26.81               |
+| V100       | FlashAttn v2 | N/A                 | N/A                 |
+| V100       | Triton       | 37.01               | 27.66               |
+| V100       | Disabled     | 24.47               | 20.40               |
+| P100       | FlashAttn v2 | N/A                 | N/A                 |
+| P100       | Triton       | 29.03               | 13.85               |
+| P100       | Disabled     | 20.50               | 12.73               |
+| T4         | FlashAttn v2 | N/A                 | N/A                 |
+| T4         | Triton       | 27.98               | 15.22               |
+| T4         | Disabled     | 23.11               | 14.55               |
+
 ### GPU Memory Usage
 
 We also profiled the peak GPU memory usage for encoding 2048 tokens and generating 8192 tokens under different model precisions. (GPU memory consumption is similar whether or not FlashAttn is used.) The results are shown below:
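The tokens/s figures above can be approximated with a simple timing loop. The sketch below is an illustration under assumptions (the Hub model id, and forcing a fixed generation length via `min_new_tokens`/`max_new_tokens`); it is not the profiling script used to produce the table:

```python
# Rough sketch of measuring average generation speed (tokens/s).
# Assumes a CUDA device and the Qwen-7B-Chat-Int4 checkpoint on the Hub.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat-Int4"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()

new_tokens = 2048  # or 8192, matching the table above
inputs = tokenizer("你好", return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.perf_counter()
model.generate(
    **inputs,
    min_new_tokens=new_tokens,  # force a fixed generation length
    max_new_tokens=new_tokens,
)
torch.cuda.synchronize()
print(f"{new_tokens / (time.perf_counter() - start):.2f} tokens/s")
```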
triton_kernels.py
CHANGED
@@ -1,3 +1,13 @@
+# Copyright (c) Alibaba Cloud.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+# This module provides ApplyRoPE and RMSNorm kernels written in OpenAI Triton.
+# Feel free to contact the contributors if you have any questions or issues regarding this code.
+# Contributors: Shangming Cai, Zihan Wang
+# Contacts: [email protected], [email protected]
+
 from typing import Any, Callable, Dict, Hashable, Tuple
 
 import torch
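To give a sense of what the module header above describes, here is a minimal RMSNorm kernel in OpenAI Triton. It is an illustrative sketch, not the kernel added in this commit; the names, the one-row-per-program layout, and the `eps` default are all assumptions:

```python
# Illustrative RMSNorm kernel in OpenAI Triton: each program instance
# normalizes one row of a contiguous 2D input tensor.
import torch
import triton
import triton.language as tl


@triton.jit
def _rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Load one row, computing in float32 for numerical stability.
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: y = x / sqrt(mean(x^2) + eps) * weight
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)


def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    assert x.is_cuda and x.is_contiguous() and x.ndim == 2
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    # BLOCK_SIZE must be a power of two that covers a full row.
    _rmsnorm_kernel[(n_rows,)](
        x, weight, out, n_cols, eps, BLOCK_SIZE=triton.next_power_of_2(n_cols)
    )
    return out
```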