Shangming Cai
committed on
Commit 575c4e9
1 Parent(s): 08c8530
Update README of branch dev_triton.
- README.md +27 -0
- triton_kernels.py +10 -0
README.md
CHANGED
@@ -67,6 +67,14 @@ cd flash-attention && pip install .
 # pip install csrc/layer_norm
 # pip install csrc/rotary
 ```
+
+If you require higher inference performance but run into problems when installing the optional acceleration kernels (i.e., `layer_norm` and `rotary`), or if the GPU you are using does not meet the NVIDIA Ampere/Ada/Hopper architecture requirement of the `flash-attention` library, you can try the Triton-based inference acceleration implemented in this branch. It supports a wider range of GPU products and requires no extra installation. You can enable it by setting the `use_triton` option to true in the config.json file.
+
+**(In the dev_triton branch, `use_triton` is set to 'auto' by default. Since Triton is installed by default with PyTorch 2.0 and above, this acceleration can be enabled directly, without any additional installation or configuration. If you prefer not to activate this optimization, set `use_triton` to false in the config.json file.)**
 <br>
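As a concrete illustration of the paragraphs above (not part of this commit), here is one way to flip the flag programmatically; the local checkpoint directory is an assumed path:

```python
# Hypothetical helper: toggle the `use_triton` flag in a downloaded
# checkpoint's config.json before loading the model.
import json

config_path = "Qwen-7B-Chat-Int4/config.json"  # assumed local checkpoint path

with open(config_path) as f:
    config = json.load(f)

config["use_triton"] = True  # the dev_triton branch defaults this to "auto"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```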
@@ -140,6 +148,25 @@ In detail, the setting of profiling is generating 8192 new tokens with 1 context
 
 Note: The generation speed of the Int4/Int8 models mentioned above is provided by the autogptq library. The current speed of the model loaded using "AutoModelForCausalLM.from_pretrained" will be approximately 20% slower. We have reported this issue to the HuggingFace team and will update it promptly if a solution is available.
 
+In addition, we measured the average inference speed (tokens/s) of the Qwen-7B-Chat-Int4 model when generating 2048 and 8192 tokens on different GPU devices and with different acceleration methods. All results were obtained with PyTorch 2.1.0 and CUDA 11.8.
+
+| GPU Device | Method       | Speed (2048 tokens) | Speed (8192 tokens) |
+| :--------: | :----------: | :-----------------: | :-----------------: |
+| A10        | FlashAttn v2 | 41.28               | 30.78               |
+| A10        | Triton       | 49.04               | 29.17               |
+| A10        | Disabled     | 39.26               | 26.81               |
+| V100       | FlashAttn v2 | N/A                 | N/A                 |
+| V100       | Triton       | 37.01               | 27.66               |
+| V100       | Disabled     | 24.47               | 20.40               |
+| P100       | FlashAttn v2 | N/A                 | N/A                 |
+| P100       | Triton       | 29.03               | 13.85               |
+| P100       | Disabled     | 20.50               | 12.73               |
+| T4         | FlashAttn v2 | N/A                 | N/A                 |
+| T4         | Triton       | 27.98               | 15.22               |
+| T4         | Disabled     | 23.11               | 14.55               |
+
 ### GPU Memory Usage
 
 We also profiled the peak GPU memory usage for encoding 2048 tokens and generating 8192 tokens under different model precisions. (GPU memory consumption is similar whether or not FlashAttn is used.) The results are shown below:
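The tokens/s figures above can be approximated with a simple timing loop. The sketch below is an illustration under assumptions (the Hub model id, and forcing a fixed generation length via `min_new_tokens`/`max_new_tokens`); it is not the profiling script used to produce the table:

```python
# Rough sketch of measuring average generation speed (tokens/s).
# Assumes a CUDA device and the Qwen-7B-Chat-Int4 checkpoint on the Hub.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat-Int4"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()

new_tokens = 2048  # or 8192, matching the table above
inputs = tokenizer("你好", return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.perf_counter()
model.generate(
    **inputs,
    min_new_tokens=new_tokens,  # force a fixed generation length
    max_new_tokens=new_tokens,
)
torch.cuda.synchronize()
print(f"{new_tokens / (time.perf_counter() - start):.2f} tokens/s")
```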
triton_kernels.py
CHANGED
@@ -1,3 +1,13 @@
+# Copyright (c) Alibaba Cloud.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+# This module provides ApplyRoPE and RMSNorm kernels written in OpenAI Triton.
+# Feel free to contact the contributors if you have any questions or issues regarding this code.
+# Contributors: Shangming Cai, Zihan Wang
+# Contacts: [email protected], [email protected]
+
 from typing import Any, Callable, Dict, Hashable, Tuple
 
 import torch
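To give a sense of what the module header above describes, here is a minimal RMSNorm kernel in OpenAI Triton. It is an illustrative sketch, not the kernel added in this commit; the names, the one-row-per-program layout, and the `eps` default are all assumptions:

```python
# Illustrative RMSNorm kernel in OpenAI Triton: each program instance
# normalizes one row of a contiguous 2D input tensor.
import torch
import triton
import triton.language as tl


@triton.jit
def _rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Load one row, computing in float32 for numerical stability.
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: y = x / sqrt(mean(x^2) + eps) * weight
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)


def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    assert x.is_cuda and x.is_contiguous() and x.ndim == 2
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    # BLOCK_SIZE must be a power of two that covers a full row.
    _rmsnorm_kernel[(n_rows,)](
        x, weight, out, n_cols, eps, BLOCK_SIZE=triton.next_power_of_2(n_cols)
    )
    return out
```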