The performance of SageAttention 2.2 is worse than SageAttention 2.1.
Could you please provide the code so we can reproduce this?
```python
# sage_test.py
import torch

from sageattention import sageattn_qk_int8_pv_fp8_cuda


def sageattn_call(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    out = sageattn_qk_int8_pv_fp8_cuda(
        query,
        key,
        value,
        tensor_layout='NHD',
        qk_quant_gran='per_warp',
        # pv_accum_dtype='fp32+fp32',  # SageAttention2, commit_id: 1439bd9b56e5b9d1e64e583c261a381b419a5ab7
        pv_accum_dtype='fp32+fp16',  # SageAttention2++, commit_id: 3ea20a257c55bfa513695aa97bca3f3b8060424f
    )
    return out


@torch.inference_mode()
def main():
    torch.manual_seed(10086)
    torch.cuda.manual_seed(10086)
    torch.cuda.manual_seed_all(10086)
    device = torch.device('cuda:0')
    dtype = torch.bfloat16
    shape = (1, 75600, 40, 128)  # wan2.1 720p, 81 frames
    stream = torch.cuda.current_stream(device)
    query = torch.rand(shape, dtype=dtype, device=device)
    key = torch.rand(shape, dtype=dtype, device=device)
    value = torch.rand(shape, dtype=dtype, device=device)
    # warm up
    for _ in range(10):
        sageattn_call(query, key, value)
    stream.synchronize()
    # single call captured by nsys
    sageattn_call(query, key, value)


if __name__ == '__main__':
    # nsys profile python sage_test.py
    main()
```
```shell
# pip install sageattention2_plus
# change "pv_accum_dtype='fp32+fp16'"
nsys profile python sage_test.py

# pip install sageattention2
# change "pv_accum_dtype='fp32+fp32'"
nsys profile python sage_test.py
```
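To compare the two captures numerically rather than by eye, `nsys stats` can summarize per-kernel GPU time from each report. A sketch, assuming a recent Nsight Systems (the report name varies slightly across versions, and `report1`/`report2` are placeholders for the two generated `.nsys-rep` files):

```shell
# Summarize GPU kernel durations for each capture; the attention
# kernel's total/avg time can then be compared between versions.
nsys stats --report cuda_gpu_kern_sum report1.nsys-rep
nsys stats --report cuda_gpu_kern_sum report2.nsys-rep
```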
Then I got two nsys profile results.

ENV:
- GPU: NVIDIA L20
- CUDA: 12.8
- PyTorch: 2.7.0
To the best of my knowledge, `Ends - Begins` represents the duration, i.e. the wall-clock time from when a kernel is submitted to when it completes. This duration includes not only the actual execution time but also potential dispatch delays and stream stalls.

I recommend trying the benchmark examples in `bench/bench_qk_int8_pv_fp8_cuda.py` to measure the end-to-end time of the two kernels.
I agree with your statement, but after the 10 warm-up runs, and since only the attention kernel itself changed between the two versions, I think nsys reports the performance correctly and the two comparison conditions are balanced.
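Either way, repeating the call many times between synchronization points and taking the median is more robust than a single per-launch duration. A minimal sketch of that warmup-then-median methodology (pure-Python timing for illustration; `run` is a placeholder for a closure like `lambda: sageattn_call(query, key, value)`, and on GPU you would call `torch.cuda.synchronize()` before each timestamp, or use `torch.cuda.Event`, so queued-but-unfinished kernels are not mistaken for fast ones):

```python
import time
import statistics


def bench(run, warmup=10, iters=50):
    """Return the median wall-clock time per call, after warmup."""
    # Warm-up iterations exclude one-time costs (JIT, caches, allocator).
    for _ in range(warmup):
        run()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

Running this once per installed version (fp32+fp32 vs. fp32+fp16) yields one directly comparable number per kernel.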