The performance of sageattention2.2 is worse than that of sageattention2.1.

#7
by triplemu - opened

sageattention2.1.1: 534.252
sageattention2.2.0: 558.459

Owner

Could you please provide the code so we can reproduce this?


> Could you please provide the code so we can reproduce this?

# torch
import torch

# sageattention
from sageattention import sageattn_qk_int8_pv_fp8_cuda


def sageattn_call(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    out = sageattn_qk_int8_pv_fp8_cuda(
        query,
        key,
        value,
        tensor_layout='NHD',
        qk_quant_gran='per_warp',
        # pv_accum_dtype='fp32+fp32', # switch 2    commit_id: 1439bd9b56e5b9d1e64e583c261a381b419a5ab7
        pv_accum_dtype='fp32+fp16', # switch 2++  commit_id: 3ea20a257c55bfa513695aa97bca3f3b8060424f
    )
    return out




@torch.inference_mode()
def main():
    torch.manual_seed(10086)
    torch.cuda.manual_seed(10086)
    torch.cuda.manual_seed_all(10086)

    device = torch.device('cuda:0')
    dtype = torch.bfloat16
    shape = (1, 75600, 40, 128)  # wan2.1 720p 81frames
    stream = torch.cuda.current_stream(device)

    query = torch.rand(shape, dtype=dtype, device=device)
    key = torch.rand(shape, dtype=dtype, device=device)
    value = torch.rand(shape, dtype=dtype, device=device)

    # warm up
    for _ in range(10):
        sageattn_call(query, key, value)
    stream.synchronize()

    # capture nsys
    sageattn_call(query, key, value)


if __name__ == '__main__':
    # nsys profile python sage_test.py
    main()
# pip install sageattention2_plus
# change "pv_accum_dtype='fp32+fp16'"
nsys profile python sage_test.py

# pip install sageattention2
# change "pv_accum_dtype='fp32+fp32'"
nsys profile python sage_test.py

Then I got 2 nsys profile results:

[two nsys profiler screenshots]

ENV:
L20 GPU
cuda=12.8
pytorch=2.7.0

To the best of my knowledge, Ends - Begins represents the duration, which reflects the wall-clock time from when a kernel is submitted to when it completes. This duration includes not only the actual execution time but also potential dispatch delays and stream stalls.

I recommend trying the benchmark examples in bench/bench_qk_int8_pv_fp8_cuda.py to measure the end-to-end time of the two kernels.
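For reference, end-to-end timing of this kind is usually done with CUDA events rather than per-kernel nsys rows. This is only a minimal sketch along the lines of that suggestion (the actual bench/bench_qk_int8_pv_fp8_cuda.py may differ); the `bench` helper is a made-up name, and `scaled_dot_product_attention` stands in for the SageAttention kernel so the snippet also runs without the package installed:

```python
import time

import torch


def bench(fn, *args, warmup=10, iters=50):
    """Average end-to-end time per call, in milliseconds.

    Uses CUDA events when a GPU is present, so the measurement covers
    actual execution rather than submit-to-complete wall-clock time.
    """
    for _ in range(warmup):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn(*args)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # elapsed_time is in ms
    # CPU fallback so the sketch runs anywhere
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) * 1e3 / iters


# Small stand-in tensors; on a GPU you would build the (1, 75600, 40, 128)
# inputs from the script above and pass sageattn_call instead of SDPA.
q = k = v = torch.rand(1, 8, 128, 64)
ms = bench(torch.nn.functional.scaled_dot_product_attention, q, k, v)
print(f"{ms:.3f} ms / call")
```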

> To the best of my knowledge, Ends - Begins represents the duration, which reflects the wall-clock time from when a kernel is submitted to when it completes. This duration includes not only the actual execution time but also potential dispatch delays and stream stalls.
>
> I recommend trying the benchmark examples in bench/bench_qk_int8_pv_fp8_cuda.py to measure the end-to-end time of the two kernels.

I agree with your statement, but after 10 warm-up iterations, and since only the compute kernel changed between the two versions, I think nsys reports the performance correctly and the two comparison conditions are balanced.
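One way to cross-check the nsys numbers without leaving PyTorch is torch.profiler, whose per-op self time is reported separately from dispatch overhead. A hedged sketch, with `scaled_dot_product_attention` again standing in for `sageattn_qk_int8_pv_fp8_cuda` (on a GPU, swap in the real call and the table will sort by CUDA self time):

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Profile CPU ops always; add CUDA kernels when a GPU is present.
activities = [ProfilerActivity.CPU]
sort_key = "self_cpu_time_total"
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    sort_key = "self_cuda_time_total"

# Small stand-in tensors; replace with the real shapes/kernel on a GPU.
q = k = v = torch.rand(1, 8, 128, 64)

with profile(activities=activities) as prof:
    for _ in range(10):
        torch.nn.functional.scaled_dot_product_attention(q, k, v)

# Self time isolates each op/kernel from surrounding framework overhead,
# unlike a submit-to-complete duration.
table = prof.key_averages().table(sort_by=sort_key, row_limit=5)
print(table)
```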

triplemu changed discussion status to closed
triplemu changed discussion status to open
