The performance of SageAttention 2.2 is worse than SageAttention 2.1.
Could you please provide the code so we can reproduce this?
```python
# sage_test.py
import torch

from sageattention import sageattn_qk_int8_pv_fp8_cuda


def sageattn_call(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    out = sageattn_qk_int8_pv_fp8_cuda(
        query,
        key,
        value,
        tensor_layout='NHD',
        qk_quant_gran='per_warp',
        # pv_accum_dtype='fp32+fp32',  # SageAttention2, commit_id: 1439bd9b56e5b9d1e64e583c261a381b419a5ab7
        pv_accum_dtype='fp32+fp16',  # SageAttention2++, commit_id: 3ea20a257c55bfa513695aa97bca3f3b8060424f
    )
    return out


@torch.inference_mode()
def main():
    torch.manual_seed(10086)
    torch.cuda.manual_seed(10086)
    torch.cuda.manual_seed_all(10086)
    device = torch.device('cuda:0')
    dtype = torch.bfloat16
    shape = (1, 75600, 40, 128)  # wan2.1 720p, 81 frames
    stream = torch.cuda.current_stream(device)
    query = torch.rand(shape, dtype=dtype, device=device)
    key = torch.rand(shape, dtype=dtype, device=device)
    value = torch.rand(shape, dtype=dtype, device=device)
    # warm up
    for _ in range(10):
        sageattn_call(query, key, value)
    stream.synchronize()
    # single call captured by nsys
    sageattn_call(query, key, value)


if __name__ == '__main__':
    # nsys profile python sage_test.py
    main()
```
```shell
# pip install sageattention2_plus
# change "pv_accum_dtype='fp32+fp16'"
nsys profile python sage_test.py

# pip install sageattention2
# change "pv_accum_dtype='fp32+fp32'"
nsys profile python sage_test.py
```
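To compare the two captures numerically rather than by eye, `nsys stats` can summarize per-kernel GPU time from each report. A sketch, assuming a recent Nsight Systems (the report name varies slightly across versions, and `report1`/`report2` are placeholders for the two generated `.nsys-rep` files):

```shell
# Summarize GPU kernel durations for each capture; the attention
# kernel's total/avg time can then be compared between versions.
nsys stats --report cuda_gpu_kern_sum report1.nsys-rep
nsys stats --report cuda_gpu_kern_sum report2.nsys-rep
```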
Then I got two nsys profile results.

ENV:
- GPU: NVIDIA L20
- CUDA: 12.8
- PyTorch: 2.7.0
To the best of my knowledge, `Ends - Begins` represents the duration, i.e. the wall-clock time from when a kernel is submitted to when it completes. This duration includes not only the actual execution time but also potential dispatch delays and stream stalls.

I recommend trying the benchmark examples in `bench/bench_qk_int8_pv_fp8_cuda.py` to measure the end-to-end time of the two kernels.
I agree with your statement, but after the 10 warm-up runs, and since only the attention kernel itself changed between the two versions, I think nsys reports the performance correctly and the two comparison conditions are balanced.
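Either way, repeating the call many times between synchronization points and taking the median is more robust than a single per-launch duration. A minimal sketch of that warmup-then-median methodology (pure-Python timing for illustration; `run` is a placeholder for a closure like `lambda: sageattn_call(query, key, value)`, and on GPU you would call `torch.cuda.synchronize()` before each timestamp, or use `torch.cuda.Event`, so queued-but-unfinished kernels are not mistaken for fast ones):

```python
import time
import statistics


def bench(run, warmup=10, iters=50):
    """Return the median wall-clock time per call, after warmup."""
    # Warm-up iterations exclude one-time costs (JIT, caches, allocator).
    for _ in range(warmup):
        run()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

Running this once per installed version (fp32+fp32 vs. fp32+fp16) yields one directly comparable number per kernel.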