Skip to content

cuda : prevent integer truncation and overflow errors when using KQ mask strides in flash_attn_mask_to_KV_max kernel#24945

Open
fairydreaming wants to merge 1 commit into
ggml-org:masterfrom
fairydreaming:stride-narrow-conv-fix
Open

cuda : prevent integer truncation and overflow errors when using KQ mask strides in flash_attn_mask_to_KV_max kernel#24945
fairydreaming wants to merge 1 commit into
ggml-org:masterfrom
fairydreaming:stride-narrow-conv-fix

Conversation

@fairydreaming

@fairydreaming fairydreaming commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Overview

This PR prevents integer truncation and overflow errors in flash_attn_mask_to_KV_max kernel by changing type of s31 and s33 from int to size_t.

Fixes #24912

Additional information

When large KQ masks are used (for example in models with long context lengths like 1M tokens with large ubatch size used to speed up prompt processing) mask->nb[3] / sizeof(half2) can exceed int type value range resulting in s33 being interpreted as negative or smaller than expected due to value truncation. There's also another problem with jt*ncols1*s31 multiplication resulting in integer overflow when both jt*ncols1 and s31 are large, so s31 type was also changed from int to size_t.

Requirements

…ask strides in flash_attn_mask_to_KV_max kernel
@fairydreaming fairydreaming requested a review from a team as a code owner June 23, 2026 12:48
@github-actions github-actions Bot added ggml changes relating to the ggml tensor library for machine learning CUDA Related to the CUDA backend labels Jun 23, 2026

@am17an am17an left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ORippler

Copy link
Copy Markdown
Collaborator

Can we get a test-case for this?

@fairydreaming

Copy link
Copy Markdown
Collaborator Author

Can we get a test-case for this?

@ORippler If you mean test-backend-ops test case that fails with master and runs correctly with this PR it's somewhat problematic at this moment, since there are several areas there that make tests with huge tensors very slow to run (in some parts tensor value initialization is still single-threaded, processing result tensors (float conversion, error calculation) is also single-threaded and the reference CPU backend runs with the default number of threads that is 4. Moreover, NMSE value calculated in FLASH_ATTN_EXT test scales proportionally to KV cache length. Currently I have some local modifications that brought the "FLASH_ATTN_EXT(hsk=192,hsv=128,nh=4,nr23=[16,1],kv=1048576,nb=8192,mask=1,sinks=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_K=f16,type_V=f16,permute=[0,2,1,3])" test time down to ~2100s on my Epyc 9374F workstation, but that needs cleanup and likely another PR.

Anyway, without this PR:

$ time ./bin/test-backend-ops -o "FLASH_ATTN_EXT(hsk=192,hsv=128,nh=4,nr23=[16,1],kv=1048576,nb=8192,mask=1,sinks=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_K=f16,type_V=f16,permute=[0,2,1,3])"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

initialization of tensors took 37.957919 seconds
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /home/phm/projects/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:3249
  cudaStreamSynchronize(cuda_ctx->stream())
/home/phm/projects/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:104: CUDA error
[New LWP 9516]
[New LWP 9515]
[New LWP 9511]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x0000744ecad10813 in __GI___wait4 (pid=10797, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x0000744ecad10813 in __GI___wait4 (pid=10797, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x0000744ecb2bf663 in ggml_print_backtrace () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#2  0x0000744ecb2bf80b in ggml_abort () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#3  0x0000744ec7f8937f in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#4  0x0000744ec7f95778 in ggml_backend_cuda_synchronize(ggml_backend*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#5  0x0000744ecb2d75fc in ggml_backend_graph_compute () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#6  0x0000744ecb2dc5d1 in ggml_backend_compare_graph_backend () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#7  0x0000571bb2c6c3a2 in test_case::eval(ggml_backend*, ggml_backend*, char const*, printer*) ()
#8  0x0000571bb2c0d0a7 in test_backend(ggml_backend*, ggml_backend_device*, test_mode, char const*, char const*, printer*, char const*, int)::{lambda(ggml_backend*, ggml_backend*)#1}::operator()(ggml_backend*, ggml_backend*) const ()
#9  0x0000571bb2c2ec17 in test_backend(ggml_backend*, ggml_backend_device*, test_mode, char const*, char const*, printer*, char const*, int) ()
#10 0x0000571bb2bf46d1 in main ()
[Inferior 1 (process 9510) detached]
Aborted (core dumped)

real	0m51.128s
user	1m35.250s
sys	0m41.820s

With this PR:

$ time ./bin/test-backend-ops -o "FLASH_ATTN_EXT(hsk=192,hsv=128,nh=4,nr23=[16,1],kv=1048576,nb=8192,mask=1,sinks=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_K=f16,type_V=f16,permute=[0,2,1,3])"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

initialization of tensors took 38.256392 seconds
[FLASH_ATTN_EXT] ERR = 0.024052996 > 0.000500000 ggml_backend_compare_graph_backend() took 2070.979775 seconds
  FLASH_ATTN_EXT(hsk=192,hsv=128,nh=4,nr23=[16,1],kv=1048576,nb=8192,mask=1,sinks=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_K=f16,type_V=f16,permute=[0,2,1,3]): FAIL
  0/1 tests passed

Failing tests:
  FLASH_ATTN_EXT(hsk=192,hsv=128,nh=4,nr23=[16,1],kv=1048576,nb=8192,mask=1,sinks=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_K=f16,type_V=f16,permute=[0,2,1,3])
  Backend CUDA0: FAIL
Backend 2/2: CPU
  Skipping CPU backend
1/2 backends passed
FAIL

real	35m9.766s
user	2172m41.960s
sys	0m43.385s

So the test still fails because of ERR being too high, but there are no crashes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CUDA Related to the CUDA backend ggml changes relating to the ggml tensor library for machine learning

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Misc. bug: CUDA flash attention helper flash_attn_mask_to_KV_max() kernel does not work correctly for large KQ masks

4 participants