cuda : prevent integer truncation and overflow errors when using KQ mask strides in flash_attn_mask_to_KV_max kernel by fairydreaming · Pull Request #24945 · ggml-org/llama.cpp

fairydreaming · 2026-06-23T12:48:18Z

Overview

This PR prevents integer truncation and overflow errors in flash_attn_mask_to_KV_max kernel by changing type of s31 and s33 from int to size_t.

Fixes #24912

Additional information

When large KQ masks are used (for example in models with long context lengths like 1M tokens with large ubatch size used to speed up prompt processing) mask->nb[3] / sizeof(half2) can exceed int type value range resulting in s33 being interpreted as negative or smaller than expected due to value truncation. There's also another problem with jt*ncols1*s31 multiplication resulting in integer overflow when both jt*ncols1 and s31 are large, so s31 type was also changed from int to size_t.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

…ask strides in flash_attn_mask_to_KV_max kernel

am17an

cc @JohannesGaessler

ORippler · 2026-06-24T13:59:24Z

Can we get a test-case for this?

fairydreaming · 2026-06-25T10:49:54Z

Can we get a test-case for this?

@ORippler If you mean test-backend-ops test case that fails with master and runs correctly with this PR it's somewhat problematic at this moment, since there are several areas there that make tests with huge tensors very slow to run (in some parts tensor value initialization is still single-threaded, processing result tensors (float conversion, error calculation) is also single-threaded and the reference CPU backend runs with the default number of threads that is 4. Moreover, NMSE value calculated in FLASH_ATTN_EXT test scales proportionally to KV cache length. Currently I have some local modifications that brought the "FLASH_ATTN_EXT(hsk=192,hsv=128,nh=4,nr23=[16,1],kv=1048576,nb=8192,mask=1,sinks=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_K=f16,type_V=f16,permute=[0,2,1,3])" test time down to ~2100s on my Epyc 9374F workstation, but that needs cleanup and likely another PR.

Anyway, without this PR:

$ time ./bin/test-backend-ops -o "FLASH_ATTN_EXT(hsk=192,hsv=128,nh=4,nr23=[16,1],kv=1048576,nb=8192,mask=1,sinks=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_K=f16,type_V=f16,permute=[0,2,1,3])"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

initialization of tensors took 37.957919 seconds
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /home/phm/projects/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:3249
  cudaStreamSynchronize(cuda_ctx->stream())
/home/phm/projects/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:104: CUDA error
[New LWP 9516]
[New LWP 9515]
[New LWP 9511]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x0000744ecad10813 in __GI___wait4 (pid=10797, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x0000744ecad10813 in __GI___wait4 (pid=10797, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x0000744ecb2bf663 in ggml_print_backtrace () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#2  0x0000744ecb2bf80b in ggml_abort () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#3  0x0000744ec7f8937f in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#4  0x0000744ec7f95778 in ggml_backend_cuda_synchronize(ggml_backend*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#5  0x0000744ecb2d75fc in ggml_backend_graph_compute () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#6  0x0000744ecb2dc5d1 in ggml_backend_compare_graph_backend () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#7  0x0000571bb2c6c3a2 in test_case::eval(ggml_backend*, ggml_backend*, char const*, printer*) ()
#8  0x0000571bb2c0d0a7 in test_backend(ggml_backend*, ggml_backend_device*, test_mode, char const*, char const*, printer*, char const*, int)::{lambda(ggml_backend*, ggml_backend*)#1}::operator()(ggml_backend*, ggml_backend*) const ()
#9  0x0000571bb2c2ec17 in test_backend(ggml_backend*, ggml_backend_device*, test_mode, char const*, char const*, printer*, char const*, int) ()
#10 0x0000571bb2bf46d1 in main ()
[Inferior 1 (process 9510) detached]
Aborted (core dumped)

real	0m51.128s
user	1m35.250s
sys	0m41.820s

With this PR:

$ time ./bin/test-backend-ops -o "FLASH_ATTN_EXT(hsk=192,hsv=128,nh=4,nr23=[16,1],kv=1048576,nb=8192,mask=1,sinks=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_K=f16,type_V=f16,permute=[0,2,1,3])"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

initialization of tensors took 38.256392 seconds
[FLASH_ATTN_EXT] ERR = 0.024052996 > 0.000500000 ggml_backend_compare_graph_backend() took 2070.979775 seconds
  FLASH_ATTN_EXT(hsk=192,hsv=128,nh=4,nr23=[16,1],kv=1048576,nb=8192,mask=1,sinks=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_K=f16,type_V=f16,permute=[0,2,1,3]): FAIL
  0/1 tests passed

Failing tests:
  FLASH_ATTN_EXT(hsk=192,hsv=128,nh=4,nr23=[16,1],kv=1048576,nb=8192,mask=1,sinks=0,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_K=f16,type_V=f16,permute=[0,2,1,3])
  Backend CUDA0: FAIL
Backend 2/2: CPU
  Skipping CPU backend
1/2 backends passed
FAIL

real	35m9.766s
user	2172m41.960s
sys	0m43.385s

So the test still fails because of ERR being too high, but there are no crashes.

cuda : prevent integer truncation and overflow errors when using KQ m…

86a263b

…ask strides in flash_attn_mask_to_KV_max kernel

fairydreaming requested a review from a team as a code owner June 23, 2026 12:48

github-actions Bot added ggml changes relating to the ggml tensor library for machine learning CUDA Related to the CUDA backend labels Jun 23, 2026

am17an approved these changes Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cuda : prevent integer truncation and overflow errors when using KQ mask strides in flash_attn_mask_to_KV_max kernel#24945

cuda : prevent integer truncation and overflow errors when using KQ mask strides in flash_attn_mask_to_KV_max kernel#24945
fairydreaming wants to merge 1 commit into
ggml-org:masterfrom
fairydreaming:stride-narrow-conv-fix

fairydreaming commented Jun 23, 2026 •

edited

Loading

Uh oh!

am17an left a comment

Uh oh!

ORippler commented Jun 24, 2026

Uh oh!

fairydreaming commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

fairydreaming commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Additional information

Requirements

Uh oh!

am17an left a comment

Choose a reason for hiding this comment

Uh oh!

ORippler commented Jun 24, 2026

Uh oh!

fairydreaming commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fairydreaming commented Jun 23, 2026 •

edited

Loading