ggml : process data in smaller chunks in CUDA ggml_top_k() implementation to reduce temporary buffers memory usage by fairydreaming · Pull Request #24776 · ggml-org/llama.cpp

fairydreaming · 2026-06-18T18:42:51Z

Overview

This PR reduces temporary buffers memory usage in CUDA backend ggml_top_k() CUB implementation by processing input data in smaller chunks. Without this PR temporary buffers memory usage is 3 * input buffer size, allocated here:

llama.cpp/ggml/src/ggml-cuda/top-k.cu

Line 79 in d5376cf

ggml_cuda_pool_alloc<int> temp_dst_alloc(pool, ncols * nrows);

and here:

llama.cpp/ggml/src/ggml-cuda/argsort.cu

Lines 38 to 39 in d5376cf

    
           ggml_cuda_pool_alloc<int>   temp_indices_alloc(pool, ncols * nrows); 
        
           ggml_cuda_pool_alloc<float> temp_keys_alloc(pool, ncols * nrows);

With this PR memory usage for temporary buffers is only 3*min(input buffer size, 64MiB).

It also partially mitigates the problem of integer overflow in ncols * nrows product by lowering the amount of rows processed at once.

Fixes #24718

Additional information

For example when running this test (not present originally, I added it):

./bin/test-backend-ops -o "TOP_K(type=f32,ne=[65536,8192,1,1],k=2048,ties=0)"

without this PR memory usage in nvidia-smi goes up to 12968MiB, while with this PR it goes up only to 3048MiB.

Let's also compare the performance. Without this PR:

$ ./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[65536,8192,1,1],k=2048,ties=0)"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  TOP_K(type=f32,ne=[65536,8192,1,1],k=2048,ties=0):                      32 runs - 42707.91 us/run -  2162688 kB/run -   48.29 GB/s
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

with this PR:

$ ./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[65536,8192,1,1],k=2048,ties=0)"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  TOP_K(type=f32,ne=[65536,8192,1,1],k=2048,ties=0):                      32 runs - 42043.22 us/run -  2162688 kB/run -   49.06 GB/s
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

I ran test-backend-ops TOP-K tests and they all passed. Test failing in #24718 also passed:

  ...
  TOP_K(type=f32,ne=[176384,8192,1,1],k=2048,ties=0): OK
  TOP_K(type=f32,ne=[262144,8192,1,1],k=2048,ties=0): OK
  TOP_K(type=f32,ne=[843776,8192,1,1],k=2048,ties=0): OK

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

…tion to reduce temporary buffers memory usage

ggerganov · 2026-06-19T05:39:43Z

+    for (int64_t i = 0; i < nrows; i+= nrows_per_chunk) {
+        int64_t chunk_nrows = std::min(nrows_per_chunk, nrows - i);
+
+        ggml_cuda_pool_alloc<int> temp_dst_alloc(pool, ncols * chunk_nrows);


I'm not sure how the cuda pool works exactly - just wondering if we actually need to have this allocation inside the loop and not one time before it?

I'm not sure how the cuda pool works exactly - just wondering if we actually need to have this allocation inside the loop and not one time before it?

@ggerganov Good point, will try it out.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

fairydreaming · 2026-06-19T09:27:41Z

@ORippler Do you envision any problems with organizing top-k processing the way I did it in this PR (added loop processing smaller input chunks, temporary buffer allocated outside the loop)?

ORippler · 2026-06-24T13:50:09Z

@ORippler Do you envision any problems with organizing top-k processing the way I did it in this PR (added loop processing smaller input chunks, temporary buffer allocated outside the loop)?

No general problems with doing it this way. Some remarks:

You are currently modifying the sort+view-based fallback path. Ideally, this looping should be used by ggml_cuda_op_argsort also so top_k and argsort have similar memory footprints in the CUDA backend
What CCCL/CTK versions are you on? I'd expect cub::DeviceTopK and cub::DeviceSegmentedTopK to offer a significantly lower memory footprint on NVGPUs ~~(cub::DeviceSegmentedTopK should be available in CTK13.3/CCCL 3.3, but needs to be integrated into llama.cpp still): https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)~~ EDIT: this is not yet exposed, see Exposes DeviceBatchedTopK::{Min,Max}{Keys,Pairs} for non-deterministic, unordered, and small segments-only NVIDIA/cccl#9331

fairydreaming · 2026-06-24T19:28:41Z

@ORippler Do you envision any problems with organizing top-k processing the way I did it in this PR (added loop processing smaller input chunks, temporary buffer allocated outside the loop)?

No general problems with doing it this way. Some remarks:
1. You are currently modifying the sort+view-based fallback path. Ideally, this looping should be used by `ggml_cuda_op_argsort` also so top_k and argsort have similar memory footprints in the CUDA backend

@ORippler OK, will put the number of chunk rows calculation into a common function and apply this to argsort as well.

…ementations

ggml : process data in smaller chunks in CUDA ggml_top_k() implementa…

c1a79a6

…tion to reduce temporary buffers memory usage

fairydreaming requested a review from a team as a code owner June 18, 2026 18:42

github-actions Bot added ggml changes relating to the ggml tensor library for machine learning CUDA Related to the CUDA backend labels Jun 18, 2026

ggerganov reviewed Jun 19, 2026

View reviewed changes

sszymczy and others added 2 commits June 19, 2026 10:54

ggml : allocate tmp_dst only only once before the loop

ca88122

chore : whitespaces

4d0e838

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ggml : use chunked processing in both CUDA CUB top-k and argsort impl…

8521ec4

…ementations

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ggml : process data in smaller chunks in CUDA ggml_top_k() implementation to reduce temporary buffers memory usage#24776

ggml : process data in smaller chunks in CUDA ggml_top_k() implementation to reduce temporary buffers memory usage#24776
fairydreaming wants to merge 4 commits into
ggml-org:masterfrom
fairydreaming:chunked-top-k

fairydreaming commented Jun 18, 2026

Uh oh!

Uh oh!

ggerganov Jun 19, 2026

Uh oh!

fairydreaming Jun 19, 2026

Uh oh!

fairydreaming commented Jun 19, 2026

Uh oh!

ORippler commented Jun 24, 2026 •

edited

Loading

Uh oh!

fairydreaming commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	ggml_cuda_pool_alloc<int> temp_indices_alloc(pool, ncols * nrows);
	ggml_cuda_pool_alloc<float> temp_keys_alloc(pool, ncols * nrows);

Uh oh!

Conversation

fairydreaming commented Jun 18, 2026

Overview

Additional information

Requirements

Uh oh!

Uh oh!

ggerganov Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

fairydreaming Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

fairydreaming commented Jun 19, 2026

Uh oh!

ORippler commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fairydreaming commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ORippler commented Jun 24, 2026 •

edited

Loading