Skip to content

ggml : process data in smaller chunks in CUDA ggml_top_k() implementation to reduce temporary buffers memory usage#24776

Open
fairydreaming wants to merge 4 commits into
ggml-org:masterfrom
fairydreaming:chunked-top-k
Open

ggml : process data in smaller chunks in CUDA ggml_top_k() implementation to reduce temporary buffers memory usage#24776
fairydreaming wants to merge 4 commits into
ggml-org:masterfrom
fairydreaming:chunked-top-k

Conversation

@fairydreaming

Copy link
Copy Markdown
Collaborator

Overview

This PR reduces temporary buffers memory usage in CUDA backend ggml_top_k() CUB implementation by processing input data in smaller chunks. Without this PR temporary buffers memory usage is 3 * input buffer size, allocated here:

ggml_cuda_pool_alloc<int> temp_dst_alloc(pool, ncols * nrows);

and here:

ggml_cuda_pool_alloc<int> temp_indices_alloc(pool, ncols * nrows);
ggml_cuda_pool_alloc<float> temp_keys_alloc(pool, ncols * nrows);

With this PR memory usage for temporary buffers is only 3*min(input buffer size, 64MiB).

It also partially mitigates the problem of integer overflow in ncols * nrows product by lowering the amount of rows processed at once.

Fixes #24718

Additional information

For example when running this test (not present originally, I added it):

./bin/test-backend-ops -o "TOP_K(type=f32,ne=[65536,8192,1,1],k=2048,ties=0)"

without this PR memory usage in nvidia-smi goes up to 12968MiB, while with this PR it goes up only to 3048MiB.

Let's also compare the performance. Without this PR:

$ ./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[65536,8192,1,1],k=2048,ties=0)"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  TOP_K(type=f32,ne=[65536,8192,1,1],k=2048,ties=0):                      32 runs - 42707.91 us/run -  2162688 kB/run -   48.29 GB/s
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

with this PR:

$ ./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[65536,8192,1,1],k=2048,ties=0)"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  TOP_K(type=f32,ne=[65536,8192,1,1],k=2048,ties=0):                      32 runs - 42043.22 us/run -  2162688 kB/run -   49.06 GB/s
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

I ran test-backend-ops TOP-K tests and they all passed. Test failing in #24718 also passed:

  ...
  TOP_K(type=f32,ne=[176384,8192,1,1],k=2048,ties=0): OK
  TOP_K(type=f32,ne=[262144,8192,1,1],k=2048,ties=0): OK
  TOP_K(type=f32,ne=[843776,8192,1,1],k=2048,ties=0): OK

Requirements

…tion to reduce temporary buffers memory usage
@fairydreaming fairydreaming requested a review from a team as a code owner June 18, 2026 18:42
@github-actions github-actions Bot added ggml changes relating to the ggml tensor library for machine learning CUDA Related to the CUDA backend labels Jun 18, 2026
Comment thread ggml/src/ggml-cuda/top-k.cu Outdated
Comment thread ggml/src/ggml-cuda/top-k.cu Outdated
for (int64_t i = 0; i < nrows; i+= nrows_per_chunk) {
int64_t chunk_nrows = std::min(nrows_per_chunk, nrows - i);

ggml_cuda_pool_alloc<int> temp_dst_alloc(pool, ncols * chunk_nrows);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how the cuda pool works exactly - just wondering if we actually need to have this allocation inside the loop and not one time before it?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how the cuda pool works exactly - just wondering if we actually need to have this allocation inside the loop and not one time before it?

@ggerganov Good point, will try it out.

sszymczy and others added 2 commits June 19, 2026 10:54
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@fairydreaming

Copy link
Copy Markdown
Collaborator Author

@ORippler Do you envision any problems with organizing top-k processing the way I did it in this PR (added loop processing smaller input chunks, temporary buffer allocated outside the loop)?

@ORippler

ORippler commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

@ORippler Do you envision any problems with organizing top-k processing the way I did it in this PR (added loop processing smaller input chunks, temporary buffer allocated outside the loop)?

No general problems with doing it this way. Some remarks:

  1. You are currently modifying the sort+view-based fallback path. Ideally, this looping should be used by ggml_cuda_op_argsort also so top_k and argsort have similar memory footprints in the CUDA backend
  2. What CCCL/CTK versions are you on? I'd expect cub::DeviceTopK and cub::DeviceSegmentedTopK to offer a significantly lower memory footprint on NVGPUs (cub::DeviceSegmentedTopK should be available in CTK13.3/CCCL 3.3, but needs to be integrated into llama.cpp still): https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) EDIT: this is not yet exposed, see Exposes DeviceBatchedTopK::{Min,Max}{Keys,Pairs} for non-deterministic, unordered, and small segments-only NVIDIA/cccl#9331

@fairydreaming

Copy link
Copy Markdown
Collaborator Author

@ORippler Do you envision any problems with organizing top-k processing the way I did it in this PR (added loop processing smaller input chunks, temporary buffer allocated outside the loop)?

No general problems with doing it this way. Some remarks:

1. You are currently modifying the sort+view-based fallback path. Ideally, this looping should be used by `ggml_cuda_op_argsort` also so top_k and argsort have similar memory footprints in the CUDA backend

@ORippler OK, will put the number of chunk rows calculation into a common function and apply this to argsort as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CUDA Related to the CUDA backend ggml changes relating to the ggml tensor library for machine learning

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Misc. bug: ggml_top_k() CUDA CUB implementation runs out of memory for large tensors

4 participants