cuda: sanitize invalid Blackwell sharedMemPerBlockOptin #24991

Open
wgu9 wants to merge 1 commit into ggml-org:master from wgu9:fix-cuda-blackwell-smpbo-sanitize

wgu9 commented Jun 25, 2026

Some Blackwell CUDA driver/device combinations can report an invalid sharedMemPerBlockOptin value. Sanitize that value during CUDA device initialization and fall back to sharedMemPerBlock when the opt-in value is zero or larger than sharedMemPerMultiprocessor.
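
The guard described above can be sketched in plain C++ as follows. This is a minimal illustration and not the code in this PR: `SharedMemProps` and `sanitize_smpbo` are hypothetical names, the struct only stands in for the three `cudaDeviceProp` fields so the snippet compiles without the CUDA toolkit, and the byte values in the checks are made up for illustration.

```cpp
#include <cstddef>

// Hypothetical stand-in for the three cudaDeviceProp fields involved.
struct SharedMemProps {
    std::size_t per_block;        // mirrors sharedMemPerBlock
    std::size_t per_block_optin;  // mirrors sharedMemPerBlockOptin
    std::size_t per_sm;           // mirrors sharedMemPerMultiprocessor
};

// Return the opt-in limit when it is plausible; otherwise fall back to the
// default per-block limit. "Implausible" means zero or larger than what a
// single multiprocessor reports.
constexpr std::size_t sanitize_smpbo(const SharedMemProps & p) {
    return (p.per_block_optin == 0 || p.per_block_optin > p.per_sm)
        ? p.per_block
        : p.per_block_optin;
}

// Illustrative values only (bytes); these are not measurements from any GPU.
static_assert(sanitize_smpbo({48 * 1024, 99 * 1024, 100 * 1024}) == 99 * 1024,
              "a plausible opt-in value is kept");
static_assert(sanitize_smpbo({48 * 1024, 0, 100 * 1024}) == 48 * 1024,
              "zero falls back to the per-block default");
static_assert(sanitize_smpbo({48 * 1024, 1u << 30, 100 * 1024}) == 48 * 1024,
              "a value above the per-SM size falls back to the default");
```

In a real build the three inputs would presumably come from `cudaGetDeviceProperties`, and the sanitized result would replace the raw opt-in value wherever the backend stores it.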

Validation:

  • RTX 5090
  • SM120 CUDA build passed
  • test-backend-ops CUDA0 MUL_MAT passed 1134/1134

wgu9 requested a review from a team as a code owner on June 25, 2026 01:53
github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and CUDA (related to the CUDA backend) labels on Jun 25, 2026

am17an commented Jun 25, 2026

Contributor

@ggml-org/nvidia There have been multiple PRs that attempt to "fix" this issue. I'm now wondering whether this is a real issue.

wgu9 commented Jun 26, 2026

Author

Thanks for calling that out. I agree this should not merge unless the device-property issue is real and this PR is not just another speculative Blackwell workaround.

What I verified before opening this:

  • This came from an RTX 5090 / SM120 CUDA build where sharedMemPerBlockOptin was reported outside the usable per-SM limit during ggml CUDA init.
  • Falling back to sharedMemPerBlock let CUDA initialization continue with a conservative value instead of propagating an invalid opt-in shared-memory size into later launch/resource decisions.
  • After the guard, my local CUDA validation passed: SM120 CUDA build and test-backend-ops CUDA0 MUL_MAT passed 1134/1134.
  • I also searched current open and closed PRs/issues for the same sharedMemPerBlockOptin / sharedMemPerMultiprocessor guard and did not find a direct duplicate.
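
To illustrate why a bogus opt-in value can matter for later launch decisions, here is a small sketch of the kind of check a backend might perform before launching a kernel. `SmemFit` and `classify_smem_request` are hypothetical names and are not taken from ggml. The only CUDA fact assumed is that a kernel requesting more dynamic shared memory than the default per-block limit has to opt in first via `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize`.

```cpp
#include <cstddef>

// Outcome of comparing a kernel's shared-memory request with device limits.
enum class SmemFit { Default, NeedsOptIn, TooLarge };

// Hypothetical helper: decide how a request of `requested` bytes relates to
// the default per-block limit and the (possibly sanitized) opt-in limit.
inline SmemFit classify_smem_request(std::size_t requested,
                                     std::size_t per_block,
                                     std::size_t optin_limit) {
    if (requested <= per_block) {
        return SmemFit::Default;     // fits without any opt-in
    }
    if (requested <= optin_limit) {
        return SmemFit::NeedsOptIn;  // caller would raise the kernel's limit first
    }
    return SmemFit::TooLarge;        // caller should pick another kernel or tile size
}
```

With an opt-in limit that exceeds the per-SM size, a request the hardware cannot satisfy would be classified as `NeedsOptIn`, and the failure would only surface later at the runtime call. Passing the conservative fallback as `optin_limit` turns the same request into `TooLarge` up front.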

If the NVIDIA maintainers think the driver/device report should be treated as impossible or fixed lower in the stack, I am fine closing this. The intent here is only to add a narrow defensive guard around an invalid device property, not to mask unrelated Blackwell issues.
