conv: avoid O(groups) kernel launches for depthwise conv1d/conv2d#3531
Open
RyanJamesStewart wants to merge 1 commit into
Conversation
Fixes the depthwise part of #3389.
Problem
`Tensor::conv1d` / `conv2d` with `groups > 1` decompose into `groups` separate single-group convolutions plus a `cat`. For depthwise convolution (`groups == in_channels`) that's O(groups) tiny kernel launches: `groups = 2048, k = 3` means thousands of CUDA launches whose host/driver overhead dwarfs the trivial per-channel work. #3389 reports depthwise conv layers at ~54% of total inference time on an RTX 5090 (~33 ms/layer) vs <0.1 ms for a hand-written expansion.

Change
For the common depthwise case (`groups == c_in`, `c_in_k == 1`, i.e. no channel multiplier, unit stride/dilation) compute the convolution directly as a sum of `k` slices, each scaled by a per-channel scalar: `out[b,c,..] = sum_k input[b,c,..+k] * weight[c,k]`. That's a fixed number of elementwise kernels (~2*k) regardless of channel count, numerically equivalent to the existing backends (up to float summation order; the CPU/CUDA paths already differ from each other to that degree). It's pure `Tensor`-level code, so it benefits every backend (CPU / CUDA / Metal) and stays differentiable. All other `groups > 1` cases keep the existing per-group path unchanged.

The `stride == dilation == 1` guard is only because there's no strided-narrow primitive yet; that regime is the dominant one (MobileNet/EfficientNet 3×3, LFM2 / Mamba conv) and covers the case in the issue. The general grouped-conv case (`c_in_k > 1`, ResNeXt-style) needs a separate fix: passing `groups` through to `cudnnSetConvolutionGroupCount`, a batched-GEMM rewrite, or a fused grouped kernel, as discussed in #3389.

Tests / validation
Adds `conv1d_depthwise` / `conv2d_depthwise` tests that cross-check the fast path against an independent code path (the equivalent block-diagonal dense weight run through `groups == 1`).

- `cargo test -p candle-core --test conv_tests`: 10/10; `cargo test -p candle-nn` green; `cargo clippy -p candle-core --tests`: 0 warnings; `cargo fmt --check` clean.
- `cargo test -p candle-core --features cuda --test conv_tests`: 20/20 pass, incl. the new `conv1d_depthwise_gpu` / `conv2d_depthwise_gpu`.
- Microbench, depthwise `conv1d(1, 2048, 21) × (2048, 1, 3)`, `groups=2048`, 100 iters after warmup: 31971 µs/call → 23 µs/call (≈1390×); `cuLaunchKernel` count per call (via `nsys`): ~4096 → ~5.
- `--features cuda,cudnn` has a pre-existing exact-equality assertion failure in the non-depthwise `conv1d_gpu` test (cuDNN picks a different algorithm with ~1e-3 different rounding), unrelated to this change; all depthwise tests pass under cuDNN too.

+154/−0, 2 files.
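For reference, the slice-sum reduction and its block-diagonal cross-check can be sketched in plain std-only Rust (a standalone illustration using `Vec<f32>` in place of candle's `Tensor`; the function names `conv1d_depthwise` / `conv1d_dense` here are hypothetical, not the PR's actual code):

```rust
// Depthwise conv1d as a sum of k shifted slices, each scaled by a
// per-channel scalar -- the reduction the fast path performs with ~2*k
// elementwise kernels instead of O(groups) per-group convolutions.
// input: [c][l], weight: [c][k], unit stride/dilation, no padding.
fn conv1d_depthwise(input: &[Vec<f32>], weight: &[Vec<f32>], k: usize) -> Vec<Vec<f32>> {
    let c = input.len();
    let l_out = input[0].len() - k + 1;
    let mut out = vec![vec![0.0f32; l_out]; c];
    for tap in 0..k {
        // one "narrow + scale-by-per-channel-scalar + add" step per kernel tap
        for ch in 0..c {
            let w = weight[ch][tap];
            for i in 0..l_out {
                out[ch][i] += input[ch][i + tap] * w;
            }
        }
    }
    out
}

// Independent reference path: a full dense (groups == 1) conv1d.
// weight: [c_out][c_in][k].
fn conv1d_dense(input: &[Vec<f32>], weight: &[Vec<Vec<f32>>], k: usize) -> Vec<Vec<f32>> {
    let l_out = input[0].len() - k + 1;
    let mut out = vec![vec![0.0f32; l_out]; weight.len()];
    for (oc, w_oc) in weight.iter().enumerate() {
        for (ic, w_ic) in w_oc.iter().enumerate() {
            for tap in 0..k {
                for i in 0..l_out {
                    out[oc][i] += input[ic][i + tap] * w_ic[tap];
                }
            }
        }
    }
    out
}

fn main() {
    let k = 3;
    // 2 channels, length 5
    let input = vec![
        vec![1.0, 2.0, 3.0, 4.0, 5.0],
        vec![5.0, 4.0, 3.0, 2.0, 1.0],
    ];
    let weight = vec![vec![1.0, 0.0, -1.0], vec![0.5, 0.5, 0.5]];

    let fast = conv1d_depthwise(&input, &weight, k);

    // Cross-check strategy from the tests: expand the depthwise weight into
    // the equivalent block-diagonal dense weight and run it with groups == 1.
    let c = input.len();
    let mut dense = vec![vec![vec![0.0f32; k]; c]; c];
    for ch in 0..c {
        dense[ch][ch] = weight[ch].clone();
    }
    let reference = conv1d_dense(&input, &dense, k);

    assert_eq!(fast, reference);
    assert_eq!(fast[0], vec![-2.0, -2.0, -2.0]); // x[i] - x[i+2]
    assert_eq!(fast[1], vec![6.0, 4.5, 3.0]);    // 0.5 * 3-tap sum
    println!("fast path matches block-diagonal dense reference: {:?}", fast);
}
```

Note the loop over `tap` is the outer loop: channel count never affects how many elementwise passes run, which is exactly why launch count stays ~constant as `groups` grows.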