
conv: avoid O(groups) kernel launches for depthwise conv1d/conv2d#3531

Open
RyanJamesStewart wants to merge 1 commit into huggingface:main from RyanJamesStewart:perf/conv-depthwise-no-per-group-launch

Conversation

@RyanJamesStewart

Fixes the depthwise part of #3389.

Problem

Tensor::conv1d / conv2d with groups > 1 decompose into groups separate single-group convolutions plus a cat. For depthwise convolution (groups == in_channels) that's O(groups) tiny kernel launches — groups = 2048, k = 3 → thousands of CUDA launches whose host/driver overhead dwarfs the trivial per-channel work. #3389 reports depthwise conv layers at ~54% of total inference time on an RTX 5090 (~33 ms/layer) vs <0.1 ms for a hand-written expansion.
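For intuition, the slow path can be sketched in plain Rust (illustrative shapes and names, not candle's actual internals): a grouped conv1d loops over groups, and each iteration stands in for one single-group kernel launch.

```rust
// Illustrative sketch (not candle's code): grouped conv1d as `groups` single-group
// convs plus a channel concat. Batch dim omitted; input is (c_in, l), weight is
// (c_out, c_in/groups, k). `launches` counts the per-group conv calls — for a
// depthwise conv (groups == c_in) it equals the channel count.
fn conv1d_single_group(input: &[Vec<f32>], weight: &[Vec<Vec<f32>>]) -> Vec<Vec<f32>> {
    let k = weight[0][0].len();
    let l_out = input[0].len() - k + 1;
    weight
        .iter()
        .map(|w_oc| {
            (0..l_out)
                .map(|i| {
                    input
                        .iter()
                        .zip(w_oc)
                        .map(|(row, wr)| (0..k).map(|j| row[i + j] * wr[j]).sum::<f32>())
                        .sum()
                })
                .collect()
        })
        .collect()
}

fn conv1d_grouped(
    input: &[Vec<f32>],
    weight: &[Vec<Vec<f32>>],
    groups: usize,
    launches: &mut usize,
) -> Vec<Vec<f32>> {
    let cin_g = input.len() / groups;
    let cout_g = weight.len() / groups;
    let mut out = Vec::new();
    for g in 0..groups {
        *launches += 1; // one single-group conv per group: O(groups) launches
        let inp_g = &input[g * cin_g..(g + 1) * cin_g];
        let w_g = &weight[g * cout_g..(g + 1) * cout_g];
        out.extend(conv1d_single_group(inp_g, w_g)); // the `cat` along channels
    }
    out
}
```

With groups == c_in == 2048, the loop body runs 2048 times, which is where the per-call launch overhead in the issue comes from.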

Change

For the common depthwise case (groups == c_in, c_in_k == 1 — no channel multiplier, unit stride/dilation) compute the convolution directly as a sum of k slices each scaled by a per-channel scalar:

out[b, c, ..] = Σ_k  input[b, c, ..+k] · weight[c, k]

That's a fixed number of elementwise kernels (~2·k) regardless of channel count, numerically equivalent to the existing backends (up to float summation order; the CPU/CUDA paths already differ from each other to that degree). It's pure Tensor-level code, so it benefits every backend (CPU / CUDA / Metal) and stays differentiable. All other groups > 1 cases keep the existing per-group path unchanged.
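The two formulations can be checked against each other in a plain-Rust sketch (batch × channel × length as nested Vecs; names are illustrative): the naive per-output-point sum and the k-pass slice-sum produce identical values, since both accumulate taps in the same order.

```rust
// Illustrative sketch of out[b][c][i] = Σ_j input[b][c][i + j] · weight[c][j].
// Reference: compute each output point directly.
fn depthwise_naive(input: &[Vec<Vec<f32>>], weight: &[Vec<f32>]) -> Vec<Vec<Vec<f32>>> {
    let k = weight[0].len();
    input
        .iter()
        .map(|chans| {
            chans
                .iter()
                .enumerate()
                .map(|(c, row)| {
                    (0..row.len() - k + 1)
                        .map(|i| (0..k).map(|j| row[i + j] * weight[c][j]).sum())
                        .collect()
                })
                .collect()
        })
        .collect()
}

// Fast-path formulation: k passes, each adding one shifted slice scaled by a
// per-channel scalar — the pass count is fixed at k, independent of channels.
fn depthwise_slice_sum(input: &[Vec<Vec<f32>>], weight: &[Vec<f32>]) -> Vec<Vec<Vec<f32>>> {
    let k = weight[0].len();
    let l_out = input[0][0].len() - k + 1;
    let mut out = vec![vec![vec![0.0f32; l_out]; input[0].len()]; input.len()];
    for j in 0..k {
        for (b, chans) in input.iter().enumerate() {
            for (c, row) in chans.iter().enumerate() {
                let s = weight[c][j]; // one scalar per channel per tap
                for i in 0..l_out {
                    out[b][c][i] += row[i + j] * s;
                }
            }
        }
    }
    out
}
```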

The stride == dilation == 1 guard is only because there's no strided-narrow primitive yet; that regime is the dominant one (MobileNet/EfficientNet 3×3, LFM2 / Mamba conv) and covers the case in the issue. The general grouped-conv case (c_in_k > 1, ResNeXt-style) needs a separate fix — passing groups through to cudnnSetConvolutionGroupCount, a batched-GEMM rewrite, or a fused grouped kernel — discussed in #3389.

Tests / validation

Adds conv1d_depthwise / conv2d_depthwise tests that cross-check the fast path against an independent code path (the equivalent block-diagonal dense weight run through groups == 1).
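The cross-check idea in miniature (plain Rust, illustrative shapes, batch dim omitted): embed the depthwise weight on the diagonal of a dense (c_out, c_in, k) weight and run it as an ordinary groups == 1 conv; the outputs must match, because the off-diagonal blocks contribute exact zeros.

```rust
// Illustrative sketch of the test strategy: a depthwise conv equals a dense
// groups == 1 conv whose weight is block-diagonal over channels.
fn conv1d_dense(input: &[Vec<f32>], weight: &[Vec<Vec<f32>>]) -> Vec<Vec<f32>> {
    let k = weight[0][0].len();
    let l_out = input[0].len() - k + 1;
    weight
        .iter()
        .map(|w_oc| {
            (0..l_out)
                .map(|i| {
                    input
                        .iter()
                        .zip(w_oc)
                        .map(|(row, wr)| (0..k).map(|j| row[i + j] * wr[j]).sum::<f32>())
                        .sum()
                })
                .collect()
        })
        .collect()
}

fn conv1d_depthwise(input: &[Vec<f32>], weight: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let k = weight[0].len();
    input
        .iter()
        .enumerate()
        .map(|(c, row)| {
            (0..row.len() - k + 1)
                .map(|i| (0..k).map(|j| row[i + j] * weight[c][j]).sum())
                .collect()
        })
        .collect()
}

// Embed a (c, k) depthwise weight as a block-diagonal (c, c, k) dense weight.
fn block_diagonal(weight: &[Vec<f32>]) -> Vec<Vec<Vec<f32>>> {
    let c = weight.len();
    let k = weight[0].len();
    (0..c)
        .map(|o| {
            (0..c)
                .map(|i| if i == o { weight[o].clone() } else { vec![0.0; k] })
                .collect()
        })
        .collect()
}
```

Because the independent path never touches the fast-path code, a bug in the slice-sum rewrite cannot cancel out in the comparison.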

  • CPU: cargo test -p candle-core --test conv_tests — 10/10 pass, cargo test -p candle-nn green, cargo clippy -p candle-core --tests 0 warnings, cargo fmt --check clean.
  • GPU, validated on an RTX PRO 6000 (Blackwell, CUDA 12.9): cargo test -p candle-core --features cuda --test conv_tests — 20/20 pass, incl. the new conv1d_depthwise_gpu / conv2d_depthwise_gpu. Microbench, depthwise conv1d (1, 2048, 21) × (2048, 1, 3) groups=2048, 100 iters after warmup: 31971 µs/call → 23 µs/call (≈1390×); cuLaunchKernel count per call (via nsys): ~4096 → ~5.
  • --features cuda,cudnn has a pre-existing exact-equality assertion failure in the non-depthwise conv1d_gpu test (cuDNN picks a different algorithm with ~1e-3 different rounding) — unrelated to this change; all depthwise tests pass under cuDNN too.

+154/−0, 2 files.
