
conv: avoid O(groups) kernel launches for depthwise conv1d/conv2d#3531

Open
RyanJamesStewart wants to merge 1 commit into huggingface:main from RyanJamesStewart:perf/conv-depthwise-no-per-group-launch

Conversation

@RyanJamesStewart

Fixes the depthwise part of #3389.

Problem

Tensor::conv1d / conv2d with groups > 1 decompose into groups separate single-group convolutions plus a cat. For depthwise convolution (groups == in_channels) that's O(groups) tiny kernel launches — groups = 2048, k = 3 → thousands of CUDA launches whose host/driver overhead dwarfs the trivial per-channel work. #3389 reports depthwise conv layers at ~54% of total inference time on an RTX 5090 (~33 ms/layer) vs <0.1 ms for a hand-written expansion.
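For intuition, the slow path can be sketched in plain Rust (illustrative shapes and names, not candle's actual internals): a grouped conv1d loops over groups, and each iteration stands in for one single-group kernel launch.

```rust
// Illustrative sketch (not candle's code): grouped conv1d as `groups` single-group
// convs plus a channel concat. Batch dim omitted; input is (c_in, l), weight is
// (c_out, c_in/groups, k). `launches` counts the per-group conv calls — for a
// depthwise conv (groups == c_in) it equals the channel count.
fn conv1d_single_group(input: &[Vec<f32>], weight: &[Vec<Vec<f32>>]) -> Vec<Vec<f32>> {
    let k = weight[0][0].len();
    let l_out = input[0].len() - k + 1;
    weight
        .iter()
        .map(|w_oc| {
            (0..l_out)
                .map(|i| {
                    input
                        .iter()
                        .zip(w_oc)
                        .map(|(row, wr)| (0..k).map(|j| row[i + j] * wr[j]).sum::<f32>())
                        .sum()
                })
                .collect()
        })
        .collect()
}

fn conv1d_grouped(
    input: &[Vec<f32>],
    weight: &[Vec<Vec<f32>>],
    groups: usize,
    launches: &mut usize,
) -> Vec<Vec<f32>> {
    let cin_g = input.len() / groups;
    let cout_g = weight.len() / groups;
    let mut out = Vec::new();
    for g in 0..groups {
        *launches += 1; // one single-group conv per group: O(groups) launches
        let inp_g = &input[g * cin_g..(g + 1) * cin_g];
        let w_g = &weight[g * cout_g..(g + 1) * cout_g];
        out.extend(conv1d_single_group(inp_g, w_g)); // the `cat` along channels
    }
    out
}
```

With groups == c_in == 2048, the loop body runs 2048 times, which is where the per-call launch overhead in the issue comes from.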

Change

For the common depthwise case (groups == c_in, c_in_k == 1 — no channel multiplier, unit stride/dilation) compute the convolution directly as a sum of k slices each scaled by a per-channel scalar:

out[b, c, ..] = Σ_k  input[b, c, ..+k] · weight[c, k]

That's a fixed number of elementwise kernels (~2·k) regardless of channel count, numerically equivalent to the existing backends (up to float summation order; the CPU/CUDA paths already differ from each other to that degree). It's pure Tensor-level code, so it benefits every backend (CPU / CUDA / Metal) and stays differentiable. All other groups > 1 cases keep the existing per-group path unchanged.
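The two formulations can be checked against each other in a plain-Rust sketch (batch × channel × length as nested Vecs; names are illustrative): the naive per-output-point sum and the k-pass slice-sum produce identical values, since both accumulate taps in the same order.

```rust
// Illustrative sketch of out[b][c][i] = Σ_j input[b][c][i + j] · weight[c][j].
// Reference: compute each output point directly.
fn depthwise_naive(input: &[Vec<Vec<f32>>], weight: &[Vec<f32>]) -> Vec<Vec<Vec<f32>>> {
    let k = weight[0].len();
    input
        .iter()
        .map(|chans| {
            chans
                .iter()
                .enumerate()
                .map(|(c, row)| {
                    (0..row.len() - k + 1)
                        .map(|i| (0..k).map(|j| row[i + j] * weight[c][j]).sum())
                        .collect()
                })
                .collect()
        })
        .collect()
}

// Fast-path formulation: k passes, each adding one shifted slice scaled by a
// per-channel scalar — the pass count is fixed at k, independent of channels.
fn depthwise_slice_sum(input: &[Vec<Vec<f32>>], weight: &[Vec<f32>]) -> Vec<Vec<Vec<f32>>> {
    let k = weight[0].len();
    let l_out = input[0][0].len() - k + 1;
    let mut out = vec![vec![vec![0.0f32; l_out]; input[0].len()]; input.len()];
    for j in 0..k {
        for (b, chans) in input.iter().enumerate() {
            for (c, row) in chans.iter().enumerate() {
                let s = weight[c][j]; // one scalar per channel per tap
                for i in 0..l_out {
                    out[b][c][i] += row[i + j] * s;
                }
            }
        }
    }
    out
}
```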

The stride == dilation == 1 guard is only because there's no strided-narrow primitive yet; that regime is the dominant one (MobileNet/EfficientNet 3×3, LFM2 / Mamba conv) and covers the case in the issue. The general grouped-conv case (c_in_k > 1, ResNeXt-style) needs a separate fix — passing groups through to cudnnSetConvolutionGroupCount, a batched-GEMM rewrite, or a fused grouped kernel — discussed in #3389.

Tests / validation

Adds conv1d_depthwise / conv2d_depthwise tests that cross-check the fast path against an independent code path (the equivalent block-diagonal dense weight run through groups == 1).
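The cross-check idea in miniature (plain Rust, illustrative shapes, batch dim omitted): embed the depthwise weight on the diagonal of a dense (c_out, c_in, k) weight and run it as an ordinary groups == 1 conv; the outputs must match, because the off-diagonal blocks contribute exact zeros.

```rust
// Illustrative sketch of the test strategy: a depthwise conv equals a dense
// groups == 1 conv whose weight is block-diagonal over channels.
fn conv1d_dense(input: &[Vec<f32>], weight: &[Vec<Vec<f32>>]) -> Vec<Vec<f32>> {
    let k = weight[0][0].len();
    let l_out = input[0].len() - k + 1;
    weight
        .iter()
        .map(|w_oc| {
            (0..l_out)
                .map(|i| {
                    input
                        .iter()
                        .zip(w_oc)
                        .map(|(row, wr)| (0..k).map(|j| row[i + j] * wr[j]).sum::<f32>())
                        .sum()
                })
                .collect()
        })
        .collect()
}

fn conv1d_depthwise(input: &[Vec<f32>], weight: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let k = weight[0].len();
    input
        .iter()
        .enumerate()
        .map(|(c, row)| {
            (0..row.len() - k + 1)
                .map(|i| (0..k).map(|j| row[i + j] * weight[c][j]).sum())
                .collect()
        })
        .collect()
}

// Embed a (c, k) depthwise weight as a block-diagonal (c, c, k) dense weight.
fn block_diagonal(weight: &[Vec<f32>]) -> Vec<Vec<Vec<f32>>> {
    let c = weight.len();
    let k = weight[0].len();
    (0..c)
        .map(|o| {
            (0..c)
                .map(|i| if i == o { weight[o].clone() } else { vec![0.0; k] })
                .collect()
        })
        .collect()
}
```

Because the independent path never touches the fast-path code, a bug in the slice-sum rewrite cannot cancel out in the comparison.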

  • CPU: cargo test -p candle-core --test conv_tests — 10/10 pass, cargo test -p candle-nn green, cargo clippy -p candle-core --tests 0 warnings, cargo fmt --check clean.
  • GPU, validated on an RTX PRO 6000 (Blackwell, CUDA 12.9): cargo test -p candle-core --features cuda --test conv_tests — 20/20 pass, incl. the new conv1d_depthwise_gpu / conv2d_depthwise_gpu. Microbench, depthwise conv1d (1, 2048, 21) × (2048, 1, 3) groups=2048, 100 iters after warmup: 31971 µs/call → 23 µs/call (≈1390×); cuLaunchKernel count per call (via nsys): ~4096 → ~5.
  • --features cuda,cudnn has a pre-existing exact-equality assertion failure in the non-depthwise conv1d_gpu test (cuDNN picks a different algorithm with ~1e-3 different rounding) — unrelated to this change; all depthwise tests pass under cuDNN too.

+154/−0, 2 files.
