WASAPI: AUTOCONVERTPCM (#1097) produces silent input streams on Windows 11 24H2 Communications-class endpoints #1200

@louis030195

Summary

Since cpal v0.17.2, default_input_config() for "Communications-class" USB microphones on Windows 11 24H2 returns 16 kHz mono F32 (the system Communications mix format), and the resulting WASAPI capture stream delivers genuine zero/near-zero samples — i.e. silence at the noise floor — while the same physical microphone records normal speech levels via DirectShow on the same machine at the same moment.

The regression bisects to PR #1097, "wasapi: Enable resampling and rate adjustment" (merged 2026-01-29, released in v0.17.2 on 2026-02-08). My downstream users started reporting "audio looks like it's capturing, but the files are basically silent with a bit of white noise" exactly when their auto-updater pulled them through the cpal-bump release.

A precise measurement, same mic + same speaker + same minute:

Path                                                      Mean       Peak
ffmpeg -f dshow -i audio="<mic>"                          -28.6 dB   -3.3 dB (normal speech)
cpal WASAPI (default_input_config + build_input_stream)   -85.5 dB   -42.9 dB (noise floor)

Mean to mean, that's a ~57 dB delta (~700× lower amplitude); peak to peak, ~40 dB (~95×). It's not an offset; the samples are genuinely at or near zero, not misinterpreted bytes from a format mismatch (see hex dump below).
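
A quick way to sanity-check figures like these is the standard amplitude relation ratio = 10^(dB / 20). The sketch below applies it to the mean levels from the table; the function name is illustrative only and not part of cpal.

```rust
/// Convert a level difference in decibels to a linear amplitude ratio.
/// For amplitude (not power) quantities: ratio = 10^(dB / 20).
fn db_to_amplitude_ratio(db: f64) -> f64 {
    10f64.powf(db / 20.0)
}

fn main() {
    // Mean levels taken from the measurement table above.
    let dshow_mean_db = -28.6;
    let cpal_mean_db = -85.5;

    // Difference between the two capture paths, in dB.
    let delta_db = dshow_mean_db - cpal_mean_db;

    println!("mean delta: {:.1} dB", delta_db);
    println!("amplitude ratio: {:.0}x", db_to_amplitude_ratio(delta_db));
}
```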

Environment

  • OS: Windows 11 Pro 24H2 (build 10.0.26200, Insider)
  • cpal: v0.18.0 (downstream fork pinned to a commit based on upstream main; behavior is the same as v0.17.2+)
  • Hardware (reproduced on both): USB headset (Jabra Evolve 75) and USB webcam (Logi C270 HD WebCam)
  • Working baseline: Same machine, same mics, ffmpeg via DirectShow → normal speech levels
  • Not affected on the same machine: Built-in Microphone Array (Intel Smart Sound Technology) — exposes 48 kHz stereo via WASAPI and records normally. Only the USB Communications-class endpoints are silent.

Reproduction

  1. Use a USB headset or USB webcam mic that Windows registers as a Communications-class endpoint on Win11 24H2 (verifiable: mmsys.cpl → Recording → properties → the device is set as both Default Device AND Default Communications Device).
  2. Enumerate via cpal:
use cpal::traits::{DeviceTrait, HostTrait};
fn main() {
    let host = cpal::default_host();
    for d in host.input_devices().unwrap() {
        let name = d.name().unwrap_or("?".into());
        println!("=== {} ===", name);
        if let Ok(c) = d.default_input_config() {
            println!("  default: {:?} {} ch @ {} Hz",
                c.sample_format(), c.channels(), c.sample_rate().0);
        }
        if let Ok(configs) = d.supported_input_configs() {
            for c in configs {
                println!("  supported: {:?} {} ch @ {}-{} Hz",
                    c.sample_format(), c.channels(),
                    c.min_sample_rate().0, c.max_sample_rate().0);
            }
        }
    }
}

Output on the affected machine:

=== Microphone (Logi C270 HD WebCam) ===
  default: F32 1 ch @ 16000 Hz
  supported: F32 1 ch @ 16000-16000 Hz
  supported: I32 1 ch @ 16000-16000 Hz
  supported: I16 1 ch @ 16000-16000 Hz
  supported: U8  1 ch @ 16000-16000 Hz

=== Headset (Jabra Evolve 75) ===
  default: F32 1 ch @ 16000 Hz
  supported: F32 1 ch @ 16000-16000 Hz
  ... same as above

Note: cpal exposes only 16 kHz for these devices — which is not a native hardware rate. ffmpeg -f dshow -list_options true for the same devices lists 8000 / 11025 / 22050 / 32000 / 44100 / 48000 / 96000 Hz × 1/2 ch × 8/16-bit. 16 kHz is the Windows Communications-class mix format, and AUTOCONVERTPCM is what makes WASAPI accept that rate via server-side resampling.

  3. Build an input stream with default_input_config() and dump samples. Result: stream callbacks fire at the expected rate, but every sample value is 0, ±1, or extremely-near-zero noise. Decoded as s16le, the first 256 bytes of one capture look like:
00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 FF FF 00 00 00 00 01 00
01 00 00 00 00 00 FF FF 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 00
00 00 00 00 00 00 FF FF 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[…continues with the same near-zero pattern]

For comparison, the same mic captured via DirectShow in the same second (decoded to s16le):

C2 FE 53 FE 87 FE A0 FE EE FE 4F FF 7E FF D9 FF 20 00 23 00 83 00 1A 01
4E 01 5F 01 8D 01 BF 01 F3 01 1F 02 3C 02 5D 02 8E 02 CF 02 1C 03 73 03
[…normal speech signal continues]

The cpal samples are not misinterpreted bytes from a format mismatch — they are genuinely zero. The format negotiation succeeds; the stream just doesn't carry any signal.
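
To make the comparison concrete, here is a minimal sketch that decodes the first row of each dump above as s16le and reports the largest absolute sample value. The helper names are hypothetical; only the byte values come from the dumps.

```rust
/// Decode little-endian signed 16-bit PCM bytes into samples.
/// A trailing odd byte, if any, is ignored.
fn decode_s16le(bytes: &[u8]) -> Vec<i16> {
    bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

/// Largest absolute sample value, widened to i32 so i16::MIN cannot overflow.
fn peak_abs(samples: &[i16]) -> i32 {
    samples.iter().map(|&s| (s as i32).abs()).max().unwrap_or(0)
}

fn main() {
    // First row of the cpal/WASAPI dump.
    let wasapi: [u8; 24] = [
        0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00,
    ];
    // First row of the DirectShow dump.
    let dshow: [u8; 24] = [
        0xC2, 0xFE, 0x53, 0xFE, 0x87, 0xFE, 0xA0, 0xFE, 0xEE, 0xFE, 0x4F, 0xFF,
        0x7E, 0xFF, 0xD9, 0xFF, 0x20, 0x00, 0x23, 0x00, 0x83, 0x00, 0x1A, 0x01,
    ];

    for (label, bytes) in [("WASAPI", &wasapi[..]), ("DirectShow", &dshow[..])] {
        let samples = decode_s16le(bytes);
        println!("{}: peak |sample| = {} (full scale 32768)", label, peak_abs(&samples));
    }
}
```

The WASAPI row never leaves the range -1..=1, while the DirectShow row already swings by several hundred counts within twelve samples.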

Why I believe PR #1097 is the cause

  • PR #1097 ("wasapi: Enable resampling and rate adjustment") enables AUDCLNT_STREAMFLAGS_AUTOCONVERTPCM in the WASAPI Initialize call so non-native rates can be requested through the server-side resampler.
  • The PR thread (#1097) acknowledges this flag was non-standard prior to Windows 10, with no testing reported on Win11 24H2 Communications-class endpoints.
  • On Win11 24H2 specifically, the WASAPI audio engine appears to apply a privacy/Communications policy when a non-Communications consumer opens a Communications-class endpoint at the Communications mix format (16 kHz F32 mono): Initialize succeeds, the stream "plays," callbacks fire — but the samples delivered are zero.
  • Reverting to v0.15.3 (a release that predates AUTOCONVERTPCM) restores normal capture on the exact same hardware. (We confirmed the timeline: our downstream stopped working when users were rolled past the cpal v0.17.2+ release; no other audio-code changes correlate.)
  • The Intel Smart Sound mic array on the same machine is NOT a Communications-class endpoint, exposes 48 kHz stereo via WASAPI (without AUTOCONVERTPCM in play), and records normally.
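
For readers unfamiliar with the mechanism: the flag is a single bit ORed into the stream-flags argument of IAudioClient::Initialize. The sketch below only illustrates that bit arithmetic. The constant values are those defined in the Windows SDK header audioclient.h; the helper function, its parameter, and the choice of companion flags (event-driven callbacks, SRC_DEFAULT_QUALITY) are assumptions for illustration and not cpal's actual code.

```rust
// Values as defined in the Windows SDK header audioclient.h.
const AUDCLNT_STREAMFLAGS_EVENTCALLBACK: u32 = 0x0004_0000;
const AUDCLNT_STREAMFLAGS_SRC_DEFAULT_QUALITY: u32 = 0x0800_0000;
const AUDCLNT_STREAMFLAGS_AUTOCONVERTPCM: u32 = 0x8000_0000;

/// Hypothetical helper (not cpal's code): build the stream-flags word,
/// adding the server-side format conversion bits only when requested.
fn stream_flags(auto_convert: bool) -> u32 {
    // Assumption: the stream is event-driven in both cases.
    let mut flags = AUDCLNT_STREAMFLAGS_EVENTCALLBACK;
    if auto_convert {
        // AUTOCONVERTPCM asks the audio engine to convert between the
        // requested format and the endpoint format; SRC_DEFAULT_QUALITY
        // is a companion flag selecting a higher-quality rate converter.
        flags |= AUDCLNT_STREAMFLAGS_AUTOCONVERTPCM | AUDCLNT_STREAMFLAGS_SRC_DEFAULT_QUALITY;
    }
    flags
}

fn main() {
    println!("without conversion: {:#010x}", stream_flags(false));
    println!("with conversion:    {:#010x}", stream_flags(true));
}
```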

Suggested fix directions

  1. Gate AUTOCONVERTPCM behind an opt-in flag rather than always-on. The PR's stated goal (issue #593, "build_output_stream fails on Windows 10 if the specified sample rate does not match the output device's default sample rate") was to fix a build-time failure when users request non-native rates; AUTOCONVERTPCM is one valid solution, but for callers who request the device's native rate (or who use default_input_config() expecting a usable stream), the flag introduces silent-failure risk on Win11 24H2.
  2. Or: probe for silent streams during stream setup. A 100–500 ms post-Start check: if RMS stays at or below a noise-floor threshold over the first N buffers (the affected streams carry ±1 LSB noise rather than exact zeros), retry with AUTOCONVERTPCM off and use the device's actual hardware mix format from GetMixFormat on the endpoint's eMultimedia role (instead of eCommunications).
  3. Or: pick the endpoint role explicitly. IMMDeviceEnumerator::GetDefaultAudioEndpoint(eCapture, eMultimedia) returns a different audio session policy than eCommunications, even for the same physical device. cpal currently doesn't expose role selection; exposing it (or defaulting to eMultimedia for non-RT use cases) sidesteps the policy gate entirely.
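
The silent-stream probe in direction 2 could be sketched roughly as follows. This is a minimal illustration, not a proposed patch: the function names and the -70 dBFS threshold are assumptions, and a threshold is used instead of an exact-zero test because the affected captures contain ±1 LSB noise. The retry logic itself would live in the WASAPI backend and is not shown.

```rust
/// RMS level of a block of f32 samples in dBFS (0 dBFS = full scale 1.0).
/// Returns negative infinity for an empty or all-zero block.
fn rms_dbfs(samples: &[f32]) -> f64 {
    if samples.is_empty() {
        return f64::NEG_INFINITY;
    }
    let sum_sq: f64 = samples.iter().map(|&s| (s as f64) * (s as f64)).sum();
    let rms = (sum_sq / samples.len() as f64).sqrt();
    20.0 * rms.log10()
}

/// Hypothetical probe: true when at least one buffer was captured and
/// every buffer stays at or below the threshold.
fn looks_silent(buffers: &[Vec<f32>], threshold_dbfs: f64) -> bool {
    !buffers.is_empty() && buffers.iter().all(|b| rms_dbfs(b) <= threshold_dbfs)
}

fn main() {
    // Assumed threshold, well below speech and above 1-LSB noise.
    let threshold_dbfs = -70.0;

    // Synthetic "affected" buffer: mostly zeros with occasional 1-LSB blips.
    let lsb = 1.0f32 / 32768.0;
    let near_zero: Vec<f32> = (0..1600)
        .map(|n| if n % 7 == 0 { lsb } else { 0.0 })
        .collect();

    // Synthetic "healthy" buffer: a 220 Hz tone at 16 kHz, amplitude 0.05.
    let tone: Vec<f32> = (0..1600)
        .map(|n| 0.05 * (2.0 * std::f32::consts::PI * 220.0 * n as f32 / 16000.0).sin())
        .collect();

    println!("near-zero buffers silent? {}", looks_silent(&[near_zero], threshold_dbfs));
    println!("tone buffers silent?      {}", looks_silent(&[tone], threshold_dbfs));
}
```

A real probe would also need to tolerate a genuinely quiet room, so a positive result is better treated as a trigger for a retry than as proof of a broken stream.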

Happy to test patches against the affected hardware here. cc @yeah-its-gloria @roderickvd.

Downstream context

We're screenpipe, a Rust + Tauri app that records audio + accessibility text continuously. We started seeing user reports immediately after our auto-updater rolled cpal v0.17.2+ to Windows users. Diagnosis credit to one of our users (William Lucas) who built the DirectShow baseline + WASAPI hex dump to isolate the regression to the cpal capture layer.
