11 changes: 10 additions & 1 deletion src/dns.zig
@@ -286,10 +286,19 @@
.{ "getaddrinfo", .libc },
});

// `dns.lookup` is specified to behave like getaddrinfo(3), which
// consults nsswitch.conf / /etc/hosts. On Linux the default
// backend used to be `c_ares`, which produced results that did
// not match Node for names defined only in /etc/hosts — see
// https://github.com/oven-sh/bun/issues/29227.
//
// Only `dns.lookup` routes through this default. `dns.resolve*`
// (and all record-type queries) use c-ares directly, matching
// Node's behavior.
pub const default: GetAddrInfo.Backend = switch (bun.Environment.os) {
.mac, .windows => .system,
-    else => .c_ares,
+    else => .libc,
};

Check failure on line 301 in src/dns.zig

Claude / Claude Code Review

EAI_AGAIN mapped to ENOTIMP instead of ETIMEOUT via initEAI on new Linux libc default

Comment on lines 298 to 301

🔴 When getaddrinfo() returns EAI_AGAIN (temporary DNS failure — DNS server unreachable, empty /etc/resolv.conf, network partition), the new Linux .libc default backend maps it through initEAI() which has no AGAIN case, silently returning Error.ENOTIMP instead of Error.ETIMEOUT. Users in Docker/Kubernetes environments will see err.code='ENOTIMP' / 'DNS resolver does not implement requested operation' for what should be a transient, retryable timeout error. Fix: add .AGAIN => Error.ETIMEOUT to the Linux-specific switch in initEAI in src/deps/c_ares.zig (matching the Windows path at c_ares.zig:1773: UV_EAI_AGAIN => Error.ETIMEOUT).

Extended reasoning...

What the bug is

When getaddrinfo() returns EAI_AGAIN (-3, "Temporary failure in name resolution"), the new Linux default .libc backend maps it to the wrong error code. Users receive err.code = 'ENOTIMP' ("DNS resolver does not implement requested operation") instead of the semantically correct ETIMEOUT. This is a meaningful mismatch: ENOTIMP signals an unsupported operation and is non-retryable in most application code, whereas ETIMEOUT correctly signals a transient, retryable failure.

The specific code path that triggers it

The call chain for the new .libc backend on Linux when getaddrinfo() fails:

  1. LibC.run() (dns.zig ~L773): stores .err = @intFromEnum(err) where err = EAI_AGAIN = -3
  2. then() (dns.zig ~L831): .err branch calls getAddrInfoAsyncCallback(-3, null, this)
  3. getAddrInfoAsyncCallback (dns.zig:727): calls head.processGetAddrInfoNative(-3, null)
  4. processGetAddrInfoNative (dns.zig:1073): calls c_ares.Error.initEAI(-3)
  5. initEAI (c_ares.zig ~L1763): -3 maps to .AGAIN via @enumFromInt; no case matches; falls to else => bun.todo(@src(), Error.ENOTIMP)

In release builds, bun.todo() logs nothing and silently returns the value — no panic, no assert, no visible warning.

Why existing code doesn't prevent it

The initEAI function has three layers of case handling:

  • NODATA/NONAME early check: .AGAIN != .NODATA and != .NONAME — miss
  • Linux-specific switch: .SOCKTYPE, .IDN_ENCODE, .ALLDONE, .INPROGRESS, .CANCELED, .NOTCANCELED — miss (else => {} fallthrough)
  • General switch: 0, .ADDRFAMILY, .BADFLAGS, .FAIL, .FAMILY, .MEMORY, .SERVICE, .SYSTEM — miss (else => bun.todo return ENOTIMP)

The Windows path correctly handles this via libuv.UV_EAI_AGAIN => Error.ETIMEOUT but that branch is not taken on Linux.

The blast-radius increase from this PR

Before this PR: Linux defaulted to .c_ares, which routes DNS errors through processGetAddrInfo() — a completely different code path that never calls initEAI. Only users who explicitly passed { backend: 'libc' } or { backend: 'system' } hit this bug; they were a small minority.

After this PR: .libc is the new Linux default for all dns.lookup() calls. Every Linux user who hits a temporary DNS failure (empty /etc/resolv.conf, DNS server temporarily down, Docker/Kubernetes service not yet ready, network partition) now gets ENOTIMP instead of ETIMEOUT. Ironically, one of the issues this PR claims to fix (#19086 — ioredis EAI_AGAIN for Docker service hostnames) would now fail with a different wrong error code.

Step-by-step proof

Suppose a Docker container runs dns.lookup('my-service', cb) while the Docker DNS resolver is temporarily unavailable:

  1. std.c.getaddrinfo('my-service', ...) returns EAI_AGAIN (-3)
  2. LibC.run() stores this.* = .{ .err = -3 }
  3. then() calls getAddrInfoAsyncCallback(-3, null, this)
  4. processGetAddrInfoNative(-3, null) → c_ares.Error.initEAI(-3)
  5. @enumFromInt(-3) yields .AGAIN (glibc EAI_AGAIN=-3 per /usr/include/netdb.h)
  6. No switch case matches → bun.todo(@src(), Error.ENOTIMP) → returns Error.ENOTIMP
  7. Callback receives: err.code = 'DNS_ENOTIMP', message = 'DNS resolver does not implement requested operation'

Application code checking err.code === 'ENOTFOUND' or err.code === 'ETIMEOUT' for retry logic will not retry, causing hard failures for what should be a transient condition.
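The kind of application-side retry check described above might look like the following sketch. The helper names (`shouldRetry`, `withRetry`) and the set of retryable codes are hypothetical, chosen only to illustrate why an `ENOTIMP` code would bypass retry logic that an `ETIMEOUT` code would trigger.

```typescript
// Hypothetical application-side retry helper, for illustration only.
// The retryable set is an assumption based on the codes named in this review.
const RETRYABLE_DNS_CODES = new Set<string>(["ETIMEOUT", "ENOTFOUND"]);

function shouldRetry(err: { code?: string }): boolean {
  return err.code !== undefined && RETRYABLE_DNS_CODES.has(err.code);
}

// Runs `op` up to `attempts` times, rethrowing immediately on a
// non-retryable error and rethrowing the last error once attempts run out.
async function withRetry<T>(op: () => Promise<T>, attempts: number): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (!shouldRetry(err as { code?: string })) throw err;
    }
  }
  throw lastError;
}
```

With a check like this, a lookup that fails with `ETIMEOUT` is attempted again, while one that fails with `ENOTIMP` is rethrown on the first failure.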

How to fix

Add .AGAIN => Error.ETIMEOUT to the Linux-specific switch in initEAI in src/deps/c_ares.zig, matching what Windows does with UV_EAI_AGAIN => Error.ETIMEOUT.
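The fallthrough and the proposed fix can be modeled in a few lines. This is a simplified TypeScript sketch, not the Zig source: `modelInitEAI` is a hypothetical name, only the case names listed in this review are included, and `"HANDLED"` is a placeholder for whatever each existing case actually returns.

```typescript
// Simplified model of the three layers of case handling described above.
// Not the real implementation (which is Zig, in src/deps/c_ares.zig).
const EARLY_CHECK = new Set<string>(["NODATA", "NONAME"]);
const LINUX_SWITCH = new Set<string>([
  "SOCKTYPE", "IDN_ENCODE", "ALLDONE", "INPROGRESS", "CANCELED", "NOTCANCELED",
]);
const GENERAL_SWITCH = new Set<string>([
  "ADDRFAMILY", "BADFLAGS", "FAIL", "FAMILY", "MEMORY", "SERVICE", "SYSTEM",
]);

function modelInitEAI(name: string, withFix: boolean): string {
  if (EARLY_CHECK.has(name)) return "HANDLED";
  // Proposed fix: treat AGAIN as a transient timeout in the Linux-specific layer.
  if (withFix && name === "AGAIN") return "ETIMEOUT";
  if (LINUX_SWITCH.has(name)) return "HANDLED";
  if (GENERAL_SWITCH.has(name)) return "HANDLED";
  // Models the `else => bun.todo(@src(), Error.ENOTIMP)` fallthrough.
  return "ENOTIMP";
}

console.log(modelInitEAI("AGAIN", false)); // ENOTIMP
console.log(modelInitEAI("AGAIN", true)); // ETIMEOUT
```

In this model, `AGAIN` misses every layer until the extra case is added, which mirrors the behavior the review describes.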


pub const FromJSError = JSError || error{
Comment on lines 286 to 303

🟣 Pre-existing inconsistency between addrInfoCount and addrInfoToJSArray in src/dns.zig: addrInfoCount always starts count=1 for the first node regardless of whether its addr is null, but addrInfoToJSArray uses fromAddrInfo(this_node) orelse continue (skipping j++) for null-addr nodes. If getaddrinfo returns a first node with addr==null, the JS array is over-allocated by one, leaving a trailing undefined slot. This is a pre-existing bug; notably, the second refutation is factually correct that this PR's Linux path uses Result.toList → Any{.list} rather than addrInfoToJSArray, so the blast radius is NOT increased on Linux — the bug remains limited to the macOS LibInfo path.

Extended reasoning...

The bug: addrInfoCount (dns.zig ~L286) initialises count = 1 unconditionally for the head node, then only increments for subsequent nodes when addr != null. In contrast, addrInfoToJSArray (dns.zig ~L303) iterates over all nodes starting from the head and does fromAddrInfo(this_node) orelse continue, skipping j++ whenever addr == null. These two policies disagree on the head node.

Concrete proof: Suppose getaddrinfo returns a linked list where node[0].addr==null and node[1].addr!=null. addrInfoCount returns 2 (count starts at 1, then +1 for node[1].addr!=null). addrInfoToJSArray creates a 2-element array but only writes to index 0 (for node[1], since node[0] is skipped). Index 1 is never written, leaving an uninitialised/undefined JS array slot. In JS terms: the returned array has length 2 with array[1] === undefined as a hole.

Does this PR increase the blast radius on Linux? The second refutation is correct on this point. The Linux LibC path introduced by this PR routes through: LibC.lookup → background thread → std.c.getaddrinfo → GetAddrInfo.Result.toList(allocator, addrinfo) → stored as GetAddrInfo.Result.Any{ .list = result }. In then() at dns.zig:815, const any = GetAddrInfo.Result.Any{ .list = result }. When Any.toJS handles the .list variant it creates the array with list.items.len (the actual filled count from appendAssumeCapacity), NOT from addrInfoCount. So the Linux path is safe and the blast radius claim in the synthesis description is incorrect.

Where the bug actually lives: The .addrinfo variant of Any.toJS calls addrInfoToJSArray, which is only triggered on the macOS LibInfo path (onCompleteNative(this, .{ .addrinfo = result }) at dns.zig:1078). That path predates this PR. The inconsistency in addrInfoCount for toList is also harmless since initCapacity over-allocates by at most 1, but list.items.len is always correct.

Impact: In practice, POSIX getaddrinfo conformant implementations (glibc, musl, macOS libc) do not return nodes with ai_addr==null, so this bug is purely theoretical and unlikely to manifest. The trailing undefined slot could cause subtle JS-level issues (sparse array, length mismatch) on macOS if a non-conformant getaddrinfo were used.

Fix: In addrInfoCount, add a null check for the head node: var count: u32 = @intFromBool(addrinfo.addr != null); to match the behaviour of addrInfoToJSArray.
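The count/fill mismatch can be reproduced with a small model. This is a TypeScript sketch under stated assumptions, not the Zig code: `AddrNode`, `countCurrent`, `countFixed`, and `fill` are hypothetical names, and `addr: null` stands in for a node whose address pointer is null.

```typescript
// Simplified model of a getaddrinfo result list.
interface AddrNode {
  addr: string | null;
  next: AddrNode | null;
}

// Models the described current policy: the head node always counts as 1.
function countCurrent(head: AddrNode): number {
  let count = 1;
  for (let n = head.next; n !== null; n = n.next) {
    if (n.addr !== null) count++;
  }
  return count;
}

// Models the proposed policy: the head counts only when its addr is non-null.
function countFixed(head: AddrNode): number {
  let count = head.addr !== null ? 1 : 0;
  for (let n = head.next; n !== null; n = n.next) {
    if (n.addr !== null) count++;
  }
  return count;
}

// Models the fill loop: null-addr nodes are skipped without advancing the index.
function fill(head: AddrNode, length: number): (string | undefined)[] {
  const out = new Array<string | undefined>(length).fill(undefined);
  let j = 0;
  for (let n: AddrNode | null = head; n !== null; n = n.next) {
    if (n.addr === null) continue;
    out[j++] = n.addr;
  }
  return out;
}

const head: AddrNode = { addr: null, next: { addr: "127.0.0.1", next: null } };
console.log(fill(head, countCurrent(head))); // one filled slot plus one trailing undefined
console.log(fill(head, countFixed(head))); // one filled slot, no trailing slot
```

With the head check in place, the allocated length and the number of written entries agree for the null-head case.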

InvalidBackend,
82 changes: 82 additions & 0 deletions test/regression/issue/29227.test.ts
@@ -0,0 +1,82 @@
import { expect, test } from "bun:test";
import { bunEnv, bunExe, isLinux } from "harness";
import { appendFileSync, readFileSync, writeFileSync } from "node:fs";

// https://github.com/oven-sh/bun/issues/29227
//
// On Linux, `dns.lookup()` for a name that only has an IPv4 entry in
// /etc/hosts must return only IPv4, matching Node. Previously Bun's
// default backend returned an extra `::1` entry (and, because the
// default result order is `verbatim`, that `::1` became the single
// result returned by the callback form).
//
// This test requires Linux because it mutates /etc/hosts. The bug is
// Linux-specific — macOS uses LibInfo and Windows uses libuv, both
// already matching Node.
test.skipIf(!isLinux)("dns.lookup respects /etc/hosts and matches Node", async () => {
  // Use a random tag so re-runs don't conflict. The tag is long enough
  // that it's extremely unlikely to collide with anything on the host.
  const tag = "bun-issue-29227-" + Math.random().toString(36).slice(2, 10);
  const hostsEntry = `\n127.0.0.1 ${tag}\n`;

  // /etc/hosts is a system file; snapshot-then-restore so a crashed
  // test can't leave the system in a bad state.
  let original: string;
  try {
    original = readFileSync("/etc/hosts", "utf8");
  } catch {
    // Not writable / not root — skip. CI Linux is root in the container.
    return;
  }

  try {
    appendFileSync("/etc/hosts", hostsEntry);
  } catch {
    return;
  }

  try {
    await using proc = Bun.spawn({
      cmd: [
        bunExe(),
        "-e",
        `
          const dns = require("node:dns");
          const name = ${JSON.stringify(tag)};
          dns.lookup(name, { all: true }, (err, results) => {
            if (err) { console.error("ERR:" + err.code); process.exit(1); }
            console.log(JSON.stringify(results));
          });
          dns.lookup(name, (err, address, family) => {
            if (err) { console.error("ERR:" + err.code); process.exit(1); }
            console.log("single:" + address + ":" + family);
          });
        `,
      ],
      env: bunEnv,
      stdout: "pipe",
      stderr: "pipe",
    });

    const [stdout, stderr, exitCode] = await Promise.all([proc.stdout.text(), proc.stderr.text(), proc.exited]);

    // Filter out the ASAN warning that debug builds print to stderr.
    const realStderr = stderr
      .split("\n")
      .filter(l => l && !l.includes("ASAN"))
      .join("\n");
    expect(realStderr).toBe("");
    expect(exitCode).toBe(0);

Comment on lines +67 to +74

⚠️ Potential issue | 🟠 Major

Avoid empty-stderr assertion here; assert behavior first, then exitCode last.

expect(realStderr).toBe("") is brittle for regression subprocess tests, and exitCode is currently asserted before stdout-behavior assertions.

Suggested patch
-    // Filter out the ASAN warning that debug builds print to stderr.
-    const realStderr = stderr
-      .split("\n")
-      .filter(l => l && !l.includes("ASAN"))
-      .join("\n");
-    expect(realStderr).toBe("");
-    expect(exitCode).toBe(0);
+    // Keep stderr available for debugging but avoid empty-stderr assertions.
+    const realStderr = stderr
+      .split("\n")
+      .filter(l => l && !l.includes("ASAN"))
+      .join("\n");

     // Find the `all` array and the single-form line in the output.
     const lines = stdout.trim().split("\n");
     const allLine = lines.find(l => l.startsWith("["))!;
     const singleLine = lines.find(l => l.startsWith("single:"))!;

     expect(JSON.parse(allLine)).toEqual([{ address: "127.0.0.1", family: 4 }]);
     expect(singleLine).toBe("single:127.0.0.1:4");
+    expect(exitCode).toBe(0);

Based on learnings: in test/regression/issue/*.test.ts, avoid asserting spawned Bun subprocess stderr is empty; use exitCode as the primary failure signal.
As per coding guidelines: “Expect stdout assertions before exit code assertions … BEFORE expect(exitCode).toBe(0).”

Also applies to: 76-81

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/regression/issue/29227.test.ts` around lines 67 - 74, Remove the brittle
assertion that stderr is exactly empty and instead make exitCode the primary
failure signal; delete or stop asserting expect(realStderr).toBe("") (the
variables involved are stderr and realStderr) and ensure any stdout/behavior
assertions occur before the final expect(exitCode).toBe(0) assertion so exitCode
is checked last; apply the same change to the similar block around the other
occurrence (lines referencing realStderr/stderr/exitCode).

    // Find the `all` array and the single-form line in the output.
    const lines = stdout.trim().split("\n");
    const allLine = lines.find(l => l.startsWith("["))!;
    const singleLine = lines.find(l => l.startsWith("single:"))!;
Comment on lines +72 to +74

🟡 The new regression test asserts exitCode at line 73 before the stdout content assertions at lines 80–81, violating the CLAUDE.md convention that stdout assertions must come first. Move the expect(exitCode).toBe(0) call to after the expect(JSON.parse(allLine)) and expect(singleLine) assertions so that a subprocess failure surfaces the actual output rather than an opaque exit-code mismatch.

Extended reasoning...

What the bug is

CLAUDE.md (line 128) states: "When spawning processes, tests should expect(stdout).toBe(...) BEFORE expect(exitCode).toBe(0). This gives you a more useful error message on test failure." The new test in test/regression/issue/29227.test.ts violates this convention.

The specific code path

After the subprocess is awaited, the test makes the following assertions in this order:

  1. Line 72: expect(realStderr).toBe('')
  2. Line 73: expect(exitCode).toBe(0) ← exit code checked here
  3. Lines 77–79: stdout is parsed into allLine / singleLine
  4. Line 80: expect(JSON.parse(allLine)).toEqual([...])
  5. Line 81: expect(singleLine).toBe('single:127.0.0.1:4')

The stdout content assertions (lines 80–81) come after the exitCode assertion (line 73).

Why existing code doesn't prevent it

There is no lint rule or type check enforcing assertion order; it is a documentation-only convention. The code compiles and runs fine; the wrong order only degrades the diagnostics shown on failure.

Impact

If the child process exits with code 1 (e.g., because libc getaddrinfo is unavailable, returns an error, or produces unexpected results), the test fails at line 73 with:

Expected: 0
Received: 1

This tells the developer nothing about why it failed. If the stdout assertions were first, the failure would show the actual JSON output printed by the subprocess, immediately revealing the root cause (wrong address, extra ::1 entry, DNS error code, etc.).

Concrete proof

Suppose the subprocess prints ERR:ENOTFOUND to stderr and exits with code 1:

  • Current order → test fails at expect(exitCode).toBe(0) → error: "Expected: 0, Received: 1" — no useful info.
  • Correct order → test fails at expect(JSON.parse(allLine)).toEqual([...]) → error shows the actual stdout, making the bug obvious.

How to fix

Move expect(exitCode).toBe(0) to after the stdout assertions:

expect(realStderr).toBe('');
// stdout assertions first (CLAUDE.md §128)
expect(JSON.parse(allLine)).toEqual([{ address: '127.0.0.1', family: 4 }]);
expect(singleLine).toBe('single:127.0.0.1:4');
expect(exitCode).toBe(0);
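One way to make the stdout checks easier to place ahead of the exit-code check is to pull the parsing into a small pure helper. This is a sketch only: `parseLookupOutput` and `LookupOutput` are hypothetical names and are not part of the PR. It assumes the child prints one JSON array line and one `single:` line, as the test script above does.

```typescript
// Hypothetical helper, not part of the PR: extracts the two lines the
// test looks for from the child process's stdout.
interface LookupOutput {
  all: unknown; // parsed JSON from the `{ all: true }` call, or undefined if absent
  single: string | undefined; // the "single:<address>:<family>" line, if present
}

function parseLookupOutput(stdout: string): LookupOutput {
  const lines = stdout.trim().split("\n");
  const allLine = lines.find(l => l.startsWith("["));
  const single = lines.find(l => l.startsWith("single:"));
  return {
    all: allLine === undefined ? undefined : JSON.parse(allLine),
    single,
  };
}
```

Because missing lines come back as `undefined` instead of being forced through a non-null assertion, a failed lookup would show up as a plain value mismatch in the stdout assertions before the exit code is checked.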


    expect(JSON.parse(allLine)).toEqual([{ address: "127.0.0.1", family: 4 }]);
    expect(singleLine).toBe("single:127.0.0.1:4");
  } finally {
    // Always restore /etc/hosts, even if assertions fail.
    writeFileSync("/etc/hosts", original);
  }
});