box-dind: verify host-image passthrough seeded the nested daemon and stop falsely reporting 'complete' (issue #106) by konard · Pull Request #107 · link-foundation/box

konard · 2026-06-19T10:31:35Z

Problem (issue #106)

box-dind's nested Docker daemon kept re-downloading multi-GB host images (~30 GB, ~1 hour) on the first nested docker run even with DIND_HOST_PASSTHROUGH_IMAGES set, while the entrypoint still printed "image preload/passthrough complete". A misconfigured deployment (forgotten host-socket mount, host missing that exact ref, or the mode=public filter dropping a locally-built/private image) therefore looked perfectly healthy right up until the slow re-pull. This is the recurring production symptom behind already-closed issues #94 and #102 (still observed downstream in link-assistant/hive-mind#1914/#1946).

The issue asked for three code-level fixes:

1. Fail loudly, not silently: never print "complete" when nothing was copied.
2. Verify the copy: after passthrough, assert the allowlisted image is actually present in the nested daemon.
3. Document the required deployment wiring (socket mount + allowlist).

Fix

Verification is the robust catch-all and subsumes fix #1: every silent no-op path (no socket, unreachable socket, host lacks the ref, mode filter drop, or a save|load failure) leaves a named allowlist image missing, which now produces a loud warning and an honest terminal marker.

verify_passthrough_images() (new): after preload_tarballs/passthrough_host_images/preload_images, for every concrete DIND_HOST_PASSTHROUGH_IMAGES entry it runs docker image inspect against the nested daemon. Missing images produce a loud, actionable warning that distinguishes the two failure modes (host socket reachable but lacks the ref / dropped by mode=public, vs. no usable socket mounted) and names the remediation.
ref_is_concrete() (new): only entries with an explicit tag or @sha256: digest are verified. Bare repos (konard/hive-mind) and globs (konard/hive-mind*) are skipped to avoid false alarms (the host may hold them under any tag). A : is treated as a tag separator only in the last path segment, so a registry port like host:5000/repo is correctly non-concrete.
The terminal line is now honest: "image preload/passthrough complete" only when every concrete allowlisted image is actually present; otherwise "image preload/passthrough finished WITH WARNINGS".
No auto-pull fallback: that would mask the config error and incur the same multi-GB download. The re-pull still happens naturally; the loud warning is the explicit signal.
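
The classification rule described for ref_is_concrete() can be illustrated with a minimal bash sketch. This is not the PR's actual implementation: the exact set of glob characters and the handling of edge cases are assumptions, and only the rules stated above (explicit tag or @sha256: digest is concrete; bare repos, globs, and registry ports are not) are taken from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the rule described above; not the code shipped in the PR.

ref_is_concrete() {
  local ref="$1"

  # Assumption: any glob metacharacter marks a pattern, which is never verified
  # because the host may hold matching images under any tag.
  case "$ref" in
    *'*'*|*'?'*|*'['*) return 1 ;;
  esac

  # An explicit digest pins the image exactly.
  case "$ref" in
    *@sha256:*) return 0 ;;
  esac

  # A ':' counts as a tag separator only in the last path segment, so the ':'
  # in 'host:5000/repo' (a registry port) does not make the ref concrete.
  local last="${ref##*/}"
  case "$last" in
    *:?*) return 0 ;;
  esac

  return 1
}

# Small demonstration over the example refs mentioned in the description.
for r in 'konard/hive-mind' 'konard/hive-mind*' 'host:5000/repo' 'host:5000/repo:1.2'; do
  if ref_is_concrete "$r"; then
    echo "$r: concrete (would be verified)"
  else
    echo "$r: skipped"
  fi
done
```

Checking only the last path segment keeps the rule simple: everything before the final "/" may contain a registry host and port, so a colon there says nothing about the tag.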

How to reproduce / verify

experiments/preload-unit-test.sh (source-only mode): new cases — concrete image present → honest "complete", no warning; concrete absent → loud "did NOT seed" warning + "finished WITH WARNINGS", no "complete"; glob/bare entries never false-alarm; ref_is_concrete classification. 62 passed, 0 failed.
tests/dind/example-preload-images.sh (runs in CI on the js variant, --privileged): new verify_ok case (concrete allowlist + socket mounted, DIND_HOST_PASSTHROUGH=all → image present, "complete", no "did NOT seed") and verify_miss case (concrete allowlist, no socket → "did NOT seed expected image(s)…", "finished WITH WARNINGS", "complete" absent).

Docs

docs/dind/USAGE.md: new "Verifying the copy actually happened (issue #106)" subsection + updated DIND_HOST_PASSTHROUGH_IMAGES env-var row.
README.md: verification clause on the host-image-reuse note (pin a concrete tag/digest; loud warning instead of false "complete").

Versioning

Added .changeset/issue-106-passthrough-verify.md (bump: patch). VERSION is bumped automatically by CI on merge.

Fixes #106

Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #106

…p falsely reporting 'complete' (issue #106)

box-dind kept re-downloading multi-GB host images on first nested 'docker run' even with DIND_HOST_PASSTHROUGH_IMAGES set, while the entrypoint still printed 'image preload/passthrough complete', so a misconfigured deployment (forgotten host docker socket mount, host missing that exact ref, or the mode=public filter dropping a locally-built/private image) looked healthy right up until the slow re-pull (the recurring symptom behind closed issues #94 and #102).

The entrypoint now verifies the copy after passthrough: for every concrete allowlist entry (explicit tag or @sha256: digest; bare repos and globs are skipped to avoid false alarms) it runs 'docker image inspect' against the nested daemon. A missing expected image now produces a loud, actionable warning (host socket reachable but lacks the ref / filtered by mode, or no usable socket mounted) and the terminal line becomes 'image preload/passthrough finished WITH WARNINGS' instead of the misleading 'complete'. No silent no-op path can report success anymore.

We do not auto-pull as a fallback: that would mask the config error and incur the same multi-GB download; the re-pull still happens naturally.

Adds ref_is_concrete() + verify_passthrough_images() helpers, new unit cases in experiments/preload-unit-test.sh (62 passing), verify_ok/verify_miss assertions in the CI-run tests/dind/example-preload-images.sh, and docs in docs/dind/USAGE.md and README.md.

konard · 2026-06-19T10:44:52Z

Working session summary

Done. Issue #106 is resolved on branch issue-106-81bf2d7bb66f and PR #107 is updated and marked ready for review.

Summary

Root cause: box-dind's entrypoint printed "image preload/passthrough complete" unconditionally, even when the passthrough seeded nothing (missing socket mount, host lacking the exact ref, or the mode=public filter dropping a private/locally-built image). The deployment looked healthy until the first nested docker run triggered a multi-GB re-pull (~30 GB, ~1 hour).

Fix (ubuntu/24.04/dind/dind-entrypoint.sh):

verify_passthrough_images() — after all preload paths run, docker image inspect confirms every concrete DIND_HOST_PASSTHROUGH_IMAGES entry actually landed in the nested daemon. Missing images get a loud, actionable warning distinguishing the failure modes (socket reachable but ref absent/filtered vs. no socket mounted).
ref_is_concrete() — only verifies entries with an explicit tag/digest; skips bare repos and globs (and correctly treats host:5000/repo ports as non-concrete) to avoid false alarms.
The terminal line is now honest: "complete" only when everything expected is present, otherwise "finished WITH WARNINGS". No auto-pull fallback (that would mask the config error and incur the same download).
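
The verification flow summarized above could look roughly like the following bash sketch. It is an illustration under stated assumptions, not the code in dind-entrypoint.sh: the allowlist separator (comma or space), the HOST_DOCKER_SOCK variable and its default path, the exact warning wording beyond the quoted fragments, and the non-zero return value are all hypothetical. `docker image inspect` is the real CLI call named in the PR; it exits non-zero when the image is absent.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names not quoted in the PR text are assumptions.

# Simplified concreteness check: digest, or a tag in the last path segment;
# glob patterns are excluded.
ref_is_concrete() {
  case "$1" in *'*'*|*'?'*|*'['*) return 1 ;; *@sha256:*) return 0 ;; esac
  case "${1##*/}" in *:?*) return 0 ;; esac
  return 1
}

verify_passthrough_images() {
  local list="${DIND_HOST_PASSTHROUGH_IMAGES:-}"
  # Hypothetical variable and path for the mounted host Docker socket.
  local host_sock="${HOST_DOCKER_SOCK:-/var/run/host-docker.sock}"
  local -a entries=() missing=()
  local ref

  # Assumption: the allowlist is comma- or space-separated.
  # 'read -a' splits on IFS without glob-expanding entries like 'repo*'.
  IFS=', ' read -r -a entries <<< "$list"

  for ref in "${entries[@]}"; do
    [ -n "$ref" ] || continue
    ref_is_concrete "$ref" || continue          # bare repos and globs are skipped
    docker image inspect "$ref" >/dev/null 2>&1 || missing+=("$ref")
  done

  if [ "${#missing[@]}" -eq 0 ]; then
    echo "[dind-entrypoint] image preload/passthrough complete"
    return 0
  fi

  echo "[dind-entrypoint] WARNING: passthrough did NOT seed expected image(s): ${missing[*]}" >&2
  if [ -S "$host_sock" ]; then
    echo "[dind-entrypoint]   host socket is mounted: the host may lack these exact refs, or mode=public filtered them out" >&2
  else
    echo "[dind-entrypoint]   no usable host socket at $host_sock: mount the host Docker socket into the container" >&2
  fi
  echo "[dind-entrypoint] image preload/passthrough finished WITH WARNINGS"
  return 1
}

# Usage (after all preload paths have run):
#   DIND_HOST_PASSTHROUGH_IMAGES='konard/hive-mind:latest' verify_passthrough_images
```

Note that the sketch deliberately never calls `docker pull`: as the summary says, an automatic fallback would hide the misconfiguration behind the same multi-GB download.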

Verification:

experiments/preload-unit-test.sh — 62 passed, 0 failed (new present/absent/glob/classification cases).
tests/dind/example-preload-images.sh — new verify_ok/verify_miss CI assertions.

Docs: docs/dind/USAGE.md + README.md updated; .changeset/issue-106-passthrough-verify.md (bump: patch) added (CI bumps VERSION on merge).

PR: #107

This summary was automatically extracted from the AI working session output.

konard · 2026-06-19T10:45:05Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $4.530429
Calculated by Anthropic: $5.270439
Difference: $0.740010 (+16.33%)

📊 Context and tokens usage:

Claude Opus 4.8: (2 sub-sessions)

116.3K / 1M (12%) input tokens, 35.6K / 128K (28%) output tokens
38.0K / 1M (4%) input tokens, 6.0K / 128K (5%) output tokens

Total: (12.2K new + 197.3K cache writes + 4.1M cache reads) input tokens, 47.9K output tokens, $4.530429 cost

🤖 Models used:

Tool: Anthropic Claude Code
Requested: opus
Model: Claude Opus 4.8 (claude-opus-4-8)

📎 Log file uploaded as Gist (2034KB)

The working session has now ended; feel free to review and add any feedback on the solution draft.

konard · 2026-06-19T11:40:35Z

🔄 Auto-restart triggered (iteration 1)

Reason: CI failures detected

Starting new session to address the issues.

Auto-restart-until-mergeable mode is active. This run will stop after 5 restart iterations.

…ess (fix flaky dind-js)

The CI dind-js example suite failed in example-storage-driver-vfs.sh: 'expected the vfs warning to mention "'vfs' storage driver", but it was absent from the container logs'.

The vfs copy-on-write warning (issue #104) is emitted by the entrypoint (PID 1) *after* the inner dockerd becomes ready, in warn_if_vfs_storage_driver. The test checked the logs once, the instant wait_for_inner_docker returned, and the test's external 'docker exec docker info' readiness probe can win the race against the entrypoint's own readiness loop, grabbing the logs before the warning is flushed. The dumped logs at the failure showed only '[dind-entrypoint] Starting dockerd (storage-driver=vfs)', confirming the warning simply had not been printed yet.

Add a bounded wait_for_logs CONTAINER NEEDLE [LIMIT] helper to lib.sh (poll logs_contain until the line appears or the timeout elapses) and use it for the three vfs-warning needles, mirroring the existing wait_for_preload_complete pattern in example-preload-images.sh. No production code change.
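
The bounded polling pattern described in this commit message can be sketched as follows. This is a guess at the shape of the helper, not the contents of lib.sh: the logs_contain stand-in, the interpretation of LIMIT as seconds, the default of 30, and the one-second poll interval are all assumptions. `docker logs CONTAINER` is the real CLI call.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a bounded log wait; the real helpers in lib.sh may differ.

# Stand-in for the existing logs_contain helper: succeeds when the container's
# combined stdout/stderr log contains the literal needle.
logs_contain() {
  local logs
  logs="$(docker logs "$1" 2>&1)"
  case "$logs" in
    *"$2"*) return 0 ;;
  esac
  return 1
}

# wait_for_logs CONTAINER NEEDLE [LIMIT]
# Polls once per second until NEEDLE appears or LIMIT seconds have elapsed
# (assumed unit; assumed default of 30).
wait_for_logs() {
  local container="$1" needle="$2" limit="${3:-30}" waited=0
  while [ "$waited" -lt "$limit" ]; do
    logs_contain "$container" "$needle" && return 0
    sleep 1
    waited=$((waited + 1))
  done
  # One final check so a line flushed during the last sleep is not missed.
  logs_contain "$container" "$needle"
}

# Usage in a test, replacing a single immediate log check:
#   wait_for_logs "$container" "'vfs' storage driver" 60 || fail "vfs warning missing"
```

Polling removes the race without slowing the passing case: the helper returns as soon as the line appears and only spends the full limit when the warning is genuinely absent.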

konard · 2026-06-19T12:19:07Z

Working session summary

I'll wait for the background monitors to notify me of completion. The fix is in place and verified for the originally-failing dind-js job; the remaining long-running jobs just need time to finish.

This summary was automatically extracted from the AI working session output.

konard · 2026-06-19T12:19:18Z

🔄 Auto-restart-until-mergeable Log (iteration 1)

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

Public pricing estimate: $2.003767
Calculated by Anthropic: $7.540569
Difference: $5.536802 (+276.32%)

📊 Context and tokens usage:

83.7K / 1M (8%) input tokens, 15.4K / 128K (12%) output tokens

Total: (3.8K new + 71.0K cache writes + 2.3M cache reads) input tokens, 15.4K output tokens, $2.003767 cost

🤖 Models used:

Tool: Anthropic Claude Code
Requested: opus
Model: Claude Opus 4.8 (claude-opus-4-8)

📎 Log file uploaded as Gist (4384KB)

The working session has now ended; feel free to review and add any feedback on the solution draft.

konard · 2026-06-19T12:36:38Z

✅ Ready to merge

This pull request is now ready to be merged:

All CI checks have passed
No merge conflicts
No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

This reverts commit 1b565d8.

…-essential files (issue #108)

Commit eaeed07 (in PR #107) reverted a single task-details file, yet the pull_request build ran the full image build/test matrix, including a 32m48s dind-full job and a 32m21s full job, for a change that cannot affect any Docker image (run 27826090748).

The root cause: the detect-changes job in release.yml diffed the WHOLE pull request against its base SHA (`git diff --name-only "$BASE_SHA" HEAD`, BASE_SHA = github.event.pull_request.base.sha), so a trivial synchronize commit re-ran everything whenever any EARLIER commit on the same PR had touched image source.

Fix: extract the four inline detect-changes steps into a single, unit-tested scripts/ci/detect-changes.sh that, for pull_request events, diffs ONLY the PR head's latest commit (HEAD^2^..HEAD^2 against GitHub's synthetic merge commit) instead of the whole-PR diff. A trailing .gitkeep / docs / Markdown / changeset commit therefore skips the ~33-minute matrix; earlier commits were already tested when pushed. push events still evaluate the full pushed range (github.event.before..HEAD) so release builds are never skipped. Checkout now uses fetch-depth: 0 so the merge commit's parents resolve. docs/dind/ no longer triggers the dind matrix (docs are not a build input).

This mirrors the per-commit detect-code-changes scripts shipped by the link-foundation js/rust/python/csharp pipeline templates; they were already correct, so nothing to report upstream.

Job-level `if: needs.detect-changes.outputs.should-build == 'true'` gating is unchanged, so skipped jobs report success and the PR stays mergeable (a whole-workflow `on: paths` skip would have left required checks Pending).

Covered by experiments/test-issue108-detect-changes.sh (26 assertions: pure classification truth table, real synthetic-merge-commit git ranges, and a release.yml wiring regression guard). Full analysis, timeline, requirement matrix, root causes and template comparison in docs/case-studies/issue-108/.

Closes #108
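
The per-commit classification described in this commit message might be sketched as below. It is not the content of scripts/ci/detect-changes.sh: the function names and the exact path patterns are assumptions inferred from the ".gitkeep / docs / Markdown / changeset" wording, and the real script's truth table may be finer-grained.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the classification idea; not the real detect-changes.sh.

# Succeeds when a changed path cannot affect any image build.
# Assumption: these four patterns approximate the "non-essential" set.
is_non_essential() {
  case "$1" in
    .gitkeep|*/.gitkeep) return 0 ;;
    docs/*)              return 0 ;;
    *.md)                return 0 ;;
    .changeset/*)        return 0 ;;
  esac
  return 1
}

# Reads changed file names on stdin and prints 'true' if any of them is a
# potential build input, otherwise 'false'.
should_build() {
  local f
  while IFS= read -r f; do
    [ -n "$f" ] || continue
    if ! is_non_essential "$f"; then
      echo true
      return 0
    fi
  done
  echo false
}

# Usage on a pull_request run, where HEAD is GitHub's synthetic merge commit
# and HEAD^2 is the PR head (requires full history, i.e. fetch-depth: 0):
#   git diff --name-only 'HEAD^2^' 'HEAD^2' | should_build
```

Keeping the classifier a pure function of file names is what allows a truth-table style unit test that needs no git repository.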

Initial commit with task details

1b565d8

konard self-assigned this Jun 19, 2026

konard changed the title ~~[WIP] box-dind: nested daemon still re-downloads host images (~30 GB) despite passthrough — continuation of #94 / #102~~ box-dind: verify host-image passthrough seeded the nested daemon and stop falsely reporting 'complete' (issue #106) Jun 19, 2026

konard marked this pull request as ready for review June 19, 2026 10:44

konard mentioned this pull request Jun 19, 2026

Cost calculation fix link-assistant/hive-mind#1951

Closed

Revert "Initial commit with task details"

eaeed07

konard merged commit 765ea8a into main Jun 19, 2026
23 checks passed

This was referenced Jun 19, 2026

ci: skip build/test matrix when a PR's latest commit only touches non-essential files (issue #108) #109

Merged

fix(telegram): surface docker-isolation session id + isolation immediately (#1946) link-assistant/hive-mind#1948

Merged

konard merged 4 commits into main from issue-106-81bf2d7bb66f