box-dind: verify host-image passthrough seeded the nested daemon and stop falsely reporting 'complete' (issue #106) #107
Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #106
…p falsely reporting 'complete' (issue #106) box-dind kept re-downloading multi-GB host images on first nested 'docker run' even with DIND_HOST_PASSTHROUGH_IMAGES set, while the entrypoint still printed 'image preload/passthrough complete' — so a misconfigured deployment (forgotten host docker socket mount, host missing that exact ref, or the mode=public filter dropping a locally-built/private image) looked healthy right up until the slow re-pull (the recurring symptom behind closed issues #94 and #102). The entrypoint now verifies the copy after passthrough: for every concrete allowlist entry (explicit tag or @sha256: digest — bare repos and globs are skipped to avoid false alarms) it runs 'docker image inspect' against the nested daemon. A missing expected image now produces a loud, actionable warning (host socket reachable but lacks the ref / filtered by mode, or no usable socket mounted) and the terminal line becomes 'image preload/passthrough finished WITH WARNINGS' instead of the misleading 'complete'. No silent no-op path can report success anymore. We do not auto-pull as a fallback — that would mask the config error and incur the same multi-GB download; the re-pull still happens naturally. Adds ref_is_concrete() + verify_passthrough_images() helpers, new unit cases in experiments/preload-unit-test.sh (62 passing), verify_ok/verify_miss assertions in the CI-run tests/dind/example-preload-images.sh, and docs in docs/dind/USAGE.md and README.md.
Working session summary
Done. Issue #106 is resolved on branch
Summary
Root cause: box-dind's entrypoint printed
Fix (
Verification:
Docs:
PR: #107
This summary was automatically extracted from the AI working session output.
🤖 Solution Draft Log
This log file contains the complete execution trace of the AI solution draft process.
💰 Cost estimation:
📊 Context and tokens usage:
Claude Opus 4.8: (2 sub-sessions)
Total: (12.2K new + 197.3K cache writes + 4.1M cache reads) input tokens, 47.9K output tokens, $4.530429 cost
🤖 Models used:
📎 Log file uploaded as Gist (2034KB)
The working session has now ended; feel free to review and add any feedback on the solution draft.
🔄 Auto-restart triggered (iteration 1)
Reason: CI failures detected
Starting new session to address the issues.
Auto-restart-until-mergeable mode is active. This run will stop after 5 restart iterations.
…ess (fix flaky dind-js) The CI dind-js example suite failed in example-storage-driver-vfs.sh: 'expected the vfs warning to mention "'vfs' storage driver", but it was absent from the container logs'. The vfs copy-on-write warning (issue #104) is emitted by the entrypoint (PID 1) *after* the inner dockerd becomes ready, in warn_if_vfs_storage_driver. The test checked the logs once, the instant wait_for_inner_docker returned — and the test's external 'docker exec docker info' readiness probe can win the race against the entrypoint's own readiness loop, grabbing the logs before the warning is flushed. The dumped logs at the failure showed only '[dind-entrypoint] Starting dockerd (storage-driver=vfs)', confirming the warning simply had not been printed yet. Add a bounded wait_for_logs CONTAINER NEEDLE [LIMIT] helper to lib.sh (poll logs_contain until the line appears or the timeout elapses) and use it for the three vfs-warning needles, mirroring the existing wait_for_preload_complete pattern in example-preload-images.sh. No production code change.
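The bounded polling described in this commit can be sketched roughly as below. This is a minimal illustration, not the project's `lib.sh`: the bodies of `logs_contain` and `wait_for_logs` are guesses based only on the commit message, the 30-second default limit is an assumption, and `LOGS_CMD` is a hypothetical hook added here so the sketch can be exercised without Docker.

```shell
# Hypothetical sketch of a bounded log wait; not the real lib.sh implementation.
# LOGS_CMD is an assumption introduced for illustration (defaults to `docker logs`).
LOGS_CMD=${LOGS_CMD:-docker logs}

# logs_contain CONTAINER NEEDLE
# Succeeds when the container's combined stdout/stderr contains NEEDLE as a fixed string.
logs_contain() {
  $LOGS_CMD "$1" 2>&1 | grep -Fq -- "$2"
}

# wait_for_logs CONTAINER NEEDLE [LIMIT]
# Polls once per second until NEEDLE appears or LIMIT seconds have elapsed.
# The default of 30 seconds is an assumed value.
wait_for_logs() {
  local container="$1"
  local needle="$2"
  local limit="${3:-30}"
  local waited=0
  until logs_contain "$container" "$needle"; do
    if [ "$waited" -ge "$limit" ]; then
      return 1   # timed out: the line never showed up
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

Polling instead of checking once removes the dependence on which readiness probe wins the race: the test only fails if the warning is still absent after the whole limit has elapsed.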
Working session summary
I'll wait for the background monitors to notify me of completion. The fix is in place and verified for the originally-failing
This summary was automatically extracted from the AI working session output.
🔄 Auto-restart-until-mergeable Log (iteration 1)
This log file contains the complete execution trace of the AI solution draft process.
💰 Cost estimation:
📊 Context and tokens usage:
Total: (3.8K new + 71.0K cache writes + 2.3M cache reads) input tokens, 15.4K output tokens, $2.003767 cost
🤖 Models used:
📎 Log file uploaded as Gist (4384KB)
The working session has now ended; feel free to review and add any feedback on the solution draft.
✅ Ready to merge
This pull request is now ready to be merged:
Monitored by hive-mind with --auto-restart-until-mergeable flag
This reverts commit 1b565d8.
…-essential files (issue #108) Commit eaeed07 (in PR #107) reverted a single task-details file yet the pull_request build ran the full image build/test matrix — a 32m48s dind-full job and a 32m21s full job included — for a change that cannot affect any Docker image (run 27826090748). The root cause: the detect-changes job in release.yml diffed the WHOLE pull request against its base SHA (`git diff --name-only "$BASE_SHA" HEAD`, BASE_SHA = github.event.pull_request.base.sha), so a trivial synchronize commit re-ran everything whenever any EARLIER commit on the same PR had touched image source. Fix: extract the four inline detect-changes steps into a single, unit-tested scripts/ci/detect-changes.sh that, for pull_request events, diffs ONLY the PR head's latest commit (HEAD^2^..HEAD^2 against GitHub's synthetic merge commit) instead of the whole-PR diff. A trailing .gitkeep / docs / Markdown / changeset commit therefore skips the ~33-minute matrix — earlier commits were already tested when pushed. push events still evaluate the full pushed range (github.event.before..HEAD) so release builds are never skipped. Checkout now uses fetch-depth: 0 so the merge commit's parents resolve. docs/dind/ no longer triggers the dind matrix (docs are not a build input). This mirrors the per-commit detect-code-changes scripts shipped by the link-foundation js/rust/python/csharp pipeline templates — they were already correct, so nothing to report upstream. Job-level `if: needs.detect-changes.outputs.should-build == 'true'` gating is unchanged, so skipped jobs report success and the PR stays mergeable (a whole-workflow `on: paths` skip would have left required checks Pending). Covered by experiments/test-issue108-detect-changes.sh (26 assertions: pure classification truth table, real synthetic-merge-commit git ranges, and a release.yml wiring regression guard). Full analysis, timeline, requirement matrix, root causes and template comparison in docs/case-studies/issue-108/. Closes #108
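The per-commit gating described in this commit can be sketched as follows. This is a hedged illustration, not the contents of `scripts/ci/detect-changes.sh`: the function names `is_build_input`, `should_build`, and `changed_in_latest_pr_commit` are hypothetical, and the skip patterns are inferred only from the commit message (".gitkeep / docs / Markdown / changeset"), so the real classification may differ.

```shell
# Hypothetical sketch of per-commit change detection; names and patterns are assumptions.

# is_build_input PATH
# Returns 0 when PATH could affect an image build, 1 for files assumed to be non-essential.
is_build_input() {
  case "$1" in
    .gitkeep|*/.gitkeep|docs/*|*.md|.changeset/*) return 1 ;;
    *) return 0 ;;
  esac
}

# should_build
# Reads changed paths from stdin (one per line) and prints "true" as soon as
# one of them is a build input, otherwise "false".
should_build() {
  local path
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    if is_build_input "$path"; then
      echo true
      return 0
    fi
  done
  echo false
}

# changed_in_latest_pr_commit
# On a pull_request checkout, HEAD is GitHub's synthetic merge commit and HEAD^2 is
# the PR head, so this range lists only the files touched by the PR's latest commit.
# Requires enough history for the merge commit's parents to resolve.
changed_in_latest_pr_commit() {
  git diff --name-only 'HEAD^2^..HEAD^2'
}
```

In a workflow step the two pieces would be combined as `changed_in_latest_pr_commit | should_build`, with the printed value feeding the `should-build` job output that the existing job-level `if:` conditions read.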
Problem (issue #106)
box-dind's nested Docker daemon kept re-downloading multi-GB host images (~30 GB, ~1 hour) on the first nested `docker run` even with `DIND_HOST_PASSTHROUGH_IMAGES` set, while the entrypoint still printed `image preload/passthrough complete`. A misconfigured deployment (forgotten host-socket mount, host missing that exact ref, or the `mode=public` filter dropping a locally-built/private image) therefore looked perfectly healthy right up until the slow re-pull. This is the recurring production symptom behind already-closed issues #94 and #102 (still observed downstream in link-assistant/hive-mind#1914/#1946).

The issue asked for three code-level fixes:
Fix
The robust, root-cause catch-all is verification, which subsumes fix #1: any silent no-op path (no socket, unreachable socket, host lacks the ref, mode filter drop, or a `save|load` failure) leaves a named allowlist image missing → loud warning + an honest terminal marker.

- `verify_passthrough_images()` (new): after `preload_tarballs` / `passthrough_host_images` / `preload_images`, for every concrete `DIND_HOST_PASSTHROUGH_IMAGES` entry it runs `docker image inspect` against the nested daemon. Missing images produce a loud, actionable warning that distinguishes the two failure modes (host socket reachable but lacks the ref / dropped by `mode=public`, vs. no usable socket mounted) and names the remediation.
- `ref_is_concrete()` (new): only entries with an explicit tag or `@sha256:` digest are verified. Bare repos (`konard/hive-mind`) and globs (`konard/hive-mind*`) are skipped to avoid false alarms (the host may hold them under any tag). A `:` is treated as a tag separator only in the last path segment, so a registry port like `host:5000/repo` is correctly non-concrete.
- `image preload/passthrough complete` only when every concrete allowlisted image is actually present; otherwise `image preload/passthrough finished WITH WARNINGS`.

How to reproduce / verify
- `experiments/preload-unit-test.sh` (source-only mode): new cases: concrete image present → honest "complete", no warning; concrete absent → loud "did NOT seed" warning + "finished WITH WARNINGS", no "complete"; glob/bare entries never false-alarm; `ref_is_concrete` classification. 62 passed, 0 failed.
- `tests/dind/example-preload-images.sh` (runs in CI on the `js` variant, `--privileged`): new `verify_ok` case (concrete allowlist + socket mounted, `DIND_HOST_PASSTHROUGH=all` → image present, "complete", no "did NOT seed") and `verify_miss` case (concrete allowlist, no socket → "did NOT seed expected image(s)…", "finished WITH WARNINGS", "complete" absent).

Docs
- `docs/dind/USAGE.md`: new "Verifying the copy actually happened (issue #106)" subsection + updated `DIND_HOST_PASSTHROUGH_IMAGES` env-var row.
- `README.md`: verification clause on the host-image-reuse note (pin a concrete tag/digest; loud warning instead of false "complete").

Versioning
Added `.changeset/issue-106-passthrough-verify.md` (bump: patch). VERSION is bumped automatically by CI on merge.

Fixes #106
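The verification pass described under Fix can be sketched roughly as below. This is an illustration under stated assumptions, not the entrypoint's actual code: the `_sketch` function names are hypothetical, the comma-separated list format is assumed, the warning wording is invented for the example, and `INSPECT_CMD` is a hook added here so the logic can be tried without a nested daemon. Only the two terminal marker strings and the use of `docker image inspect` come from the description above.

```shell
# Hypothetical sketch of the post-passthrough verification; not the real entrypoint.
# INSPECT_CMD is an assumption introduced for illustration.
INSPECT_CMD=${INSPECT_CMD:-docker image inspect}

# ref_is_concrete_sketch REF
# Returns 0 only for refs with an explicit tag or an @sha256: digest.
ref_is_concrete_sketch() {
  local ref="$1"
  local last
  case "$ref" in
    *'*'*) return 1 ;;          # glob: skipped, the host may hold it under any tag
    *@sha256:*) return 0 ;;     # digest-pinned
  esac
  last="${ref##*/}"             # last path segment, so host:5000/repo is not a tag
  case "$last" in
    *:?*) return 0 ;;           # explicit tag
    *) return 1 ;;              # bare repo
  esac
}

# verify_passthrough_sketch LIST
# LIST is assumed to be comma-separated. Prints the terminal marker and
# returns non-zero when any concrete entry is missing.
verify_passthrough_sketch() {
  local list="$1"
  local ref
  local missing=0
  while IFS= read -r ref; do
    [ -n "$ref" ] || continue
    ref_is_concrete_sketch "$ref" || continue
    if ! $INSPECT_CMD "$ref" >/dev/null 2>&1 </dev/null; then
      # Illustrative wording only; the real warning text differs.
      echo "WARNING (sketch): expected image missing from nested daemon: $ref" >&2
      missing=1
    fi
  done <<EOF
$(printf '%s' "$list" | tr ',' '\n')
EOF
  if [ "$missing" -eq 0 ]; then
    echo "image preload/passthrough complete"
  else
    echo "image preload/passthrough finished WITH WARNINGS"
  fi
  return "$missing"
}
```

Skipping non-concrete entries keeps the check free of false alarms, at the cost of not verifying bare repos or globs at all; pinning a concrete tag or digest is what makes an entry checkable.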