box-dind: verify host-image passthrough seeded the nested daemon and stop falsely reporting 'complete' (issue #106) #107

Merged
konard merged 4 commits into main from issue-106-81bf2d7bb66f on Jun 19, 2026

konard commented Jun 19, 2026

Problem (issue #106)

box-dind's nested Docker daemon kept re-downloading multi-GB host images (~30 GB, ~1 hour) on the first nested docker run even with DIND_HOST_PASSTHROUGH_IMAGES set, while the entrypoint still printed image preload/passthrough complete. A misconfigured deployment — forgotten host-socket mount, host missing that exact ref, or the mode=public filter dropping a locally-built/private image — therefore looked perfectly healthy right up until the slow re-pull. This is the recurring production symptom behind already-closed issues #94 and #102 (still observed downstream in link-assistant/hive-mind#1914/#1946).

The issue asked for three code-level fixes:

  1. Fail loudly, not silently — never print "complete" when nothing was copied.
  2. Verify the copy — after passthrough, assert the allowlisted image is actually present in the nested daemon.
  3. Document the required deployment wiring (socket mount + allowlist).

Fix

The robust, root-cause catch-all is verification, which subsumes fix 1: any silent no-op path (no socket, unreachable socket, host lacks the ref, mode filter drop, or a save|load failure) leaves a named allowlist image missing, which now triggers a loud warning and an honest terminal marker.

  • verify_passthrough_images() (new): after preload_tarballs/passthrough_host_images/preload_images, for every concrete DIND_HOST_PASSTHROUGH_IMAGES entry it runs docker image inspect against the nested daemon. Missing images produce a loud, actionable warning that distinguishes the two failure modes (host socket reachable but lacks the ref / dropped by mode=public, vs. no usable socket mounted) and names the remediation.
  • ref_is_concrete() (new): only entries with an explicit tag or @sha256: digest are verified. Bare repos (konard/hive-mind) and globs (konard/hive-mind*) are skipped to avoid false alarms (the host may hold them under any tag). A : is treated as a tag separator only in the last path segment, so a registry port like host:5000/repo is correctly non-concrete.
  • The terminal line is now honest: image preload/passthrough complete only when every concrete allowlisted image is actually present; otherwise image preload/passthrough finished WITH WARNINGS.
  • No auto-pull fallback — that would mask the config error and incur the same multi-GB download. The re-pull still happens naturally; the loud warning is the explicit signal.
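
The classification rule for ref_is_concrete() can be illustrated with a minimal sketch. This is a re-implementation written from the description above, not the code in dind-entrypoint.sh; in particular, treating `?` and `[` as glob characters alongside `*` is an assumption.

```shell
#!/usr/bin/env bash
# Sketch of the rule described above; the real ref_is_concrete() may differ.
ref_is_concrete() {
  local ref="$1"
  # Globs can match any tag, so they are never verified.
  # (Assumption: '?' and '[' count as glob characters as well as '*'.)
  case "$ref" in
    *'*'*|*'?'*|*'['*) return 1 ;;
  esac
  # A pinned digest is always concrete.
  case "$ref" in
    *@sha256:*) return 0 ;;
  esac
  # Only a ':' in the last path segment is a tag separator, so the ':' in
  # 'host:5000/repo' (registry port) does not make the ref concrete.
  local last="${ref##*/}"
  case "$last" in
    *:?*) return 0 ;;
  esac
  return 1
}

# Usage: print how each example entry would be classified.
for r in 'konard/hive-mind:latest' 'konard/hive-mind' 'konard/hive-mind*' 'host:5000/repo'; do
  if ref_is_concrete "$r"; then echo "$r: concrete"; else echo "$r: skipped"; fi
done
```

Restricting the tag check to the last path segment is what keeps a bare repo on a registry with a port from being verified by mistake.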

How to reproduce / verify

  • experiments/preload-unit-test.sh (source-only mode): new cases — concrete image present → honest "complete", no warning; concrete absent → loud "did NOT seed" warning + "finished WITH WARNINGS", no "complete"; glob/bare entries never false-alarm; ref_is_concrete classification. 62 passed, 0 failed.
  • tests/dind/example-preload-images.sh (runs in CI on the js variant, --privileged): new verify_ok case (concrete allowlist + socket mounted, DIND_HOST_PASSTHROUGH=all → image present, "complete", no "did NOT seed") and verify_miss case (concrete allowlist, no socket → "did NOT seed expected image(s)…", "finished WITH WARNINGS", "complete" absent).
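
The verify_ok and verify_miss expectations amount to checks on the captured entrypoint logs. A minimal sketch follows, assuming the logs are already available as a string; the helper name check_seed_outcome and its return convention are hypothetical, while the marker strings are the ones quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check captured logs against the expected outcome.
# $1 = log text, $2 = "ok" (image seeded) or "miss" (image not seeded).
check_seed_outcome() {
  local logs="$1" expected="$2"
  case "$expected" in
    ok)
      # Honest "complete" must be present and no seeding warning may appear.
      printf '%s\n' "$logs" | grep -qF 'image preload/passthrough complete' || return 1
      if printf '%s\n' "$logs" | grep -qF 'did NOT seed'; then return 1; fi
      ;;
    miss)
      # Warning and "finished WITH WARNINGS" must appear; "complete" must not.
      printf '%s\n' "$logs" | grep -qF 'did NOT seed' || return 1
      printf '%s\n' "$logs" | grep -qF 'finished WITH WARNINGS' || return 1
      if printf '%s\n' "$logs" | grep -qF 'image preload/passthrough complete'; then return 1; fi
      ;;
    *) return 2 ;;
  esac
}
```

In a real test the first argument would come from something like `docker logs CONTAINER 2>&1`, after waiting for the terminal line to appear.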

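Docs

docs/dind/USAGE.md and README.md now document the required deployment wiring (socket mount + allowlist).
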
Versioning

Added .changeset/issue-106-passthrough-verify.md (bump: patch). VERSION is bumped automatically by CI on merge.

Fixes #106

Adding .gitkeep for PR creation (default mode).
This file will be removed when the task is complete.

Issue: #106
konard self-assigned this Jun 19, 2026
…p falsely reporting 'complete' (issue #106)

box-dind kept re-downloading multi-GB host images on first nested 'docker run'
even with DIND_HOST_PASSTHROUGH_IMAGES set, while the entrypoint still printed
'image preload/passthrough complete' — so a misconfigured deployment (forgotten
host docker socket mount, host missing that exact ref, or the mode=public filter
dropping a locally-built/private image) looked healthy right up until the slow
re-pull (the recurring symptom behind closed issues #94 and #102).

The entrypoint now verifies the copy after passthrough: for every concrete
allowlist entry (explicit tag or @sha256: digest — bare repos and globs are
skipped to avoid false alarms) it runs 'docker image inspect' against the nested
daemon. A missing expected image now produces a loud, actionable warning (host
socket reachable but lacks the ref / filtered by mode, or no usable socket
mounted) and the terminal line becomes 'image preload/passthrough finished WITH
WARNINGS' instead of the misleading 'complete'. No silent no-op path can report
success anymore. We do not auto-pull as a fallback — that would mask the config
error and incur the same multi-GB download; the re-pull still happens naturally.

Adds ref_is_concrete() + verify_passthrough_images() helpers, new unit cases in
experiments/preload-unit-test.sh (62 passing), verify_ok/verify_miss assertions
in the CI-run tests/dind/example-preload-images.sh, and docs in docs/dind/USAGE.md
and README.md.
konard changed the title from "[WIP] box-dind: nested daemon still re-downloads host images (~30 GB) despite passthrough — continuation of #94 / #102" to "box-dind: verify host-image passthrough seeded the nested daemon and stop falsely reporting 'complete' (issue #106)" Jun 19, 2026
konard marked this pull request as ready for review June 19, 2026 10:44
konard commented Jun 19, 2026

Working session summary

Done. Issue #106 is resolved on branch issue-106-81bf2d7bb66f and PR #107 is updated and marked ready for review.

Summary

Root cause: box-dind's entrypoint printed image preload/passthrough complete unconditionally — even when the passthrough seeded nothing (missing socket mount, host lacking the exact ref, or the mode=public filter dropping a private/locally-built image). The deployment looked healthy until the first nested docker run triggered a multi-GB re-pull (~30 GB, ~1 hour).

Fix (ubuntu/24.04/dind/dind-entrypoint.sh):

  • verify_passthrough_images() — after all preload paths run, docker image inspect confirms every concrete DIND_HOST_PASSTHROUGH_IMAGES entry actually landed in the nested daemon. Missing images get a loud, actionable warning distinguishing the failure modes (socket reachable but ref absent/filtered vs. no socket mounted).
  • ref_is_concrete() — only verifies entries with an explicit tag/digest; skips bare repos and globs (and correctly treats host:5000/repo ports as non-concrete) to avoid false alarms.
  • The terminal line is now honest: complete only when everything expected is present, otherwise finished WITH WARNINGS. No auto-pull fallback (that would mask the config error and incur the same download).
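
The verification step summarised above could look roughly like the following sketch. It is not the code from dind-entrypoint.sh: the comma/whitespace-separated allowlist format, the exact warning wording, the non-zero return value and the compact ref_is_concrete stand-in are all assumptions, and the real function additionally distinguishes the two failure modes, which is omitted here.

```shell
#!/usr/bin/env bash
# Compact stand-in for the concrete-ref check (assumed behaviour).
ref_is_concrete() {
  case "$1" in *'*'*) return 1 ;; *@sha256:*) return 0 ;; esac
  case "${1##*/}" in *:?*) return 0 ;; esac
  return 1
}

# Sketch: confirm every concrete allowlist entry exists in the nested daemon.
verify_passthrough_images() {
  local missing="" ref
  # Assumption: entries are separated by commas and/or whitespace.
  while IFS= read -r ref; do
    [ -n "$ref" ] || continue
    ref_is_concrete "$ref" || continue        # skip bare repos and globs
    if ! docker image inspect "$ref" >/dev/null 2>&1 </dev/null; then
      missing="$missing $ref"
    fi
  done <<EOF
$(printf '%s' "${DIND_HOST_PASSTHROUGH_IMAGES:-}" | tr ', ' '\n\n')
EOF
  if [ -n "$missing" ]; then
    echo "[dind-entrypoint] WARNING: passthrough did NOT seed expected image(s):$missing" >&2
    echo "[dind-entrypoint] image preload/passthrough finished WITH WARNINGS"
    return 1   # assumption: lets callers and tests branch on the outcome
  fi
  echo "[dind-entrypoint] image preload/passthrough complete"
}
```

Reading the entries line by line rather than word-splitting an unquoted variable keeps glob entries such as `konard/hive-mind*` from being expanded against the filesystem before they are classified.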

Verification:

  • experiments/preload-unit-test.sh: 62 passed, 0 failed (new present/absent/glob/classification cases).
  • tests/dind/example-preload-images.sh — new verify_ok/verify_miss CI assertions.

Docs: docs/dind/USAGE.md + README.md updated; .changeset/issue-106-passthrough-verify.md (bump: patch) added (CI bumps VERSION on merge).

PR: #107


This summary was automatically extracted from the AI working session output.

konard commented Jun 19, 2026

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

  • Public pricing estimate: $4.530429
  • Calculated by Anthropic: $5.270439
  • Difference: $0.740010 (+16.33%)

📊 Context and tokens usage:

Claude Opus 4.8: (2 sub-sessions)

  1. 116.3K / 1M (12%) input tokens, 35.6K / 128K (28%) output tokens
  2. 38.0K / 1M (4%) input tokens, 6.0K / 128K (5%) output tokens

Total: (12.2K new + 197.3K cache writes + 4.1M cache reads) input tokens, 47.9K output tokens, $4.530429 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.8 (claude-opus-4-8)

📎 Log file uploaded as Gist (2034KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.

konard commented Jun 19, 2026

🔄 Auto-restart triggered (iteration 1)

Reason: CI failures detected

Starting new session to address the issues.


Auto-restart-until-mergeable mode is active. This run will stop after 5 restart iterations.

…ess (fix flaky dind-js)

The CI dind-js example suite failed in example-storage-driver-vfs.sh:
'expected the vfs warning to mention "'vfs' storage driver", but it was
absent from the container logs'. The vfs copy-on-write warning (issue #104)
is emitted by the entrypoint (PID 1) *after* the inner dockerd becomes ready,
in warn_if_vfs_storage_driver. The test checked the logs once, the instant
wait_for_inner_docker returned — and the test's external 'docker exec docker
info' readiness probe can win the race against the entrypoint's own readiness
loop, grabbing the logs before the warning is flushed. The dumped logs at the
failure showed only '[dind-entrypoint] Starting dockerd (storage-driver=vfs)',
confirming the warning simply had not been printed yet.

Add a bounded wait_for_logs CONTAINER NEEDLE [LIMIT] helper to lib.sh (poll
logs_contain until the line appears or the timeout elapses) and use it for the
three vfs-warning needles, mirroring the existing wait_for_preload_complete
pattern in example-preload-images.sh. No production code change.
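
A bounded polling helper of the kind described in this commit message might look like the sketch below. Only the wait_for_logs CONTAINER NEEDLE [LIMIT] signature and the logs_contain name come from the message; the body of logs_contain, the one-second poll interval and the 30-second default limit are assumptions.

```shell
#!/usr/bin/env bash
# Assumed stand-in for the existing lib.sh helper of the same name.
logs_contain() {
  docker logs "$1" 2>&1 | grep -qF -- "$2"
}

# wait_for_logs CONTAINER NEEDLE [LIMIT]
# Poll once per second until NEEDLE appears in the container logs or
# LIMIT seconds (default 30, an assumed value) have elapsed.
wait_for_logs() {
  local container="$1" needle="$2" limit="${3:-30}" waited=0
  while ! logs_contain "$container" "$needle"; do
    if [ "$waited" -ge "$limit" ]; then
      echo "timed out after ${limit}s waiting for: $needle" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

Polling with a bound removes the race between the test's readiness probe and the entrypoint's later warning, while still failing in finite time if the warning never appears.
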
konard commented Jun 19, 2026

Working session summary

I'll wait for the background monitors to notify me of completion. The fix is in place and verified for the originally-failing dind-js job; the remaining long-running jobs just need time to finish.


This summary was automatically extracted from the AI working session output.

konard commented Jun 19, 2026

🔄 Auto-restart-until-mergeable Log (iteration 1)

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

  • Public pricing estimate: $2.003767
  • Calculated by Anthropic: $7.540569
  • Difference: $5.536802 (+276.32%)

📊 Context and tokens usage:

  • 83.7K / 1M (8%) input tokens, 15.4K / 128K (12%) output tokens

Total: (3.8K new + 71.0K cache writes + 2.3M cache reads) input tokens, 15.4K output tokens, $2.003767 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.8 (claude-opus-4-8)

📎 Log file uploaded as Gist (4384KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.

konard commented Jun 19, 2026

✅ Ready to merge

This pull request is now ready to be merged:

  • All CI checks have passed
  • No merge conflicts
  • No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

konard merged commit 765ea8a into main Jun 19, 2026
23 checks passed
konard added a commit that referenced this pull request Jun 19, 2026
…-essential files (issue #108)

Commit eaeed07 (in PR #107) reverted a single task-details file yet the
pull_request build ran the full image build/test matrix — a 32m48s dind-full
job and a 32m21s full job included — for a change that cannot affect any
Docker image (run 27826090748). The root cause: the detect-changes job in
release.yml diffed the WHOLE pull request against its base SHA
(`git diff --name-only "$BASE_SHA" HEAD`, BASE_SHA =
github.event.pull_request.base.sha), so a trivial synchronize commit re-ran
everything whenever any EARLIER commit on the same PR had touched image source.

Fix: extract the four inline detect-changes steps into a single, unit-tested
scripts/ci/detect-changes.sh that, for pull_request events, diffs ONLY the PR
head's latest commit (HEAD^2^..HEAD^2 against GitHub's synthetic merge commit)
instead of the whole-PR diff. A trailing .gitkeep / docs / Markdown / changeset
commit therefore skips the ~33-minute matrix — earlier commits were already
tested when pushed. push events still evaluate the full pushed range
(github.event.before..HEAD) so release builds are never skipped. Checkout now
uses fetch-depth: 0 so the merge commit's parents resolve. docs/dind/ no longer
triggers the dind matrix (docs are not a build input). This mirrors the
per-commit detect-code-changes scripts shipped by the link-foundation
js/rust/python/csharp pipeline templates — they were already correct, so
nothing to report upstream.
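
The per-commit classification can be sketched as a small filter over changed paths. This is not the actual scripts/ci/detect-changes.sh: the function name should_build, the pattern list and the handling of an empty diff are assumptions derived from the file kinds named above.

```shell
#!/usr/bin/env bash
# Hypothetical classifier: reads changed paths on stdin and prints "true"
# if any path could affect a Docker image, otherwise "false".
should_build() {
  local path
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    case "$path" in
      # Assumed non-essential patterns: .gitkeep, Markdown, docs, changesets.
      *.gitkeep|*.md|docs/*|.changeset/*) ;;
      *) echo true; return 0 ;;
    esac
  done
  echo false
}

# On a pull_request checkout HEAD is GitHub's synthetic merge commit, so
# HEAD^2 is the PR head and HEAD^2^ its parent (requires fetch-depth: 0):
#   git diff --name-only 'HEAD^2^..HEAD^2' | should_build
```

Because only the latest commit's paths are fed in, a trailing docs-only commit yields "false" even when earlier commits on the same PR touched image sources.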

Job-level `if: needs.detect-changes.outputs.should-build == 'true'` gating is
unchanged, so skipped jobs report success and the PR stays mergeable (a
whole-workflow `on: paths` skip would have left required checks Pending).

Covered by experiments/test-issue108-detect-changes.sh (26 assertions: pure
classification truth table, real synthetic-merge-commit git ranges, and a
release.yml wiring regression guard). Full analysis, timeline, requirement
matrix, root causes and template comparison in docs/case-studies/issue-108/.

Closes #108