Add sim-monitor health investigation writeup (NFS + DNS root causes) #1710
Open
jcschaff wants to merge 1 commit into
Documents the multi-day investigation behind the "VCell Release Health - Sim" (/api/v0/health?check=sim) Critical incidents:

- The canary's per-job overhead and the cold-NFS SIF-read fault on the 155.37.250.x path (since fixed by HPC; nodes now on 155.37.241.x).
- Two collateral maintenance-window failure modes (missing vcell12 bind-mount; sbatch submission timeouts under a submission burst).
- The CONFIRMED root cause of the recurring submission-timeout incidents: the classic Kubernetes 5s DNS timeout in the submit pod (A/AAAA conntrack race to the single kube-dns ClusterIP, amplified by ndots:5), which stacks past VCell's 10s sbatch submit timeout. Fixed in vcell-fluxcd (dnsConfig: single-request-reopen + ndots:2 on submit/sched).

Includes the evidence chain, the throughput/latency probe results, and the ranked fixes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Adds `docs/sim-monitor-warning-2026-06-01.md`, the writeup of the multi-day investigation behind the recurring "VCell Release Health - Sim" Critical incidents (`GET /api/v0/health?check=sim`).

It captures three distinct, independently diagnosed issues so the next person doesn't have to re-derive them:

- The canary's per-job overhead and the cold-NFS SIF-read fault on the `155.37.250.x` path (since fixed by HPC; nodes now read `/share/apps/vcell3` from `155.37.241.x` at ~950 MB/s). Includes the per-node and two-filer throughput probes.
- Two collateral maintenance-window failure modes: a missing `/share/apps/vcell12/users` bind-mount (instant exit-255 container-creation failure) and `sbatch` submission timeouts under a submission burst.
- The confirmed root cause of the recurring submission-timeout incidents: the classic Kubernetes 5 s DNS timeout in the `submit` pod (A/AAAA conntrack race to the single kube-dns ClusterIP, amplified by `ndots:5`), which stacks past VCell's 10 s `sbatch` submit timeout → `JOB_FAILED`. Measured directly: 13 of 80 raw `getent` lookups took ~5 s.

The DNS fix ships separately in vcell-fluxcd#26 (`dnsConfig: single-request-reopen + ndots:2` on `submit`/`sched`). This PR is the documentation/evidence record.

Notes
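For readers unfamiliar with the DNS mechanism summarized above, the following is a minimal sketch (not taken from the writeup) of two parts of the reasoning: how `ndots` changes the number of queries a glibc-style resolver sends for an external hostname, and how often two ~5 s stalls could stack within one submit under a simple independence model. The search list, the namespace `vcell`, the hostname `slurm-login.example.edu`, and the number of lookups per submit are all hypothetical; only the 13/80 stall rate comes from the measurements reported in this PR.

```python
# Assumption: a typical Kubernetes pod search list; the namespace "vcell"
# is hypothetical and not taken from the actual cluster configuration.
SEARCH = ["vcell.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def candidate_names(name, ndots, search=SEARCH):
    """Return the names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        return [name]  # trailing dot: fully qualified, search list skipped
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # enough dots: try the name as-is first
    return expanded + [name]      # too few dots: walk the search list first


def queries_until_answer(name, ndots, search=SEARCH):
    """Count DNS queries sent, assuming only the bare name resolves.

    Each candidate costs two queries (A and AAAA), which is the pair
    involved in the conntrack race.
    """
    tried = candidate_names(name, ndots, search)
    return 2 * (tried.index(name) + 1)


def p_two_or_more_stalls(p_stall, lookups):
    """Probability that at least two of `lookups` independent lookups stall.

    Two ~5 s stalls add up to roughly the 10 s submit timeout. Independence
    between lookups is a modelling assumption, not a measured property.
    """
    none = (1 - p_stall) ** lookups
    one = lookups * p_stall * (1 - p_stall) ** (lookups - 1)
    return 1 - none - one


if __name__ == "__main__":
    host = "slurm-login.example.edu"  # hypothetical external hostname
    for ndots in (5, 2):
        print(f"ndots:{ndots} -> {queries_until_answer(host, ndots)} queries")
    measured = 13 / 80  # stall rate reported in this PR
    for n in (2, 3, 4):  # lookups per submit: hypothetical values
        print(f"{n} lookups -> P(>=2 stalls) = {p_two_or_more_stalls(measured, n):.3f}")
```

Under these assumptions the hypothetical hostname costs 8 queries with `ndots:5` and 2 with `ndots:2`, which is one way to read "amplified by `ndots:5`": more A/AAAA pairs per lookup means more chances to hit the race. The probability figures depend entirely on the assumed number of lookups per submit and should be treated as illustrative only.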
🤖 Generated with Claude Code