Add sim-monitor health investigation writeup (NFS + DNS root causes) by jcschaff · Pull Request #1710 · virtualcell/vcell

jcschaff · 2026-06-11T14:38:10Z

Summary

Adds docs/sim-monitor-warning-2026-06-01.md, the writeup of the multi-day investigation behind the recurring "VCell Release Health - Sim" Critical incidents (GET /api/v0/health?check=sim).

It captures three distinct, independently-diagnosed issues so the next person doesn't have to re-derive them:

- Canary overhead + cold-NFS SIF reads on the 155.37.250.x path (since fixed by HPC; nodes now read /share/apps/vcell3 from 155.37.241.x at ~950 MB/s). Includes the per-node and two-filer throughput probes.
- Two maintenance-window collateral failures: a missing /share/apps/vcell12/users bind-mount (instant exit-255 container-creation failure) and sbatch submission timeouts under a submission burst.
- The confirmed root cause of the recurring submission-timeout incidents: the classic Kubernetes 5-second DNS timeout in the submit pod (A/AAAA conntrack race to the single kube-dns ClusterIP, amplified by ndots:5), which stacks past VCell's 10 s sbatch submit timeout → JOB_FAILED. Measured directly: 13/80 raw getent lookups at ~5 s.
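For readers who want to reproduce the "13/80 lookups at ~5 s" style of measurement, here is a minimal sketch of a latency probe. It is not the script used in the investigation: the function names, the 4.5 s threshold, and the use of `getent hosts` (rather than another getent database) are assumptions made for illustration.

```python
import subprocess
import time
from typing import Callable, List


def time_lookups(lookup: Callable[[], None], n: int,
                 clock: Callable[[], float] = time.monotonic) -> List[float]:
    """Run `lookup` n times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(n):
        start = clock()
        try:
            lookup()
        except Exception:
            pass  # a failed lookup still tells us how long the resolver stalled
        durations.append(clock() - start)
    return durations


def count_slow(durations: List[float], threshold_s: float = 4.5) -> int:
    """Count lookups that took at least threshold_s (assumed cutoff for a ~5 s stall)."""
    return sum(1 for d in durations if d >= threshold_s)


def getent_lookup(host: str) -> Callable[[], None]:
    """Build a lookup that shells out to `getent hosts <host>` (assumed invocation)."""
    def _run() -> None:
        subprocess.run(["getent", "hosts", host],
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       check=False)
    return _run
```

Run inside the submit pod, something like `count_slow(time_lookups(getent_lookup("slurm-login.example"), 80))` would report how many of 80 lookups stalled; the hostname here is a placeholder, not the real submit target.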

The DNS fix ships separately in vcell-fluxcd#26 (dnsConfig: single-request-reopen + ndots:2 on submit/sched). This PR is the documentation/evidence record.
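To illustrate why lowering ndots matters, the sketch below models the order in which a glibc-style resolver tries candidate names. It is a simplified model, not the resolver's actual code: the search list assumes a pod in a hypothetical `default` namespace, the hostname is a placeholder, and real resolvers stop at the first successful answer.

```python
from typing import List


def candidate_queries(name: str, search: List[str], ndots: int) -> List[str]:
    """Return the candidate FQDNs in the order a glibc-style resolver would try them.

    Simplified model: a trailing dot means "absolute, no search list"; a name
    with at least `ndots` dots is tried as-is first; otherwise the search
    domains are tried first and the name as-is comes last.
    """
    if name.endswith("."):
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


# Hypothetical search list for a pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for ndots in (5, 2):
    order = candidate_queries("login.hpc.example", SEARCH, ndots)
    print(f"ndots:{ndots} tries the real name at position {order.index('login.hpc.example.') + 1}")
```

Under this model, an external name with two dots is tried only after three search-domain candidates with ndots:5, but first with ndots:2. Each candidate is typically resolved with an A and an AAAA query, so every extra candidate is another chance to hit the 5 s stall, and two stalls alone consume the 10 s submit budget.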

Notes

Docs-only; no code changes.
The dated filename reflects when the investigation started (2026-06-01); content runs through the DNS confirmation on 06-10.

🤖 Generated with Claude Code

Documents the multi-day investigation behind the "VCell Release Health - Sim" (/api/v0/health?check=sim) Critical incidents:

- The canary's per-job overhead and the cold-NFS SIF-read fault on the 155.37.250.x path (since fixed by HPC; nodes now on 155.37.241.x).
- Two collateral maintenance-window failure modes (missing vcell12 bind-mount; sbatch submission timeouts under a submission burst).
- The CONFIRMED root cause of the recurring submission-timeout incidents: the classic Kubernetes 5s DNS timeout in the submit pod (A/AAAA conntrack race to the single kube-dns ClusterIP, amplified by ndots:5), which stacks past VCell's 10s sbatch submit timeout. Fixed in vcell-fluxcd (dnsConfig: single-request-reopen + ndots:2 on submit/sched).

Includes the evidence chain, the throughput/latency probe results, and the ranked fixes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jcschaff wants to merge 1 commit into master from docs-sim-monitor-dns-investigation