
Add sim-monitor health investigation writeup (NFS + DNS root causes) #1710

Open
jcschaff wants to merge 1 commit into master from docs-sim-monitor-dns-investigation


@jcschaff

Member

Summary

Adds docs/sim-monitor-warning-2026-06-01.md, the writeup of the multi-day investigation behind the recurring "VCell Release Health - Sim" Critical incidents (GET /api/v0/health?check=sim).

It captures three distinct, independently-diagnosed issues so the next person doesn't have to re-derive them:

  1. Canary overhead + cold-NFS SIF reads on the 155.37.250.x path (since fixed by HPC; nodes now read /share/apps/vcell3 from 155.37.241.x at ~950 MB/s). Includes the per-node and two-filer throughput probes.
  2. Two collateral failures from the maintenance window: a missing /share/apps/vcell12/users bind-mount (container creation fails instantly with exit 255) and sbatch submission timeouts under a submission burst.
  3. The confirmed root cause of the recurring submission-timeout incidents: the classic Kubernetes 5-second DNS timeout in the submit pod (A/AAAA conntrack race to the single kube-dns ClusterIP, amplified by ndots:5). The 5 s stalls stack past VCell's 10 s sbatch submit timeout, and the job ends in JOB_FAILED. Measured directly: 13 of 80 raw getent lookups took ~5 s.
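For anyone who wants to repeat a measurement like the one in item 3 from inside a pod, here is a minimal sketch of a lookup-latency probe. It is not the script used in the investigation (that used raw getent). The function names, the 4.5 s threshold, and the injectable `resolve`/`clock` parameters are assumptions made for illustration.

```python
import socket
import time
from typing import Callable, List


def default_resolve(host: str) -> object:
    # getaddrinfo with no address family asks for both A and AAAA records,
    # which is the lookup pattern the conntrack race affects.
    return socket.getaddrinfo(host, None)


def time_lookups(
    host: str,
    count: int,
    resolve: Callable[[str], object] = default_resolve,
    clock: Callable[[], float] = time.monotonic,
) -> List[float]:
    """Resolve `host` `count` times; return each lookup's wall time in seconds."""
    durations = []
    for _ in range(count):
        start = clock()
        try:
            resolve(host)
        except OSError:
            # socket.gaierror is an OSError; a failed lookup still cost
            # wall time, so its duration is recorded as well.
            pass
        durations.append(clock() - start)
    return durations


def count_slow(durations: List[float], threshold_s: float = 4.5) -> int:
    """Count lookups that took at least `threshold_s` (assumed cutoff for a ~5 s stall)."""
    return sum(1 for d in durations if d >= threshold_s)


# Example use (hostname is hypothetical, substitute the real submit target):
#   d = time_lookups("slurm-login.example.org", 80)
#   print(f"{count_slow(d)}/{len(d)} lookups stalled")
```

Comparing the stalled count before and after the dnsConfig change is one way to check the fix from the same pod.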

The DNS fix ships separately in vcell-fluxcd#26 (dnsConfig: single-request-reopen + ndots:2 on submit/sched). This PR is the documentation/evidence record.
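To illustrate why ndots matters here, the sketch below models glibc-style search-list expansion: a name with fewer dots than ndots is tried against every search domain before it is tried as-is, and each attempt sends an A and an AAAA query. The hostname and the `vcell` namespace in the search list are hypothetical, and the model assumes only the absolute name resolves.

```python
from typing import List


def candidate_names(name: str, search: List[str], ndots: int) -> List[str]:
    """Order in which a glibc-style resolver would try names (simplified model)."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search-list expansion
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]      # too few dots: search domains first


def queries_before_answer(
    name: str, search: List[str], ndots: int, per_candidate: int = 2
) -> int:
    """Queries sent up to and including the absolute name (A + AAAA per candidate)."""
    names = candidate_names(name, search, ndots)
    absolute = name if name.endswith(".") else name + "."
    return (names.index(absolute) + 1) * per_candidate


# Hypothetical pod search list and external hostname (3 dots).
search = ["vcell.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(queries_before_answer("login.hpc.example.edu", search, ndots=5))  # 8
print(queries_before_answer("login.hpc.example.edu", search, ndots=2))  # 2
```

Under these assumptions, ndots:2 cuts the queries per lookup from 8 to 2, so there are fewer chances to hit the race. single-request-reopen addresses the race itself by sending the A and AAAA queries sequentially, reopening the socket between them.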

Notes

  • Docs-only; no code changes.
  • The dated filename reflects when the investigation started (2026-06-01); content runs through the DNS confirmation on 06-10.

🤖 Generated with Claude Code

Documents the multi-day investigation behind the "VCell Release Health - Sim"
(/api/v0/health?check=sim) Critical incidents:

- The canary's per-job overhead and the cold-NFS SIF-read fault on the
  155.37.250.x path (since fixed by HPC; nodes now on 155.37.241.x).
- Two collateral maintenance-window failure modes (missing vcell12 bind-mount;
  sbatch submission timeouts under a submission burst).
- The CONFIRMED root cause of the recurring submission-timeout incidents: the
  classic Kubernetes 5s DNS timeout in the submit pod (A/AAAA conntrack race to
  the single kube-dns ClusterIP, amplified by ndots:5), which stacks past
  VCell's 10s sbatch submit timeout. Fixed in vcell-fluxcd (dnsConfig:
  single-request-reopen + ndots:2 on submit/sched).

Includes the evidence chain, the throughput/latency probe results, and the
ranked fixes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>