fix(stats): degrade gracefully instead of rendering 0/0 on transient API failures #139

Open
moshemalawach wants to merge 1 commit into main from fix/zero-render-resilience

@moshemalawach

Summary

Fixes the Overview dashboard intermittently rendering 0 nodes / 0 VMs (marketing backlog P0-03).

Root cause: getOverviewStats() was an all-or-nothing Promise.all over /api/v1/stats plus every page of /api/v1/vms (~52 requests) and /api/v1/nodes. One transient failure rejected the whole query, and StatCard coerced the missing data into a literal "0 / 0".
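
To make the failure mode concrete, here is a minimal sketch of an all-or-nothing aggregate. The helper names (`fetchStats`, `fetchVmPage`, `allOrNothing`), the `total_nodes` field, and the numbers are hypothetical stand-ins, not the actual code in src/api/client.ts.

```typescript
// Hypothetical stand-ins for the real requests; names and shapes are assumptions.
type Stats = { total_vms: number; total_nodes: number };

async function fetchStats(): Promise<Stats> {
  return { total_vms: 100, total_nodes: 10 }; // made-up numbers
}

// Simulates one page of the VM listing; `flakyPage` throws to mimic a transient error.
async function fetchVmPage(page: number, flakyPage: number | null): Promise<string[]> {
  if (page === flakyPage) throw new Error(`transient failure on page ${page}`);
  return [`vm-${page}`];
}

// All-or-nothing aggregation: any rejected page rejects the whole result,
// so the caller is left with no stats at all, even though /stats succeeded.
async function allOrNothing(pages: number, flakyPage: number | null) {
  const pageRequests = Array.from({ length: pages }, (_, i) => fetchVmPage(i, flakyPage));
  const [stats, vmPages] = await Promise.all([fetchStats(), Promise.all(pageRequests)]);
  const vms = ([] as string[]).concat(...vmPages);
  return { stats, vms };
}
```

With roughly 52 requests in flight, the chance that at least one of them fails on a given poll is much higher than the chance that the single stats call fails, which is why the dashboard dropped to zeros intermittently.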

Changes

  • src/api/client.ts — headline totals now come from the single cheap /api/v1/stats call; the VM/node page fan-outs get .catch fallbacks so derived breakdowns degrade to empty instead of rejecting. totalVMs keeps the 7d retention count (Decision #110) when the fan-out succeeds, falling back to stats.total_vms when it fails.
  • src/components/stats-bar.tsx — explicit error state: em-dash + a subtle "Data unavailable" subtitle when there is no value, instead of coercing undefined to 0.
  • src/hooks/use-overview-stats.ts sets placeholderData: keepPreviousData so last-good values persist through transient refetch failures.
  • scripts/smoke-stats.mjs — smoke test exercising the client's exact data path against the live scheduler API (nonzero totals, node fan-out within 5% of /stats). The deployed site is a client-rendered static export, so verifying the rendered DOM would need headless Chrome — deliberately out of scope (no new deps); limitation documented in the script header.
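
The client.ts change in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the `Vm` and `Overview` shapes, the `total_nodes` field, the injected fetchers, and the idea that the 7d retention count equals the length of the fan-out result are all hypothetical. Only `stats.total_vms` and the `.catch` fallback come from the description.

```typescript
// Assumed shapes; only `total_vms` is named in the PR description.
type Stats = { total_vms: number; total_nodes: number };
type Vm = { id: string; status: "dispatched" | "missing" | "unschedulable" };
type Overview = { totalNodes: number; totalVMs: number; dispatchedVMs: number };

// Pure merge step: `vms === null` means the fan-out failed.
function buildOverview(stats: Stats, vms: Vm[] | null): Overview {
  return {
    totalNodes: stats.total_nodes,
    // Prefer the fan-out count when available, else fall back to the stats total.
    totalVMs: vms !== null ? vms.length : stats.total_vms,
    // Derived breakdowns degrade to empty (0) rather than rejecting.
    dispatchedVMs: vms !== null ? vms.filter((vm) => vm.status === "dispatched").length : 0,
  };
}

async function getOverview(
  fetchStats: () => Promise<Stats>,
  fetchAllVms: () => Promise<Vm[]>,
): Promise<Overview> {
  const [stats, vms] = await Promise.all([
    fetchStats(), // still allowed to reject: there are no headline totals without it
    fetchAllVms().catch(() => null), // a fan-out failure becomes null instead of a rejection
  ]);
  return buildOverview(stats, vms);
}
```

Keeping the merge step pure makes the degraded path easy to unit test without mocking the network.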

Verification

  • pnpm exec tsc --noEmit clean
  • pnpm lint 0 warnings / 0 errors
  • pnpm check:tokens clean
  • pnpm test 332 passed, 13 skipped
  • node scripts/smoke-stats.mjs against https://rust-scheduler.aleph.im: all checks pass (489 nodes, 7837 VMs in the 7d window)
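
For readers unfamiliar with the smoke test's checks (nonzero totals, node fan-out within 5% of /stats), a comparison of that kind might look like the sketch below. The function names and failure messages are hypothetical; the actual script is scripts/smoke-stats.mjs and may be structured differently.

```typescript
// Relative-difference check; the 5% default mirrors the tolerance named in the PR.
function withinTolerance(observed: number, reference: number, tolerance = 0.05): boolean {
  if (reference === 0) return observed === 0; // avoid dividing by zero
  return Math.abs(observed - reference) / reference <= tolerance;
}

// Returns a list of human-readable failures; an empty list means all checks pass.
function checkNodeCounts(fanOutCount: number, statsCount: number): string[] {
  const failures: string[] = [];
  if (statsCount <= 0) failures.push("stats node total is not positive");
  if (fanOutCount <= 0) failures.push("fan-out node total is not positive");
  if (!withinTolerance(fanOutCount, statsCount)) {
    failures.push(`fan-out ${fanOutCount} differs from stats ${statsCount} by more than 5%`);
  }
  return failures;
}
```

A tolerance rather than strict equality is a reasonable choice here because the two endpoints are read at slightly different moments against a live API.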

…API failures

getOverviewStats was an all-or-nothing Promise.all over /stats plus every
page of /vms (~52 requests) and /nodes; one transient failure rejected the
whole query, and StatCard coerced the missing data to a literal "0 / 0".

- client.ts: headline totals come from the single cheap /api/v1/stats call;
  the VM/node page fan-outs get catch fallbacks so derived breakdowns
  degrade to empty instead of rejecting. Total VMs keeps the 7d retention
  count (Decision #110) when the fan-out succeeds and falls back to
  stats.total_vms when it fails.
- stats-bar.tsx: explicit error state — em-dash + "Data unavailable"
  instead of coercing undefined to 0.
- use-overview-stats.ts: placeholderData: keepPreviousData so last-good
  values persist through transient refetch failures.
- scripts/smoke-stats.mjs: smoke test exercising the client's exact data
  path against the live scheduler API (nonzero totals, fan-out within 5%
  of /stats; headless-browser DOM check deliberately out of scope).

Refs marketing backlog P0-03.

@foxpatch-aleph left a comment

Well-structured fix that correctly decouples headline totals from derived breakdowns. The .catch(() => null) on fan-outs and keepPreviousData work together to handle both initial-load and refetch failure scenarios gracefully. Minor edge: if /stats succeeds but both fan-outs fail on initial load, dispatchedVMs degrades to 0 while totalVMs shows the real all-time count — but this self-corrects on the next poll cycle.

src/api/client.ts (line 264): If /api/v1/stats itself fails, the whole Promise.all rejects and keepPreviousData kicks in — correct behavior. But if /stats succeeds and both fan-outs fail on initial load, dispatchedVMs/missingVMs/unschedulableVMs all degrade to 0 while totalNodes/healthyNodes/totalVMs are real. Transient, but worth noting.

src/components/stats-bar.tsx (line 101): The unavailable guard is correctly ordered: must not be loading, must be error state, and value must be undefined (not just 0 or falsy, which distinguishes a legitimate 0 from genuinely missing data). This works well with keepPreviousData — after a successful fetch, value is defined even on the current error, so unavailable stays false.
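
The guard described in this comment can be expressed as a small predicate. The sketch below is an assumption about its shape, with hypothetical names (`CardState`, `isUnavailable`, `renderValue`) and a hypothetical empty string for the loading case; it is not copied from stats-bar.tsx.

```typescript
type CardState = { isLoading: boolean; isError: boolean; value: number | undefined };

// Unavailable only when: not loading, in an error state, and there is no value at all.
// A strict `=== undefined` check keeps a legitimate 0 from being treated as missing.
function isUnavailable({ isLoading, isError, value }: CardState): boolean {
  return !isLoading && isError && value === undefined;
}

function renderValue(state: CardState): string {
  if (isUnavailable(state)) return "\u2014"; // em-dash placeholder instead of a fake 0
  if (state.value === undefined) return ""; // still loading: hypothetical empty placeholder
  return String(state.value);
}
```

With keepPreviousData, a failed refetch after one successful fetch still carries a defined value, so `isUnavailable` stays false and the last-good number keeps rendering.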

src/components/stats-bar.tsx (line 149): Nit: when unavailable is true, the subtitle changes to 'Data unavailable' but the tooltip still shows the original subtitle. Not a real problem since unavailable implies no data to describe, but the tooltip will say something inaccurate. Would be cleaner to suppress the tooltip or show 'Data unavailable' in both places.
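
One way to address this nit is to derive the subtitle and the tooltip from a single helper so they cannot disagree. This is only a sketch of the "show it in both places" option; `cardText` and its return shape are hypothetical and not part of the PR.

```typescript
const UNAVAILABLE_LABEL = "Data unavailable";

// Returns the text for both the visible subtitle and the tooltip.
// When data is unavailable, both use the same label so neither describes missing data.
function cardText(unavailable: boolean, subtitle: string): { subtitle: string; tooltip: string } {
  const text = unavailable ? UNAVAILABLE_LABEL : subtitle;
  return { subtitle: text, tooltip: text };
}
```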
