fix(stats): degrade gracefully instead of rendering 0/0 on transient API failures #139
moshemalawach wants to merge 1 commit into
…API failures

getOverviewStats was an all-or-nothing Promise.all over /stats plus every page of /vms (~52 requests) and /nodes; one transient failure rejected the whole query, and StatCard coerced the missing data to a literal "0 / 0".

- client.ts: headline totals come from the single cheap /api/v1/stats call; the VM/node page fan-outs get catch fallbacks so derived breakdowns degrade to empty instead of rejecting. Total VMs keeps the 7d retention count (Decision #110) when the fan-out succeeds and falls back to stats.total_vms when it fails.
- stats-bar.tsx: explicit error state (em-dash + "Data unavailable") instead of coercing undefined to 0.
- use-overview-stats.ts: placeholderData: keepPreviousData so last-good values persist through transient refetch failures.
- scripts/smoke-stats.mjs: smoke test exercising the client's exact data path against the live scheduler API (nonzero totals, fan-out within 5% of /stats; headless-browser DOM check deliberately out of scope).

Refs marketing backlog P0-03.
foxpatch-aleph left a comment
Well-structured fix that correctly decouples headline totals from derived breakdowns. The .catch(() => null) on fan-outs and keepPreviousData work together to handle both initial-load and refetch failure scenarios gracefully. Minor edge: if /stats succeeds but both fan-outs fail on initial load, dispatchedVMs degrades to 0 while totalVMs shows the real all-time count — but this self-corrects on the next poll cycle.
src/api/client.ts (line 264): If /api/v1/stats itself fails, the whole Promise.all rejects and keepPreviousData kicks in — correct behavior. But if /stats succeeds and both fan-outs fail on initial load, dispatchedVMs/missingVMs/unschedulableVMs all degrade to 0 while totalNodes/healthyNodes/totalVMs are real. Transient, but worth noting.
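The failure modes described in this comment can be sketched as follows. This is a minimal illustration, not the code in src/api/client.ts: the `Stats` and `Vm` shapes, the `total_nodes` and `healthy_nodes` field names, the status strings, and the injected `fetchStats` / `fetchAllVms` functions are all assumptions made so the example is self-contained. Only `stats.total_vms` and the `.catch(() => null)` fallback come from the PR text.

```typescript
// Hypothetical shapes; the real API types are not shown on this page.
interface Stats { total_vms: number; total_nodes: number; healthy_nodes: number }
interface Vm { status: "dispatched" | "missing" | "unschedulable" }

interface Overview {
  totalVMs: number;
  totalNodes: number;
  healthyNodes: number;
  dispatchedVMs: number;
  missingVMs: number;
  unschedulableVMs: number;
}

async function loadOverview(
  fetchStats: () => Promise<Stats>,   // stands in for the single /api/v1/stats call
  fetchAllVms: () => Promise<Vm[]>,   // stands in for the paged /api/v1/vms fan-out
): Promise<Overview> {
  const [stats, vms] = await Promise.all([
    fetchStats(),                     // no fallback: a failure here rejects the whole call
    fetchAllVms().catch(() => null),  // fan-out failure degrades to null instead of rejecting
  ]);

  // Derived breakdowns: empty (all zeros) when the fan-out failed.
  const count = (status: Vm["status"]): number =>
    (vms ?? []).filter((vm) => vm.status === status).length;

  return {
    // Fan-out count when available, otherwise the total reported by /stats.
    totalVMs: vms !== null ? vms.length : stats.total_vms,
    totalNodes: stats.total_nodes,
    healthyNodes: stats.healthy_nodes,
    dispatchedVMs: count("dispatched"),
    missingVMs: count("missing"),
    unschedulableVMs: count("unschedulable"),
  };
}
```

With a failing `fetchAllVms`, the result carries real headline totals next to zeroed breakdowns, which is the transient mismatch noted above. With a failing `fetchStats`, the promise rejects and the caller's previous data, if any, is what remains on screen.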
src/components/stats-bar.tsx (line 101): The unavailable guard is correctly ordered: it must not be loading, it must be in the error state, and value must be strictly undefined (not merely 0 or falsy), which distinguishes a legitimate 0 from genuinely missing data. This works well with keepPreviousData: after one successful fetch, value stays defined even when the current fetch errors, so unavailable stays false.
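A minimal sketch of the guard described in this comment, assuming a simplified state object. `StatQueryState`, `isUnavailable`, and `displayValue` are hypothetical names for illustration, and the actual component code in stats-bar.tsx may be structured differently.

```typescript
// Hypothetical, simplified view of what a stat card receives from its query.
interface StatQueryState {
  isLoading: boolean;
  isError: boolean;
  value: number | undefined;
}

// True only when the query has settled with an error and no value
// (fresh or previously kept) is available to show.
function isUnavailable({ isLoading, isError, value }: StatQueryState): boolean {
  return !isLoading && isError && value === undefined;
}

function displayValue(state: StatQueryState): string {
  if (isUnavailable(state)) return "\u2014"; // em-dash placeholder instead of a fake 0
  // Assumption: show an empty string while no value has arrived yet.
  return state.value === undefined ? "" : String(state.value);
}
```

The strict `=== undefined` comparison is the important part: a truthiness check such as `!value` would also treat a legitimate count of 0 as missing data.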
src/components/stats-bar.tsx (line 149): Nit: when unavailable is true, the subtitle changes to 'Data unavailable' but the tooltip still shows the original subtitle. Not a real problem since unavailable implies no data to describe, but the tooltip will say something inaccurate. Would be cleaner to suppress the tooltip or show 'Data unavailable' in both places.
Summary
Fixes the Overview dashboard intermittently rendering 0 nodes / 0 VMs (marketing backlog P0-03).
Root cause:
getOverviewStats() was an all-or-nothing Promise.all over /api/v1/stats plus every page of /api/v1/vms (~52 requests) and /api/v1/nodes. One transient failure rejected the whole query, and StatCard coerced the missing data into a literal "0 / 0".
Changes
- client.ts: headline totals come from the single cheap /api/v1/stats call; the VM/node page fan-outs get .catch fallbacks so derived breakdowns degrade to empty instead of rejecting. totalVMs keeps the 7d retention count (Decision #110) when the fan-out succeeds, falling back to stats.total_vms when it fails.
- stats-bar.tsx: explicit error state (em-dash + "Data unavailable") instead of coercing undefined to 0.
- use-overview-stats.ts: placeholderData: keepPreviousData so last-good values persist through transient refetch failures.
- scripts/smoke-stats.mjs: smoke test exercising the client's exact data path against the live scheduler API (nonzero totals, fan-out within 5% of /stats). The deployed site is a client-rendered static export, so verifying the rendered DOM would need headless Chrome, which is deliberately out of scope (no new deps); the limitation is documented in the script header.
Verification
- pnpm exec tsc --noEmit: clean
- pnpm lint: 0 warnings / 0 errors
- pnpm check:tokens: clean
- pnpm test: 332 passed, 13 skipped
- node scripts/smoke-stats.mjs against https://rust-scheduler.aleph.im: all checks pass (489 nodes, 7837 VMs in the 7d window)
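The "within 5%" check mentioned for the smoke test could look roughly like the sketch below. The helper name `withinTolerance` and its handling of a zero reference are assumptions; the PR page does not show how scripts/smoke-stats.mjs implements the comparison or exactly which counts it compares.

```typescript
// Relative-difference check: is `observed` within `tolerance` (default 5%)
// of `reference`? Hypothetical helper, not taken from the script itself.
function withinTolerance(observed: number, reference: number, tolerance = 0.05): boolean {
  if (reference === 0) return observed === 0; // avoid dividing by zero
  return Math.abs(observed - reference) / Math.abs(reference) <= tolerance;
}

// Illustrative numbers only, not measurements from the live API.
console.log(withinTolerance(104, 100)); // true: 4% off
console.log(withinTolerance(106, 100)); // false: 6% off
```

A relative tolerance rather than strict equality fits a live system, where the fan-out and the /stats call are made at slightly different moments and the counts can drift between them.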