[consensus] Fix proposer_delay_proposal histogram buckets#19474
danielxiangzl merged 3 commits into main
PROPOSER_DELAY_PROPOSAL records a varying delay duration in seconds, but was registered via register_avg_counter(), which creates a histogram with only a single bucket at 0.5. This means histogram_quantile() queries on this metric return garbage: with linear interpolation inside [0, 0.5], the result is always quantile × 0.5 when values are within range (P90 always returns 0.45), which misleads latency analysis. Change to a proper histogram with granular time buckets so P50/P90/P99 queries work correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Differential Security Review — PR #19474: Fix proposer_delay_proposal histogram buckets
Date: 2026-04-16
Scope: db26bbc35cd5ad7640bcf0cbdd542cd1f348293f..2c62345322035e80739326444e15681bcd1e3a5f
Executive Summary
| Severity | Count |
|---|---|
| CRITICAL | 0 |
| HIGH | 0 |
| MEDIUM | 0 |
| LOW | 0 |
Overall risk: Low. This is a pure metrics registration change; no consensus logic, trust boundaries, or state handling are touched.
Key metrics: 1 file changed, 1 Rust file, 1 module touched (consensus/src/counters.rs), 0 findings.
Recommendation: APPROVE WITH NOTES
What Changed
Commits: db26bbc..2c62345 (1 commit)
Files changed: 1 (1 Rust) | Lines: +9 / -2
| Module | Files Changed | Risk Level |
|---|---|---|
| consensus/src/counters.rs | 1 | Low |
The PROPOSER_DELAY_PROPOSAL static was registered via register_avg_counter(), which internally creates a histogram with a single [0, 0.5] bucket. This made histogram_quantile() useless — with only one finite bucket, Prometheus linear-interpolates within [0, 0.5], always returning ≈ quantile × 0.5 regardless of actual delays. The fix switches to register_histogram! with 11 buckets spanning 1ms–5s.
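The interpolation artifact described above can be reproduced with a simplified model of how classic-histogram quantiles are estimated. This is only a sketch: `Bucket` and this `histogram_quantile` function are illustrative names written for this note, not Prometheus or aptos-core code, and the granular layout used in the usage note is an assumption.

```rust
/// One cumulative histogram bucket: the number of observations <= `le`.
struct Bucket {
    le: f64,
    cumulative_count: u64,
}

/// Simplified quantile estimate over cumulative buckets sorted by `le`,
/// with the last bucket being +Inf. It interpolates linearly inside the
/// bucket that contains the requested rank.
fn histogram_quantile(q: f64, buckets: &[Bucket]) -> f64 {
    let total = buckets.last().map(|b| b.cumulative_count).unwrap_or(0);
    if total == 0 {
        return f64::NAN;
    }
    let rank = q * total as f64;
    let mut lower_bound = 0.0;
    let mut lower_count = 0u64;
    for (i, b) in buckets.iter().enumerate() {
        if b.cumulative_count as f64 >= rank {
            if b.le.is_infinite() {
                // The rank falls in +Inf: report the highest finite bound.
                return if i > 0 { buckets[i - 1].le } else { f64::NAN };
            }
            let in_bucket = b.cumulative_count.saturating_sub(lower_count) as f64;
            if in_bucket == 0.0 {
                return lower_bound;
            }
            let offset = (rank - lower_count as f64) / in_bucket;
            return lower_bound + (b.le - lower_bound) * offset;
        }
        lower_bound = b.le;
        lower_count = b.cumulative_count;
    }
    f64::NAN
}
```

With a single finite bucket at 0.5 holding every observation, the estimate depends only on the quantile: 0.9 gives 0.45 and 0.5 gives 0.25, whatever the real delays were. With a finer layout (for example, hypothetical bounds at 0.001 and 0.005), observations of about 3 ms yield an estimate inside (0.001, 0.005].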
Findings
No security-relevant findings identified.
Pattern Scan Results
| Pattern | Result |
|---|---|
| Removed ensure!/assert!/bounds checks | None |
| New unsafe blocks | None |
| Visibility escalation (pub(crate) → pub) | None |
| New .unwrap() / .expect() on untrusted data | Not applicable: register_histogram! panics only on duplicate metric name registration (a programmer error at init), not on runtime untrusted input; consistent with #![allow(clippy::unwrap_used)] and all other register_histogram! calls in the file |
| Non-determinism (HashMap, SystemTime, float in execution) | None |
| Signer / auth removal | None |
| Gas / cost model changes | None |
Semantic Analysis
Call site (proposal_generator.rs:615): PROPOSER_DELAY_PROPOSAL.observe(proposal_delay.as_secs_f64()) — unchanged.
proposal_delay value range: computed as the max of all active backpressure/backoff levels:
- backpressure_proposal_delay_ms: configurable, default max 300 ms
- backoff_proposal_delay_ms: configurable, default max 300 ms
- Zero when no backpressure is active (also observed, correctly landing in the le="0.001" bucket)
All default config values (50 ms – 300 ms) fall well inside the finite bucket range, with substantial resolution; the 300 ms default maximum lands between the 0.25 and 0.5 buckets. The 5.0 s upper bound is more than 16× the current default maximum; any future custom configuration setting delays above 5 s would accumulate in +Inf (expected histogram behavior, not a correctness bug).
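To make the bucket placement concrete, here is a minimal sketch of how an observed delay maps to a cumulative bucket. The exact boundaries are not shown in this review, so `ASSUMED_BUCKETS` is a hypothetical 11-entry layout that merely matches the stated facts (range 0.001 to 5.0, with bounds at 0.25 and 0.5); `bucket_for` is an illustrative helper, not part of the PR.

```rust
use std::time::Duration;

/// Hypothetical bucket bounds in seconds (an assumption for illustration;
/// the PR only states 11 buckets spanning 0.001 .. 5.0).
const ASSUMED_BUCKETS: [f64; 11] = [
    0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0,
];

/// Returns the upper bound (`le`) of the first bucket that contains the
/// delay, or +Inf when the delay exceeds every finite bound.
fn bucket_for(delay: Duration, bounds: &[f64]) -> f64 {
    let secs = delay.as_secs_f64();
    bounds
        .iter()
        .copied()
        .find(|&le| secs <= le)
        .unwrap_or(f64::INFINITY)
}
```

Under these assumed bounds, a zero delay lands in le="0.001", the 300 ms default maximum lands in le="0.5", and a 6 s delay falls through to +Inf.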
PIPELINE_BACKPRESSURE_ON_PROPOSAL_TRIGGERED and EXECUTION_BACKPRESSURE_ON_PROPOSAL_TRIGGERED remain as register_avg_counter: they observe 0.0/1.0 boolean flags, where a single 0.5 bucket is sufficient because the activation rate is computed via sum/count rather than from bucket counts. This deliberate asymmetry is correctly documented in the PR summary.
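The sum/count reasoning can be sketched with a stand-in for the `_sum` and `_count` series. `SumCount` is a hypothetical type written for this note and is not how register_avg_counter is implemented; it only shows why bucket layout is irrelevant when the flag values are 0.0 or 1.0.

```rust
/// Minimal stand-in for a histogram's `_sum` and `_count` series.
#[derive(Default)]
struct SumCount {
    sum: f64,
    count: u64,
}

impl SumCount {
    /// Records one observation (1.0 = backpressure triggered, 0.0 = not).
    fn observe(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }

    /// Average of all observations; for 0.0/1.0 flags this is the
    /// fraction of proposals on which the flag was set.
    fn average(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}
```

Observing 1.0 three times and 0.0 once gives an average of 0.75, i.e. the flag was active on 75% of observations, and no bucket boundary enters the calculation.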
Blast Radius
PROPOSER_DELAY_PROPOSAL has exactly 1 non-test observe site (proposal_generator.rs:615). No callers are affected — this is a metrics sink, not a data source for any decision logic.
Dashboard note (non-blocking, pre-existing): The panel "Proposer delay (due to backpressure)" in dashboards/end-to-end-txn-latency.json (line 708) queries the bare metric name aptos_proposer_delay_proposal{...} > 0. Prometheus histograms expose _bucket, _sum, and _count suffixes — the bare name returns no series in standard PromQL. This was already the case with register_avg_counter and is unchanged by this PR. The PR's own test plan notes the correct post-deploy verification query (aptos_proposer_delay_proposal_bucket). Updating the dashboard to use histogram_quantile(0.9, rate(aptos_proposer_delay_proposal_bucket[5m])) would complete the intent of this fix.
Test Coverage
| Changed Symbol | Coverage | Notes |
|---|---|---|
| PROPOSER_DELAY_PROPOSAL registration | No unit test (consistent with all other counters in this file) | Metric registration correctness is validated at startup; no runtime security path |
No coverage gap with security implications.
Historical Context
git log --all --oneline -- consensus/src/counters.rs | grep -iE 'fix|security|cve' — no prior security-fix commits touching this specific metric. No regression risk.
Sent by Cursor Automation: Security Review Bot
      /// instead of linear-interpolation artifacts.
      pub static PROPOSER_DELAY_PROPOSAL: Lazy<Histogram> = Lazy::new(|| {
    -     register_avg_counter(
    +     register_histogram!(
I would suggest creating a new metric altogether; otherwise the rollout will be a mess.
Avoid bucket-boundary mixing during rolling deploy by using a new metric name rather than repurposing the old one. Also follows the Prometheus `_seconds` convention for duration histograms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 48efa5f.
    -     register_avg_counter(
    -         "aptos_proposer_delay_proposal",
    +     register_histogram!(
    +         "aptos_proposer_delay_proposal_seconds",
Metric name changed despite PR claiming it's unchanged
High Severity
The metric name changed from "aptos_proposer_delay_proposal" to "aptos_proposer_delay_proposal_seconds", contradicting the PR description's claim that "Metric name is unchanged." This is a breaking change: any existing external dashboards, alerts, or Prometheus queries referencing aptos_proposer_delay_proposal (including _sum, _count, _bucket variants) will silently stop receiving data. During rolling deployments, nodes will emit different metric names, fragmenting aggregation queries.
✅ Forge suite
Summary
- PROPOSER_DELAY_PROPOSAL was registered via register_avg_counter() (single 0.5s bucket), so histogram_quantile() on aptos_proposer_delay_proposal returned garbage (linear interpolation inside the single bucket, e.g. P90 always yielding ~0.45 when values are in [0, 0.5]).
- Switched to register_histogram! with granular seconds buckets (0.001 .. 5.0).
- Renamed the metric to aptos_proposer_delay_proposal_seconds (per review feedback) to avoid bucket-boundary mixing during rolling deploys, and to follow the Prometheus _seconds convention for duration histograms.

Test plan
- cargo check -p aptos-consensus passes
- cargo xclippy passes
- --check passes on consensus/src/counters.rs
- ./scripts/rust_lint.sh (pre-commit hook) passes
- Post-deploy: verify that aptos_proposer_delay_proposal_seconds_bucket emits in Prometheus and histogram_quantile(0.9, rate(aptos_proposer_delay_proposal_seconds_bucket[5m])) returns sensible values.
- rate(aptos_proposer_delay_proposal_seconds_sum[5m]) / rate(aptos_proposer_delay_proposal_seconds_count[5m]) computes the average.

Dashboard updates
- dashboards/end-to-end-txn-latency.json: updated the "Proposer delay (due to backpressure)" panel to reference the renamed metric. Note: the original query (aptos_proposer_delay_proposal{...} > 0) referenced the bare histogram name, which Prometheus never emits for histograms; that panel was already broken before this PR and still needs a proper histogram_quantile(...) query rewrite as follow-up.
- Any other dashboards or alerts referencing aptos_proposer_delay_proposal will need to be updated to aptos_proposer_delay_proposal_seconds.

Note:
- PIPELINE_BACKPRESSURE_ON_PROPOSAL_TRIGGERED and EXECUTION_BACKPRESSURE_ON_PROPOSAL_TRIGGERED are intentionally left as register_avg_counter: they observe 0.0/1.0 flags, not durations.

Generated with Claude Code
Note
Medium Risk
Renames and redefines a Prometheus histogram, which can break dashboards/alerts and impact observability during rollout until queries are updated.
Overview
Fixes proposer backpressure delay instrumentation by switching PROPOSER_DELAY_PROPOSAL from an avg-counter-style metric to a real Prometheus duration histogram with granular second buckets.
Renames the emitted metric to aptos_proposer_delay_proposal_seconds and updates the end-to-end-txn-latency Grafana dashboard panel to query the new name.
Reviewed by Cursor Bugbot for commit 78e531a. Bugbot is set up for automated code reviews on this repo.