[consensus] Fix proposer_delay_proposal histogram buckets by danielxiangzl · Pull Request #19474 · aptos-labs/aptos-core

danielxiangzl · 2026-04-16T21:32:59Z

Summary

PROPOSER_DELAY_PROPOSAL was registered via register_avg_counter() (single 0.5s bucket), so histogram_quantile() on aptos_proposer_delay_proposal returned garbage (linear interpolation always yielding ~0.45 when values are in [0, 0.5]).
Switch to a proper register_histogram! with granular seconds buckets (0.001 .. 5.0).
Rename the Prometheus metric to aptos_proposer_delay_proposal_seconds (per review feedback) to avoid bucket-boundary mixing during rolling deploys, and to follow the Prometheus _seconds convention for duration histograms.

Test plan

cargo check -p aptos-consensus passes
cargo xclippy passes
rustfmt --check passes on consensus/src/counters.rs
./scripts/rust_lint.sh (pre-commit hook) passes
Post-deploy: verify aptos_proposer_delay_proposal_seconds_bucket emits in Prometheus and histogram_quantile(0.9, rate(aptos_proposer_delay_proposal_seconds_bucket[5m])) returns sensible values.
Post-deploy: verify rate(aptos_proposer_delay_proposal_seconds_sum[5m]) / rate(aptos_proposer_delay_proposal_seconds_count[5m]) computes the average.

Dashboard updates

dashboards/end-to-end-txn-latency.json — updated the "Proposer delay (due to backpressure)" panel to reference the renamed metric. Note: the original query (aptos_proposer_delay_proposal{...} > 0) referenced the bare histogram name, which Prometheus never emits for histograms; that panel was already broken before this PR and still needs a proper histogram_quantile(...) query rewrite as follow-up.
Any out-of-repo Grafana/alert references to aptos_proposer_delay_proposal will need to be updated to aptos_proposer_delay_proposal_seconds.

Note: PIPELINE_BACKPRESSURE_ON_PROPOSAL_TRIGGERED and EXECUTION_BACKPRESSURE_ON_PROPOSAL_TRIGGERED are intentionally left as register_avg_counter - they observe 0.0/1.0 flags, not durations.

Generated with Claude Code

Note

Medium Risk
Renames and redefines a Prometheus histogram, which can break dashboards/alerts and impact observability during rollout until queries are updated.

Overview
Fixes proposer backpressure delay instrumentation by switching PROPOSER_DELAY_PROPOSAL from an avg-counter-style metric to a real Prometheus duration histogram with granular second buckets.

Renames the emitted metric to aptos_proposer_delay_proposal_seconds and updates the end-to-end-txn-latency Grafana dashboard panel to query the new name.

^{Reviewed by Cursor Bugbot for commit 78e531a. Bugbot is set up for automated code reviews on this repo. Configure here.}

PROPOSER_DELAY_PROPOSAL records a varying delay duration in seconds, but was registered via register_avg_counter() which creates a histogram with only a single bucket at 0.5. This means histogram_quantile() queries on this metric return garbage (linear interpolation inside [0, 0.5] -> always returns 0.45 when values are within range), misleading latency analysis. Change to a proper histogram with granular time buckets so P50/P90/P99 queries work correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Differential Security Review — PR #19474: Fix proposer_delay_proposal histogram buckets

Date: 2026-04-16
Scope: db26bbc35cd5ad7640bcf0cbdd542cd1f348293f..2c62345322035e80739326444e15681bcd1e3a5f

Executive Summary

Severity	Count
CRITICAL	0
HIGH	0
MEDIUM	0
LOW	0

Overall risk: Low. This is a pure metrics registration change; no consensus logic, trust boundaries, or state handling are touched.

Key metrics: 1 file changed, 1 Rust file, 1 module touched (consensus/src/counters.rs), 0 findings.

Recommendation: APPROVE WITH NOTES

What Changed

Commits: db26bbc..2c62345 (1 commit)
Files changed: 1 (1 Rust) | Lines: +9 / -2

Module	Files Changed	Risk Level
`consensus/src/counters.rs`	1	Low

The PROPOSER_DELAY_PROPOSAL static was registered via register_avg_counter(), which internally creates a histogram with a single [0, 0.5] bucket. This made histogram_quantile() useless — with only one finite bucket, Prometheus linear-interpolates within [0, 0.5], always returning ≈ quantile × 0.5 regardless of actual delays. The fix switches to register_histogram! with 11 buckets spanning 1ms–5s.

Findings

No security-relevant findings identified.

Pattern Scan Results

Pattern	Result
Removed `ensure!`/`assert!`/bounds checks	None
New `unsafe` blocks	None
Visibility escalation (`pub(crate)` → `pub`)	None
New `.unwrap()` / `.expect()` on untrusted data	Not applicable — `register_histogram!` panics only on duplicate metric name registration (a programmer error at init), not on runtime untrusted input; consistent with `#![allow(clippy::unwrap_used)]` and all other `register_histogram!` calls in the file
Non-determinism (`HashMap`, `SystemTime`, float in execution)	None
Signer / auth removal	None
Gas / cost model changes	None

Semantic Analysis

Call site (proposal_generator.rs:615): PROPOSER_DELAY_PROPOSAL.observe(proposal_delay.as_secs_f64()) — unchanged.

proposal_delay value range: computed as the max of all active backpressure/backoff levels:

backpressure_proposal_delay_ms: configurable, default max 300 ms
backoff_proposal_delay_ms: configurable, default max 300 ms
Zero when no backpressure is active (also observed, correctly landing in the le="0.001" bucket)

All default config values (50 ms – 300 ms) fall cleanly between the 0.25 and 0.5 buckets, with substantial resolution. The 5.0 s upper bound is 16× the current default maximum; any future custom configuration setting delays above 5 s would accumulate in +Inf (expected histogram behavior, not a correctness bug).

PIPELINE_BACKPRESSURE_ON_PROPOSAL_TRIGGERED and EXECUTION_BACKPRESSURE_ON_PROPOSAL_TRIGGERED remain as register_avg_counter — they observe 0.0/1.0 boolean flags, where a single 0.5 s bucket is appropriate for computing an activation rate via sum/count. This deliberate asymmetry is correctly documented in the PR summary.

Blast Radius

PROPOSER_DELAY_PROPOSAL has exactly 1 non-test observe site (proposal_generator.rs:615). No callers are affected — this is a metrics sink, not a data source for any decision logic.

Dashboard note (non-blocking, pre-existing): The panel "Proposer delay (due to backpressure)" in dashboards/end-to-end-txn-latency.json (line 708) queries the bare metric name aptos_proposer_delay_proposal{...} > 0. Prometheus histograms expose _bucket, _sum, and _count suffixes — the bare name returns no series in standard PromQL. This was already the case with register_avg_counter and is unchanged by this PR. The PR's own test plan notes the correct post-deploy verification query (aptos_proposer_delay_proposal_bucket). Updating the dashboard to use histogram_quantile(0.9, rate(aptos_proposer_delay_proposal_bucket[5m])) would complete the intent of this fix.

Test Coverage

Changed Symbol	Coverage	Notes
`PROPOSER_DELAY_PROPOSAL` registration	No unit test (consistent with all other counters in this file)	Metric registration correctness is validated at startup; no runtime security path

No coverage gap with security implications.

Historical Context

git log --all --oneline -- consensus/src/counters.rs | grep -iE 'fix|security|cve' — no prior security-fix commits touching this specific metric. No regression risk.

👀

_{Sent by Cursor Automation: Security Review Bot}

ibalajiarun · 2026-04-17T04:56:22Z

+/// instead of linear-interpolation artifacts.
 pub static PROPOSER_DELAY_PROPOSAL: Lazy<Histogram> = Lazy::new(|| {
-    register_avg_counter(
+    register_histogram!(


I would suggest creating a new metric altogether otherwise rollout will be a mess.

Avoid bucket-boundary mixing during rolling deploy by using a new metric name rather than repurposing the old one. Also follows the Prometheus `_seconds` convention for duration histograms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 48efa5f. Configure here.}

cursor · 2026-04-20T20:02:43Z

-    register_avg_counter(
-        "aptos_proposer_delay_proposal",
+    register_histogram!(
+        "aptos_proposer_delay_proposal_seconds",


Metric name changed despite PR claiming it's unchanged

High Severity

The metric name changed from "aptos_proposer_delay_proposal" to "aptos_proposer_delay_proposal_seconds", contradicting the PR description's claim that "Metric name is unchanged." This is a breaking change: any existing external dashboards, alerts, or Prometheus queries referencing aptos_proposer_delay_proposal (including _sum, _count, _bucket variants) will silently stop receiving data. During rolling deployments, nodes will emit different metric names, fragmenting aggregation queries.

Additional Locations (1)

dashboards/end-to-end-txn-latency.json#L707-L708

^{Reviewed by Cursor Bugbot for commit 48efa5f. Configure here.}

github-actions · 2026-04-20T21:27:38Z

✅ Forge suite `compat` success on `ca049383dd80675149ef2d0042668964f9f9107a` ==> `78e531a55613471a9a92166949b5a49c33c760cc`

Compatibility test results for ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc (PR)
1. Check liveness of validators at old version: ca049383dd80675149ef2d0042668964f9f9107a
compatibility::simple-validator-upgrade::liveness-check : committed: 14298.66 txn/s, latency: 2425.49 ms, (p50: 2400 ms, p70: 2800, p90: 3100 ms, p99: 3600 ms), latency samples: 468360
2. Upgrading first Validator to new version: 78e531a55613471a9a92166949b5a49c33c760cc
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6529.84 txn/s, latency: 5174.22 ms, (p50: 5700 ms, p70: 5800, p90: 5900 ms, p99: 6200 ms), latency samples: 223880
3. Upgrading rest of first batch to new version: 78e531a55613471a9a92166949b5a49c33c760cc
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6277.29 txn/s, latency: 5430.85 ms, (p50: 5900 ms, p70: 6000, p90: 6100 ms, p99: 6200 ms), latency samples: 218960
4. upgrading second batch to new version: 78e531a55613471a9a92166949b5a49c33c760cc
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10934.42 txn/s, latency: 2936.45 ms, (p50: 3100 ms, p70: 3200, p90: 3400 ms, p99: 3700 ms), latency samples: 361140
5. check swarm health
Compatibility test for ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc passed
Test Ok

github-actions · 2026-04-20T21:31:01Z

✅ Forge suite `realistic_env_max_load` success on `78e531a55613471a9a92166949b5a49c33c760cc`

two traffics test: inner traffic : committed: 13522.67 txn/s, latency: 1345.98 ms, (p50: 1500 ms, p70: 1500, p90: 1800 ms, p99: 2100 ms), latency samples: 5049660
two traffics test : committed: 100.02 txn/s, latency: 617.42 ms, (p50: 500 ms, p70: 700, p90: 800 ms, p99: 1000 ms), latency samples: 1720
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 0.693, avg: 0.667", "ConsensusProposalToOrdered: max: 0.119, avg: 0.109", "ConsensusOrderedToCommit: max: 0.174, avg: 0.161", "ConsensusProposalToCommit: max: 0.280, avg: 0.270"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.77s no progress at version 53022 (avg 0.06s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.29s no progress at version 2481674 (avg 0.29s) [limit 16].
Test Ok

github-actions · 2026-04-20T21:36:11Z

✅ Forge suite `framework_upgrade` success on `ca049383dd80675149ef2d0042668964f9f9107a` ==> `78e531a55613471a9a92166949b5a49c33c760cc`

Compatibility test results for ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc (PR)
Upgrade the nodes to version: 78e531a55613471a9a92166949b5a49c33c760cc
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2521.14 txn/s, submitted: 2528.94 txn/s, failed submission: 7.80 txn/s, expired: 7.80 txn/s, latency: 1172.40 ms, (p50: 1200 ms, p70: 1200, p90: 1500 ms, p99: 2100 ms), latency samples: 226380
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2444.45 txn/s, submitted: 2451.79 txn/s, failed submission: 7.33 txn/s, expired: 7.33 txn/s, latency: 1201.69 ms, (p50: 1200 ms, p70: 1200, p90: 1500 ms, p99: 2300 ms), latency samples: 219960
5. check swarm health
Compatibility test for ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc passed
Upgrade the remaining nodes to version: 78e531a55613471a9a92166949b5a49c33c760cc
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2417.43 txn/s, submitted: 2425.51 txn/s, failed submission: 8.08 txn/s, expired: 8.08 txn/s, latency: 1342.43 ms, (p50: 1200 ms, p70: 1300, p90: 1800 ms, p99: 10800 ms), latency samples: 215461
Test Ok

danielxiangzl marked this pull request as ready for review April 16, 2026 21:35

danielxiangzl requested review from ibalajiarun, sasha8 and zekun000 as code owners April 16, 2026 21:35

zekun000 approved these changes Apr 16, 2026

View reviewed changes

cursor Bot reviewed Apr 16, 2026

View reviewed changes

danielxiangzl requested a review from grao1991 April 16, 2026 21:46

ibalajiarun reviewed Apr 17, 2026

View reviewed changes

danielxiangzl requested a review from a team as a code owner April 20, 2026 19:59

cursor Bot reviewed Apr 20, 2026

View reviewed changes

ibalajiarun approved these changes Apr 20, 2026

View reviewed changes

Merge branch 'main' into daniel/fix-proposer-delay-histogram

78e531a

danielxiangzl enabled auto-merge (squash) April 20, 2026 20:53

This comment has been minimized.

Sign in to view

danielxiangzl merged commit dd42df2 into main Apr 20, 2026
103 checks passed

danielxiangzl deleted the daniel/fix-proposer-delay-histogram branch April 20, 2026 21:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[consensus] Fix proposer_delay_proposal histogram buckets#19474

[consensus] Fix proposer_delay_proposal histogram buckets#19474
danielxiangzl merged 3 commits intomainfrom
daniel/fix-proposer-delay-histogram

danielxiangzl commented Apr 16, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

ibalajiarun Apr 17, 2026

Uh oh!

danielxiangzl Apr 20, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Apr 20, 2026

Uh oh!

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

github-actions Bot commented Apr 20, 2026

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

danielxiangzl commented Apr 16, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Dashboard updates

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Differential Security Review — PR #19474: Fix proposer_delay_proposal histogram buckets

Executive Summary

What Changed

Findings

Pattern Scan Results

Semantic Analysis

Blast Radius

Test Coverage

Historical Context

Uh oh!

ibalajiarun Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

danielxiangzl Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Apr 20, 2026

Choose a reason for hiding this comment

Metric name changed despite PR claiming it's unchanged

Uh oh!

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

github-actions Bot commented Apr 20, 2026

✅ Forge suite compat success on ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc

Uh oh!

github-actions Bot commented Apr 20, 2026

✅ Forge suite realistic_env_max_load success on 78e531a55613471a9a92166949b5a49c33c760cc

Uh oh!

github-actions Bot commented Apr 20, 2026

✅ Forge suite framework_upgrade success on ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

danielxiangzl commented Apr 16, 2026 •

edited by cursor Bot

Loading

✅ Forge suite `compat` success on `ca049383dd80675149ef2d0042668964f9f9107a` ==> `78e531a55613471a9a92166949b5a49c33c760cc`

✅ Forge suite `realistic_env_max_load` success on `78e531a55613471a9a92166949b5a49c33c760cc`

✅ Forge suite `framework_upgrade` success on `ca049383dd80675149ef2d0042668964f9f9107a` ==> `78e531a55613471a9a92166949b5a49c33c760cc`