Skip to content

[consensus] Fix proposer_delay_proposal histogram buckets#19474

Merged
danielxiangzl merged 3 commits intomainfrom
daniel/fix-proposer-delay-histogram
Apr 20, 2026
Merged

[consensus] Fix proposer_delay_proposal histogram buckets#19474
danielxiangzl merged 3 commits intomainfrom
daniel/fix-proposer-delay-histogram

Conversation

@danielxiangzl
Copy link
Copy Markdown
Contributor

@danielxiangzl danielxiangzl commented Apr 16, 2026

Summary

  • PROPOSER_DELAY_PROPOSAL was registered via register_avg_counter() (single 0.5s bucket), so histogram_quantile() on aptos_proposer_delay_proposal returned garbage (linear interpolation always yielding ~0.45 when values are in [0, 0.5]).
  • Switch to a proper register_histogram! with granular seconds buckets (0.001 .. 5.0).
  • Rename the Prometheus metric to aptos_proposer_delay_proposal_seconds (per review feedback) to avoid bucket-boundary mixing during rolling deploys, and to follow the Prometheus _seconds convention for duration histograms.

Test plan

  • cargo check -p aptos-consensus passes
  • cargo xclippy passes
  • rustfmt --check passes on consensus/src/counters.rs
  • ./scripts/rust_lint.sh (pre-commit hook) passes
  • Post-deploy: verify aptos_proposer_delay_proposal_seconds_bucket emits in Prometheus and histogram_quantile(0.9, rate(aptos_proposer_delay_proposal_seconds_bucket[5m])) returns sensible values.
  • Post-deploy: verify rate(aptos_proposer_delay_proposal_seconds_sum[5m]) / rate(aptos_proposer_delay_proposal_seconds_count[5m]) computes the average.

Dashboard updates

  • dashboards/end-to-end-txn-latency.json — updated the "Proposer delay (due to backpressure)" panel to reference the renamed metric. Note: the original query (aptos_proposer_delay_proposal{...} > 0) referenced the bare histogram name, which Prometheus never emits for histograms; that panel was already broken before this PR and still needs a proper histogram_quantile(...) query rewrite as follow-up.
  • Any out-of-repo Grafana/alert references to aptos_proposer_delay_proposal will need to be updated to aptos_proposer_delay_proposal_seconds.

Note: PIPELINE_BACKPRESSURE_ON_PROPOSAL_TRIGGERED and EXECUTION_BACKPRESSURE_ON_PROPOSAL_TRIGGERED are intentionally left as register_avg_counter - they observe 0.0/1.0 flags, not durations.

Generated with Claude Code


Note

Medium Risk
Renames and redefines a Prometheus histogram, which can break dashboards/alerts and impact observability during rollout until queries are updated.

Overview
Fixes proposer backpressure delay instrumentation by switching PROPOSER_DELAY_PROPOSAL from an avg-counter-style metric to a real Prometheus duration histogram with granular second buckets.

Renames the emitted metric to aptos_proposer_delay_proposal_seconds and updates the end-to-end-txn-latency Grafana dashboard panel to query the new name.

Reviewed by Cursor Bugbot for commit 78e531a. Bugbot is set up for automated code reviews on this repo. Configure here.

PROPOSER_DELAY_PROPOSAL records a varying delay duration in seconds, but
was registered via register_avg_counter() which creates a histogram with
only a single bucket at 0.5. This means histogram_quantile() queries on
this metric return garbage (linear interpolation inside [0, 0.5] -> always
returns 0.45 when values are within range), misleading latency analysis.

Change to a proper histogram with granular time buckets so P50/P90/P99
queries work correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@danielxiangzl danielxiangzl marked this pull request as ready for review April 16, 2026 21:35
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Differential Security Review — PR #19474: Fix proposer_delay_proposal histogram buckets

Date: 2026-04-16
Scope: db26bbc35cd5ad7640bcf0cbdd542cd1f348293f..2c62345322035e80739326444e15681bcd1e3a5f

Executive Summary

Severity Count
CRITICAL 0
HIGH 0
MEDIUM 0
LOW 0

Overall risk: Low. This is a pure metrics registration change; no consensus logic, trust boundaries, or state handling are touched.

Key metrics: 1 file changed, 1 Rust file, 1 module touched (consensus/src/counters.rs), 0 findings.

Recommendation: APPROVE WITH NOTES


What Changed

Commits: db26bbc..2c62345 (1 commit)
Files changed: 1 (1 Rust) | Lines: +9 / -2

Module Files Changed Risk Level
consensus/src/counters.rs 1 Low

The PROPOSER_DELAY_PROPOSAL static was registered via register_avg_counter(), which internally creates a histogram with a single [0, 0.5] bucket. This made histogram_quantile() useless — with only one finite bucket, Prometheus linear-interpolates within [0, 0.5], always returning ≈ quantile × 0.5 regardless of actual delays. The fix switches to register_histogram! with 11 buckets spanning 1ms–5s.


Findings

No security-relevant findings identified.


Pattern Scan Results

Pattern Result
Removed ensure!/assert!/bounds checks None
New unsafe blocks None
Visibility escalation (pub(crate)pub) None
New .unwrap() / .expect() on untrusted data Not applicable — register_histogram! panics only on duplicate metric name registration (a programmer error at init), not on runtime untrusted input; consistent with #![allow(clippy::unwrap_used)] and all other register_histogram! calls in the file
Non-determinism (HashMap, SystemTime, float in execution) None
Signer / auth removal None
Gas / cost model changes None

Semantic Analysis

Call site (proposal_generator.rs:615): PROPOSER_DELAY_PROPOSAL.observe(proposal_delay.as_secs_f64()) — unchanged.

proposal_delay value range: computed as the max of all active backpressure/backoff levels:

  • backpressure_proposal_delay_ms: configurable, default max 300 ms
  • backoff_proposal_delay_ms: configurable, default max 300 ms
  • Zero when no backpressure is active (also observed, correctly landing in the le="0.001" bucket)

All default config values (50 ms – 300 ms) fall cleanly between the 0.25 and 0.5 buckets, with substantial resolution. The 5.0 s upper bound is 16× the current default maximum; any future custom configuration setting delays above 5 s would accumulate in +Inf (expected histogram behavior, not a correctness bug).

PIPELINE_BACKPRESSURE_ON_PROPOSAL_TRIGGERED and EXECUTION_BACKPRESSURE_ON_PROPOSAL_TRIGGERED remain as register_avg_counter — they observe 0.0/1.0 boolean flags, where a single 0.5 s bucket is appropriate for computing an activation rate via sum/count. This deliberate asymmetry is correctly documented in the PR summary.


Blast Radius

PROPOSER_DELAY_PROPOSAL has exactly 1 non-test observe site (proposal_generator.rs:615). No callers are affected — this is a metrics sink, not a data source for any decision logic.

Dashboard note (non-blocking, pre-existing): The panel "Proposer delay (due to backpressure)" in dashboards/end-to-end-txn-latency.json (line 708) queries the bare metric name aptos_proposer_delay_proposal{...} > 0. Prometheus histograms expose _bucket, _sum, and _count suffixes — the bare name returns no series in standard PromQL. This was already the case with register_avg_counter and is unchanged by this PR. The PR's own test plan notes the correct post-deploy verification query (aptos_proposer_delay_proposal_bucket). Updating the dashboard to use histogram_quantile(0.9, rate(aptos_proposer_delay_proposal_bucket[5m])) would complete the intent of this fix.


Test Coverage

Changed Symbol Coverage Notes
PROPOSER_DELAY_PROPOSAL registration No unit test (consistent with all other counters in this file) Metric registration correctness is validated at startup; no runtime security path

No coverage gap with security implications.


Historical Context

git log --all --oneline -- consensus/src/counters.rs | grep -iE 'fix|security|cve' — no prior security-fix commits touching this specific metric. No regression risk.

👀

Open in Web View Automation 

Sent by Cursor Automation: Security Review Bot

@danielxiangzl danielxiangzl requested a review from grao1991 April 16, 2026 21:46
Comment thread consensus/src/counters.rs
/// instead of linear-interpolation artifacts.
pub static PROPOSER_DELAY_PROPOSAL: Lazy<Histogram> = Lazy::new(|| {
register_avg_counter(
register_histogram!(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest creating a new metric altogether otherwise rollout will be a mess.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Avoid bucket-boundary mixing during rolling deploy by using a new
metric name rather than repurposing the old one. Also follows the
Prometheus `_seconds` convention for duration histograms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@danielxiangzl danielxiangzl requested a review from a team as a code owner April 20, 2026 19:59
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 48efa5f. Configure here.

Comment thread consensus/src/counters.rs
register_avg_counter(
"aptos_proposer_delay_proposal",
register_histogram!(
"aptos_proposer_delay_proposal_seconds",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metric name changed despite PR claiming it's unchanged

High Severity

The metric name changed from "aptos_proposer_delay_proposal" to "aptos_proposer_delay_proposal_seconds", contradicting the PR description's claim that "Metric name is unchanged." This is a breaking change: any existing external dashboards, alerts, or Prometheus queries referencing aptos_proposer_delay_proposal (including _sum, _count, _bucket variants) will silently stop receiving data. During rolling deployments, nodes will emit different metric names, fragmenting aggregation queries.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 48efa5f. Configure here.

@danielxiangzl danielxiangzl enabled auto-merge (squash) April 20, 2026 20:53
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Forge suite compat success on ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc

Compatibility test results for ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc (PR)
1. Check liveness of validators at old version: ca049383dd80675149ef2d0042668964f9f9107a
compatibility::simple-validator-upgrade::liveness-check : committed: 14298.66 txn/s, latency: 2425.49 ms, (p50: 2400 ms, p70: 2800, p90: 3100 ms, p99: 3600 ms), latency samples: 468360
2. Upgrading first Validator to new version: 78e531a55613471a9a92166949b5a49c33c760cc
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6529.84 txn/s, latency: 5174.22 ms, (p50: 5700 ms, p70: 5800, p90: 5900 ms, p99: 6200 ms), latency samples: 223880
3. Upgrading rest of first batch to new version: 78e531a55613471a9a92166949b5a49c33c760cc
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6277.29 txn/s, latency: 5430.85 ms, (p50: 5900 ms, p70: 6000, p90: 6100 ms, p99: 6200 ms), latency samples: 218960
4. upgrading second batch to new version: 78e531a55613471a9a92166949b5a49c33c760cc
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10934.42 txn/s, latency: 2936.45 ms, (p50: 3100 ms, p70: 3200, p90: 3400 ms, p99: 3700 ms), latency samples: 361140
5. check swarm health
Compatibility test for ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc passed
Test Ok

@github-actions
Copy link
Copy Markdown
Contributor

✅ Forge suite realistic_env_max_load success on 78e531a55613471a9a92166949b5a49c33c760cc

two traffics test: inner traffic : committed: 13522.67 txn/s, latency: 1345.98 ms, (p50: 1500 ms, p70: 1500, p90: 1800 ms, p99: 2100 ms), latency samples: 5049660
two traffics test : committed: 100.02 txn/s, latency: 617.42 ms, (p50: 500 ms, p70: 700, p90: 800 ms, p99: 1000 ms), latency samples: 1720
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 0.693, avg: 0.667", "ConsensusProposalToOrdered: max: 0.119, avg: 0.109", "ConsensusOrderedToCommit: max: 0.174, avg: 0.161", "ConsensusProposalToCommit: max: 0.280, avg: 0.270"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.77s no progress at version 53022 (avg 0.06s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.29s no progress at version 2481674 (avg 0.29s) [limit 16].
Test Ok

@github-actions
Copy link
Copy Markdown
Contributor

✅ Forge suite framework_upgrade success on ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc

Compatibility test results for ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc (PR)
Upgrade the nodes to version: 78e531a55613471a9a92166949b5a49c33c760cc
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2521.14 txn/s, submitted: 2528.94 txn/s, failed submission: 7.80 txn/s, expired: 7.80 txn/s, latency: 1172.40 ms, (p50: 1200 ms, p70: 1200, p90: 1500 ms, p99: 2100 ms), latency samples: 226380
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2444.45 txn/s, submitted: 2451.79 txn/s, failed submission: 7.33 txn/s, expired: 7.33 txn/s, latency: 1201.69 ms, (p50: 1200 ms, p70: 1200, p90: 1500 ms, p99: 2300 ms), latency samples: 219960
5. check swarm health
Compatibility test for ca049383dd80675149ef2d0042668964f9f9107a ==> 78e531a55613471a9a92166949b5a49c33c760cc passed
Upgrade the remaining nodes to version: 78e531a55613471a9a92166949b5a49c33c760cc
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2417.43 txn/s, submitted: 2425.51 txn/s, failed submission: 8.08 txn/s, expired: 8.08 txn/s, latency: 1342.43 ms, (p50: 1200 ms, p70: 1300, p90: 1800 ms, p99: 10800 ms), latency samples: 215461
Test Ok

@danielxiangzl danielxiangzl merged commit dd42df2 into main Apr 20, 2026
103 checks passed
@danielxiangzl danielxiangzl deleted the daniel/fix-proposer-delay-histogram branch April 20, 2026 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants