[consensus] Fix BATCH_NUM_PER_BLOCK histogram buckets#19473
[consensus] Fix BATCH_NUM_PER_BLOCK histogram buckets#19473danielxiangzl merged 3 commits intomainfrom
Conversation
Use PROOF_COUNT_BUCKETS instead of TRANSACTION_COUNT_BUCKETS for the batch-count-per-block metric. The old buckets were designed for transaction counts (exponential from 1.5 up to thousands), while batch counts per block are typically in the tens to low hundreds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2844c6f to
dc419c3
Compare
There was a problem hiding this comment.
Stale comment
Differential Security Review — PR #19473
Scope:
c1dfd4f..2844c6f· 4 commits · 7 files ·+107 / -44
Reviewer: Automated differential review (Trail of Bits / Cursor)Executive Summary
Severity Count CRITICAL 0 HIGH 1 MEDIUM 1 LOW 2 INFO 1 Overall risk: Medium. The metrics fix is correct and low-risk. The HACK code in
round_manager.rsintroduces a deliberate, unconditional 1-second proposal delay in production consensus that degrades liveness and lacks any gating mechanism; this is the primary concern. Forge test-suite changes are internally consistent but weaken land-blocking regression coverage.Recommendation: REVIEW BEFORE MERGE
What Changed
Module Files Risk consensus/src/quorum_store/1 Low — metrics only consensus/src/1 High — production consensus behavior testsuite/forge/src/2 Low — test infra testsuite/forge-cli/src/suites/2 Medium — land-blocking threshold reduction .github/workflows/1 Low — CI timeout increase
Findings
[HIGH] Unconditional HACK delay shipped in production consensus — no feature gate
File:
consensus/src/round_manager.rs(around lines 485–519 and 1453–1489)
Blast radius: Affects every validator deployment where the author is the last byvalidator_indexordering; applied onround % 10 == 0rounds unconditionally.
Test coverage: No tests verify the delay behavior or its absence in non-HACK builds.Description: Two blocks labelled
// HACKunconditionally delay proposal generation by 1 second (≥round_initial_timeout_msdefault of 1000 ms) for the last validator by ordered index, 10% of rounds. There is no#[cfg(test)],#[cfg(feature = "...")], or config-flag guard. The delay runs on any binary built from this diff — including mainnet validators.Concrete impact:
- Liveness degradation: For the targeted validator, every 10th round as proposer results in a near-certain timeout (1s delay with 1s timeout). This raises consensus round latency, increases timeout-vote traffic, and reduces throughput measurably.
- Asymmetric burden: A specific operator (the one holding the highest
validator_indexin the epoch'sValidatorSet) is systematically penalized. This ordering is stable within an epoch but changes at epoch boundaries.- Log noise in production:
warn!("HACK: delaying proposal…")fires in production logs every 10th proposer round for the affected node, confusing operators and alerting pipelines. The optimistic path (opt_proposal_delay) has no corresponding warn, adding asymmetric observability.- Mixed-build divergence: Any deployment mixing this build with a clean build gets behaviorally divergent validators at a consensus-critical path.
n == 1amplification: On a single-validator network the conditioni >= n - 1is0 >= 0— every round divisible by 10 triggers the delay, not just 10%.The commit message acknowledges this is for testing "slow proposer" behavior; it should not reach production without gating.
[MEDIUM] Land-blocking test stringency materially reduced — epoch regression coverage removed
File:
testsuite/forge-cli/src/suites/realistic_environment.rs,testsuite/forge-cli/src/suites/land_blocking.rs
Test coverage: These ARE the tests; the change reduces what they catch.Description: Three concurrent relaxations to
realistic_env_max_load_testas used in land-blocking:
- Traffic mode:
EmitJobMode::MaxLoad { mempool_backlog: 38000 }→EmitJobMode::ConstTps { tps: 4000 }. MaxLoad stresses the network to its capacity ceiling; ConstTps at 4000 caps load regardless of what the chain can handle.- Success criteria:
10 000 TPS→3 500 TPS. A 65% reduction in the minimum passing throughput bar.- Epoch duration:
300 s→24 * 3600 s. No epoch transition occurs during any realistic test run. Consensus reconfiguration paths, state-sync edge cases at epoch boundaries, and on-chain governance driven by new epochs are no longer exercised by this job.Concrete impact: Regressions that degrade throughput to between 3 500 and 10 000 TPS, or that break behavior at epoch boundaries, will now pass land-blocking. This is a test-coverage gap, not a runtime bug. Acceptable if a separate longer-running suite (e.g. forge-stable) retains the stricter criteria, but that should be confirmed.
[LOW]
PROOF_SIZE_WHEN_PULLandNUM_BATCHES_WITHOUT_PROOF_OF_STOREstill use transaction-scale bucketsFile:
consensus/src/quorum_store/counters.rs(lines ~406–413 and ~378–385)
Concrete impact: Metrics observability only — no runtime or security effect.Description: The PR correctly fixes
BATCH_NUM_PER_BLOCKto usePROOF_COUNT_BUCKETS(range ~1–500) instead ofTRANSACTION_COUNT_BUCKETS(exponential up to ~25 000+). Two adjacent metrics that also measure batch/proof counts (not transaction counts) still useTRANSACTION_COUNT_BUCKETS:
PROOF_SIZE_WHEN_PULL— "number of proof-of-store per block when pulled"NUM_BATCHES_WITHOUT_PROOF_OF_STORE— "number of batches without proof of store"Both would benefit from
PROOF_COUNT_BUCKETSfor the same reason asBATCH_NUM_PER_BLOCK. Not blocking, but worth cleaning up in the same pass.
[LOW] Warn log for HACK missing on optimistic proposal path
File:
consensus/src/round_manager.rs(~line 1453)
Concrete impact: No runtime bug; observability asymmetry if HACK ships.Description: The regular proposal path emits
warn!("HACK: delaying proposal…")when a delay fires. The optimistic proposal path (opt_proposal_delay) has no equivalent warn, so delayed opt proposals are silent — harder to correlate with latency spikes from that code path.
[INFO]
FORGE_RUNNER_DURATION_SECSincrease: CI wall-time impactFile:
.github/workflows/docker-build-test.yaml
Concrete impact: Infrastructure cost / CI queue latency.The
realistic_env_max_loadForge job's runner duration increased from 480 s to 1800 s. Combined with theduration_override(900s)for land-blocking and theduration_override.unwrap_or(global_duration)pattern inrunner.rs, the effective namespace TTL for land-blocking is900 + 1200 = 2100 s. The CI job budget (1800 s) is tighter than the full cluster TTL, but the internal runner timeout is driven byduration_override(900 s) + test overhead, not the Forge runner duration directly. No correctness issue, but CI cost increases ~3.75×.
Test Coverage Assessment
Changed Component Coverage Risk Elevation counters.rsbucket fixNo dedicated test; bucket defs are observability-only No elevation round_manager.rsHACK delayNo test for delay presence or absence; no gate HIGH — production behavior change without test coverage Forge config / runner plumbing Exercised by the test suite it configures No elevation Land-blocking thresholds These are the tests — coverage gap introduced MEDIUM
Blast Radius
Changed Symbol Non-test callers Classification BATCH_NUM_PER_BLOCK(bucket change)update_batch_stats(1 site)LOW — observability only proposal_delayHACKEvery process_new_round_eventinvocation on the target validatorHIGH opt_proposal_delayHACKEvery start_next_opt_roundinvocation on the target validatorHIGH ForgeConfig::duration_overriderunner.rs(2 sites);land_blocking.rs(1 site)LOW — test infra
Recommendations
Before Merge
- Gate or remove the HACK delay in
round_manager.rs. If this behavior is needed to test slow-proposer resilience, it should be gated behind#[cfg(feature = "failpoints")], a dedicatedConsensusConfigfield (e.g.test_proposal_delay_ms), or moved to the failpoint injection system (fail::fail_point!) already used elsewhere in the consensus codebase.- Confirm a stricter suite covers epoch transitions — if forge-stable or another scheduled job still runs the higher-backlog / shorter-epoch configuration, document that explicitly. If not, the epoch regression coverage loss in land-blocking is unmitigated.
Nice-to-have
- Add
PROOF_COUNT_BUCKETStoPROOF_SIZE_WHEN_PULLandNUM_BATCHES_WITHOUT_PROOF_OF_STOREto complete the bucket alignment started in this PR.- Add
warn!to the optimistic proposal delay path for symmetric observability.Sent by Cursor Automation: Security Review Bot
| "Histogram for the number of batches per (committed) blocks.", | ||
| &["type"], | ||
| TRANSACTION_COUNT_BUCKETS.clone(), | ||
| PROOF_COUNT_BUCKETS.clone(), |
There was a problem hiding this comment.
Bucket fix is correct. BATCH_NUM_PER_BLOCK measures proof/batch counts per block (observed values of proof_num, inline_batch_num, opt_batch_num — all small integers, typically < 500). PROOF_COUNT_BUCKETS (max 500) is the right scale; TRANSACTION_COUNT_BUCKETS (exponential to ~25 000+) was semantically mismatched.
Two adjacent metrics with the same scale mismatch remain: PROOF_SIZE_WHEN_PULL (line ~408) and NUM_BATCHES_WITHOUT_PROOF_OF_STORE (line ~380) — both measure proof/batch counts but still use TRANSACTION_COUNT_BUCKETS.
There was a problem hiding this comment.
Differential Security Review — PR #19473: Fix BATCH_NUM_PER_BLOCK histogram buckets
👀
Date: 2026-04-16
Scope: Single commit dc419c3688 — consensus/src/quorum_store/counters.rs, line 169
Reviewer: Automated differential review
Executive Summary
| Severity | Count |
|---|---|
| CRITICAL | 0 |
| HIGH | 0 |
| MEDIUM | 0 |
| LOW | 0 |
| Informational | 2 |
Overall risk: Low. This is a metrics-only change that corrects histogram bucket semantics with no effect on consensus correctness, validation logic, or security properties.
Key metrics: 1 file changed, 1 Rust line changed, 1 finding (informational consistency gap)
Recommendation: APPROVE WITH NOTES
What Changed
Commits: c1dfd4f..dc419c3 (1 commit)
Files changed: 1 (1 Rust, 0 Move, 0 config) | Lines: +1 / -1
| Module | Files Changed | Risk Level |
|---|---|---|
consensus/src/quorum_store/ |
1 | LOW |
Phase 0: Orientation
Change scope: 1 file changed (1 Rust). Strategy: Full trace (tiny diff).
Critical files: None.
Risk assessment: Metrics/observability change only. Histogram bucket boundaries have zero impact on consensus protocol, execution, validation, or security-critical paths.
File Risk Classification
| File | Complexity | Risk | Rationale |
|---|---|---|---|
consensus/src/quorum_store/counters.rs |
Low | LOW | Prometheus static Lazy<HistogramVec> bucket swap; no state, no logic |
Phase 1: Pattern Detection
BEFORE:
TRANSACTION_COUNT_BUCKETS.clone(), // exponential_buckets(1.5, 1.5, 25) → ~1.5 to ~25,251AFTER:
PROOF_COUNT_BUCKETS.clone(), // [1, 3, 5, 7, 10, 12, 15, 20, 25, 30, 40, 50, 60, 75, 100, 125, 150, 200, 250, 300, 500]CHANGE: Histogram bucket layout. TRANSACTION_COUNT_BUCKETS uses exponential_buckets(1.5, 1.5, 25), producing 25 upper bounds from ~1.5 to ~25,251. PROOF_COUNT_BUCKETS is a hand-tuned set of 21 bounds from 1 to 500, giving dense granularity in the 1–100 range. The metric BATCH_NUM_PER_BLOCK observes the count of proof-type batches, inline batches, and opt-batches per committed block.
SECURITY: None — histogram bucket changes are pure observability. No validation, no branching, no state mutation.
No Rust red flags triggered (no new .unwrap(), no unsafe, no removed checks, no visibility changes, no arithmetic).
Phase 2: Semantic Analysis
Semantic correctness of the bucket choice:
BATCH_NUM_PER_BLOCK observes results of block.proof_stats(), block.inline_batch_stats(), and block.opt_batch_stats(). These return batch/proof counts per committed block, not transaction counts. In normal Aptos mainnet operation, blocks carry tens to low hundreds of batches — not the thousands that TRANSACTION_COUNT_BUCKETS was designed for. PROOF_COUNT_BUCKETS (1–500) provides better resolution in that range and is already used by related proof-count histograms in the same file (PROOFS_WITHOUT_BATCH_SUMMARY, PROOFS_WITH_BATCH_SUMMARY, NUM_PROOFS_IN_PROOF_QUEUE_AFTER_PULL). The fix aligns naming and semantics consistently.
Blast radius: BATCH_NUM_PER_BLOCK is only written inside update_batch_stats(), which is called from one site in consensus/src/counters.rs:update_counters_for_block. No external code references the static directly. No consensus decision depends on histogram values.
Findings
[INFORMATIONAL] Finding 1 — PROOF_SIZE_WHEN_PULL and NUM_BATCHES_WITHOUT_PROOF_OF_STORE remain on TRANSACTION_COUNT_BUCKETS
File: consensus/src/quorum_store/counters.rs:407 and :378
Blast radius: Observability only
Test coverage: N/A (metrics registration)
Description: Two other histograms that count proofs or batches (not transactions) still use TRANSACTION_COUNT_BUCKETS:
quorum_store_proof_size_when_pull— observesproof_of_stores.len()on each block pull (call site:batch_proof_queue.rs:577)num_batches_without_proof_of_store— observesnum_batches_without_proof()in proof manager
These are semantically identical to BATCH_NUM_PER_BLOCK (small-integer batch/proof counts), yet they use the coarse exponential transaction buckets that start at 1.5. This inconsistency means dashboards and alerts get fine resolution for batch_num_per_block but not for these two related metrics.
Concrete impact: No current exploit, no correctness issue. Dashboard/alerting resolution mismatch only.
Note: Not a blocker for this PR; suitable as a follow-up.
[INFORMATIONAL] Finding 2 — Metrics series continuity on rollout
File: consensus/src/quorum_store/counters.rs:164
Description: Changing histogram bucket definitions causes Prometheus _bucket time-series label sets to shift on process restart. Any dashboards or alerts keyed to specific bucket boundaries (e.g. le="25.62890625" from the old exponential set) will stop matching. This is operational, not a security concern.
Concrete impact: No current exploit. Operator action may be needed to update dashboards after rollout.
Test Coverage Analysis
| Changed Element | Coverage | Risk Elevation | Recommendation |
|---|---|---|---|
BATCH_NUM_PER_BLOCK bucket definition |
N/A — metrics registration | None | No action needed; bucket changes are not unit-testable at the Rust level in a meaningful security sense |
Blast Radius
| Changed Symbol | Non-Test Callers | Classification | Critical Callers |
|---|---|---|---|
BATCH_NUM_PER_BLOCK (indirect via update_batch_stats) |
1 (consensus/src/counters.rs) |
LOW | None |
Highest-risk dependency chains: None — update_batch_stats is a fire-and-forget metrics observer.
Historical Context
The original introduction of BATCH_NUM_PER_BLOCK in commit f37103cef0 ([counters] adding/fixing batch counters) used TRANSACTION_COUNT_BUCKETS without documented rationale. There is no indication that bucket choice was tied to alerting thresholds or security invariants. This PR's fix is not a regression of a security measure.
Recommendations
Before Production
- (Optional, non-blocking) Follow-up PR to align
PROOF_SIZE_WHEN_PULLandNUM_BATCHES_WITHOUT_PROOF_OF_STOREwithPROOF_COUNT_BUCKETSfor consistency
Operational
- Update any dashboards or alerts that query
quorum_store_batch_num_per_blockwith specificle=bucket boundaries after rollout
Sent by Cursor Automation: Security Review Bot
| "Histogram for the number of batches per (committed) blocks.", | ||
| &["type"], | ||
| TRANSACTION_COUNT_BUCKETS.clone(), | ||
| PROOF_COUNT_BUCKETS.clone(), |
There was a problem hiding this comment.
[INFORMATIONAL] The bucket swap from TRANSACTION_COUNT_BUCKETS (exponential, ~1.5 to ~25,251) to PROOF_COUNT_BUCKETS (hand-tuned, 1–500) is semantically correct: this metric counts batches/proofs per committed block, which in practice stays in the tens-to-low-hundreds range. The fix also makes this metric consistent with PROOFS_WITHOUT_BATCH_SUMMARY, PROOFS_WITH_BATCH_SUMMARY, and NUM_PROOFS_IN_PROOF_QUEUE_AFTER_PULL, which already use PROOF_COUNT_BUCKETS.
Note for follow-up: PROOF_SIZE_WHEN_PULL (line 407) and NUM_BATCHES_WITHOUT_PROOF_OF_STORE (line 378) are semantically equivalent (probe counts, not tx counts) but still use TRANSACTION_COUNT_BUCKETS. Aligning those in a separate PR would complete the consistency story.
Address review comment: NUM_BATCHES_WITHOUT_PROOF_OF_STORE and PROOF_SIZE_WHEN_PULL measure proof/batch counts (small integers), so they should use PROOF_COUNT_BUCKETS rather than the transaction-scaled TRANSACTION_COUNT_BUCKETS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
✅ Forge suite
|
✅ Forge suite
|
✅ Forge suite
|


Summary
BATCH_NUM_PER_BLOCKhistogram to usePROOF_COUNT_BUCKETSinstead ofTRANSACTION_COUNT_BUCKETSTest plan
🤖 Generated with Claude Code
Note
Low Risk
Low risk: metrics-only change that adjusts histogram bucket definitions for batch/proof counts and does not affect consensus logic or data handling.
Overview
Updates quorum store metrics to use
PROOF_COUNT_BUCKETS(instead of transaction-oriented buckets) for count-based histograms:BATCH_NUM_PER_BLOCK,NUM_BATCHES_WITHOUT_PROOF_OF_STORE, andPROOF_SIZE_WHEN_PULL. This improves bucket granularity for typical batch/proof counts without changing any runtime behavior beyond metrics aggregation.Reviewed by Cursor Bugbot for commit 9c62399. Bugbot is set up for automated code reviews on this repo. Configure here.