[consensus] Fix BATCH_NUM_PER_BLOCK histogram buckets#19473

Merged: danielxiangzl merged 3 commits into main from daniel/fix-batch-num-buckets on Apr 21, 2026
Conversation

@danielxiangzl danielxiangzl commented Apr 16, 2026

Summary

  • Fix BATCH_NUM_PER_BLOCK histogram to use PROOF_COUNT_BUCKETS instead of TRANSACTION_COUNT_BUCKETS
  • The old buckets were designed for transaction counts (exponential from 1.5 up to thousands), while batch counts per block are typically in the tens to low hundreds

Test plan

  • Compiles cleanly
  • Lint passes

🤖 Generated with Claude Code


Note

Low Risk
Low risk: metrics-only change that adjusts histogram bucket definitions for batch/proof counts and does not affect consensus logic or data handling.

Overview
Updates quorum store metrics to use PROOF_COUNT_BUCKETS (instead of transaction-oriented buckets) for count-based histograms: BATCH_NUM_PER_BLOCK, NUM_BATCHES_WITHOUT_PROOF_OF_STORE, and PROOF_SIZE_WHEN_PULL. This improves bucket granularity for typical batch/proof counts without changing any runtime behavior beyond metrics aggregation.

Reviewed by Cursor Bugbot for commit 9c62399. Bugbot is set up for automated code reviews on this repo.

@danielxiangzl danielxiangzl marked this pull request as draft April 16, 2026 20:32
Use PROOF_COUNT_BUCKETS instead of TRANSACTION_COUNT_BUCKETS for the
batch-count-per-block metric. The old buckets were designed for
transaction counts (exponential from 1.5 up to thousands), while batch
counts per block are typically in the tens to low hundreds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@danielxiangzl danielxiangzl force-pushed the daniel/fix-batch-num-buckets branch from 2844c6f to dc419c3 Compare April 16, 2026 20:33
@danielxiangzl danielxiangzl marked this pull request as ready for review April 16, 2026 20:34
@danielxiangzl danielxiangzl requested review from grao1991 and removed request for a team April 16, 2026 20:34

@cursor cursor Bot left a comment

Stale comment

Differential Security Review — PR #19473

Scope: c1dfd4f..2844c6f · 4 commits · 7 files · +107 / -44
Reviewer: Automated differential review (Trail of Bits / Cursor)

Executive Summary

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| HIGH | 1 |
| MEDIUM | 1 |
| LOW | 2 |
| INFO | 1 |

Overall risk: Medium. The metrics fix is correct and low-risk. The HACK code in round_manager.rs introduces a deliberate, unconditional 1-second proposal delay in production consensus that degrades liveness and lacks any gating mechanism; this is the primary concern. Forge test-suite changes are internally consistent but weaken land-blocking regression coverage.

Recommendation: REVIEW BEFORE MERGE


What Changed

| Module | Files | Risk |
| --- | --- | --- |
| consensus/src/quorum_store/ | 1 | Low (metrics only) |
| consensus/src/ | 1 | High (production consensus behavior) |
| testsuite/forge/src/ | 2 | Low (test infra) |
| testsuite/forge-cli/src/suites/ | 2 | Medium (land-blocking threshold reduction) |
| .github/workflows/ | 1 | Low (CI timeout increase) |

Findings

[HIGH] Unconditional HACK delay shipped in production consensus — no feature gate

File: consensus/src/round_manager.rs (around lines 485–519 and 1453–1489)
Blast radius: Affects every validator deployment where the author is the last by validator_index ordering; applied on round % 10 == 0 rounds unconditionally.
Test coverage: No tests verify the delay behavior or its absence in non-HACK builds.

Description: Two blocks labelled // HACK unconditionally delay proposal generation by 1 second (≥ round_initial_timeout_ms default of 1000 ms) for the last validator by ordered index, 10% of rounds. There is no #[cfg(test)], #[cfg(feature = "...")], or config-flag guard. The delay runs on any binary built from this diff — including mainnet validators.

Concrete impact:

  1. Liveness degradation: For the targeted validator, every 10th round as proposer results in a near-certain timeout (1s delay with 1s timeout). This raises consensus round latency, increases timeout-vote traffic, and reduces throughput measurably.
  2. Asymmetric burden: A specific operator (the one holding the highest validator_index in the epoch's ValidatorSet) is systematically penalized. This ordering is stable within an epoch but changes at epoch boundaries.
  3. Log noise in production: warn!("HACK: delaying proposal…") fires in production logs every 10th proposer round for the affected node, confusing operators and alerting pipelines. The optimistic path (opt_proposal_delay) has no corresponding warn, adding asymmetric observability.
  4. Mixed-build divergence: Any deployment mixing this build with a clean build gets behaviorally divergent validators at a consensus-critical path.
  5. n == 1 amplification: On a single-validator network the condition i >= n - 1 is 0 >= 0, so every round divisible by 10 triggers the delay, not just 10%.

The commit message acknowledges this is for testing "slow proposer" behavior; it should not reach production without gating.
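
To make the trigger condition concrete, here is a minimal sketch of the predicate as this review describes it. It is a reconstruction from the description above, not the code in round_manager.rs; the function name `should_delay_proposal` and its parameters are hypothetical.

```rust
/// Hypothetical reconstruction of the delay condition described in the review.
/// `index` is the validator's position in the ordered validator set and
/// `num_validators` is the size of that set.
fn should_delay_proposal(round: u64, index: usize, num_validators: usize) -> bool {
    // `index + 1 >= num_validators` is the same test as `i >= n - 1`,
    // written so that it cannot underflow when the set is empty.
    round % 10 == 0 && index + 1 >= num_validators
}

fn main() {
    // Four validators: only the last index is delayed, and only on rounds divisible by 10.
    assert!(should_delay_proposal(10, 3, 4));
    assert!(!should_delay_proposal(10, 2, 4));
    assert!(!should_delay_proposal(11, 3, 4));
    // Single validator: index 0 is also the last index, so it always qualifies.
    assert!(should_delay_proposal(20, 0, 1));
    println!("delay predicate behaves as described");
}
```

The last assertion corresponds to the n == 1 case in item 5: with one validator, the index check is always true and only the round check remains.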


[MEDIUM] Land-blocking test stringency materially reduced — epoch regression coverage removed

File: testsuite/forge-cli/src/suites/realistic_environment.rs, testsuite/forge-cli/src/suites/land_blocking.rs
Test coverage: These ARE the tests; the change reduces what they catch.

Description: Three concurrent relaxations to realistic_env_max_load_test as used in land-blocking:

  1. Traffic mode: EmitJobMode::MaxLoad { mempool_backlog: 38000 } → EmitJobMode::ConstTps { tps: 4000 }. MaxLoad stresses the network to its capacity ceiling; ConstTps at 4000 caps load regardless of what the chain can handle.
  2. Success criteria: 10 000 TPS → 3 500 TPS. A 65% reduction in the minimum passing throughput bar.
  3. Epoch duration: 300 s → 24 * 3600 s. No epoch transition occurs during any realistic test run. Consensus reconfiguration paths, state-sync edge cases at epoch boundaries, and on-chain governance driven by new epochs are no longer exercised by this job.

Concrete impact: Regressions that degrade throughput to between 3 500 and 10 000 TPS, or that break behavior at epoch boundaries, will now pass land-blocking. This is a test-coverage gap, not a runtime bug. Acceptable if a separate longer-running suite (e.g. forge-stable) retains the stricter criteria, but that should be confirmed.


[LOW] PROOF_SIZE_WHEN_PULL and NUM_BATCHES_WITHOUT_PROOF_OF_STORE still use transaction-scale buckets

File: consensus/src/quorum_store/counters.rs (lines ~406–413 and ~378–385)
Concrete impact: Metrics observability only — no runtime or security effect.

Description: The PR correctly fixes BATCH_NUM_PER_BLOCK to use PROOF_COUNT_BUCKETS (range ~1–500) instead of TRANSACTION_COUNT_BUCKETS (exponential up to ~25 000+). Two adjacent metrics that also measure batch/proof counts (not transaction counts) still use TRANSACTION_COUNT_BUCKETS:

  • PROOF_SIZE_WHEN_PULL — "number of proof-of-store per block when pulled"
  • NUM_BATCHES_WITHOUT_PROOF_OF_STORE — "number of batches without proof of store"

Both would benefit from PROOF_COUNT_BUCKETS for the same reason as BATCH_NUM_PER_BLOCK. Not blocking, but worth cleaning up in the same pass.


[LOW] Warn log for HACK missing on optimistic proposal path

File: consensus/src/round_manager.rs (~line 1453)
Concrete impact: No runtime bug; observability asymmetry if HACK ships.

Description: The regular proposal path emits warn!("HACK: delaying proposal…") when a delay fires. The optimistic proposal path (opt_proposal_delay) has no equivalent warn, so delayed opt proposals are silent — harder to correlate with latency spikes from that code path.


[INFO] FORGE_RUNNER_DURATION_SECS increase: CI wall-time impact

File: .github/workflows/docker-build-test.yaml
Concrete impact: Infrastructure cost / CI queue latency.

The realistic_env_max_load Forge job's runner duration increased from 480 s to 1800 s. Combined with the duration_override(900s) for land-blocking and the duration_override.unwrap_or(global_duration) pattern in runner.rs, the effective namespace TTL for land-blocking is 900 + 1200 = 2100 s. The CI job budget (1800 s) is tighter than the full cluster TTL, but the internal runner timeout is driven by duration_override (900 s) + test overhead, not the Forge runner duration directly. No correctness issue, but CI cost increases ~3.75×.


Test Coverage Assessment

| Changed Component | Coverage | Risk Elevation |
| --- | --- | --- |
| counters.rs bucket fix | No dedicated test; bucket defs are observability-only | No elevation |
| round_manager.rs HACK delay | No test for delay presence or absence; no gate | HIGH: production behavior change without test coverage |
| Forge config / runner plumbing | Exercised by the test suite it configures | No elevation |
| Land-blocking thresholds | These are the tests; coverage gap introduced | MEDIUM |

Blast Radius

| Changed Symbol | Non-test callers | Classification |
| --- | --- | --- |
| BATCH_NUM_PER_BLOCK (bucket change) | update_batch_stats (1 site) | LOW (observability only) |
| proposal_delay HACK | Every process_new_round_event invocation on the target validator | HIGH |
| opt_proposal_delay HACK | Every start_next_opt_round invocation on the target validator | HIGH |
| ForgeConfig::duration_override | runner.rs (2 sites); land_blocking.rs (1 site) | LOW (test infra) |

Recommendations

Before Merge

  • Gate or remove the HACK delay in round_manager.rs. If this behavior is needed to test slow-proposer resilience, it should be gated behind #[cfg(feature = "failpoints")], a dedicated ConsensusConfig field (e.g. test_proposal_delay_ms), or moved to the failpoint injection system (fail::fail_point!) already used elsewhere in the consensus codebase.
  • Confirm a stricter suite covers epoch transitions — if forge-stable or another scheduled job still runs the higher-backlog / shorter-epoch configuration, document that explicitly. If not, the epoch regression coverage loss in land-blocking is unmitigated.
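
As an illustration of the gating recommendation, a config-gated delay could look roughly like the sketch below. The struct `ProposalDelayConfig` and the helper `proposal_delay` are hypothetical, and `test_proposal_delay_ms` is only the field name suggested above, not an existing ConsensusConfig field.

```rust
use std::time::Duration;

/// Hypothetical test-only knob. Defaults to `None`, so a production
/// configuration that never sets it gets no delay at all.
#[derive(Default)]
struct ProposalDelayConfig {
    test_proposal_delay_ms: Option<u64>,
}

/// Returns the delay to apply before proposing, if any.
fn proposal_delay(
    cfg: &ProposalDelayConfig,
    round: u64,
    is_last_validator: bool,
) -> Option<Duration> {
    // Early return when the knob is unset: the delay path is unreachable.
    let ms = cfg.test_proposal_delay_ms?;
    if round % 10 == 0 && is_last_validator {
        Some(Duration::from_millis(ms))
    } else {
        None
    }
}

fn main() {
    let prod = ProposalDelayConfig::default();
    assert_eq!(proposal_delay(&prod, 10, true), None);

    let test = ProposalDelayConfig { test_proposal_delay_ms: Some(1000) };
    assert_eq!(proposal_delay(&test, 10, true), Some(Duration::from_millis(1000)));
    assert_eq!(proposal_delay(&test, 11, true), None);
    println!("delay is gated by configuration");
}
```

The design point is that the default value disables the behavior, so a slow-proposer test has to opt in explicitly.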

Nice-to-have

  • Add PROOF_COUNT_BUCKETS to PROOF_SIZE_WHEN_PULL and NUM_BATCHES_WITHOUT_PROOF_OF_STORE to complete the bucket alignment started in this PR.
  • Add warn! to the optimistic proposal delay path for symmetric observability.
Sent by Cursor Automation: Security Review Bot

  "Histogram for the number of batches per (committed) blocks.",
  &["type"],
- TRANSACTION_COUNT_BUCKETS.clone(),
+ PROOF_COUNT_BUCKETS.clone(),

Bucket fix is correct. BATCH_NUM_PER_BLOCK measures proof/batch counts per block (observed values of proof_num, inline_batch_num, opt_batch_num — all small integers, typically < 500). PROOF_COUNT_BUCKETS (max 500) is the right scale; TRANSACTION_COUNT_BUCKETS (exponential to ~25 000+) was semantically mismatched.

Two adjacent metrics with the same scale mismatch remain: PROOF_SIZE_WHEN_PULL (line ~408) and NUM_BATCHES_WITHOUT_PROOF_OF_STORE (line ~380) — both measure proof/batch counts but still use TRANSACTION_COUNT_BUCKETS.

@cursor cursor Bot left a comment

Differential Security Review — PR #19473: Fix BATCH_NUM_PER_BLOCK histogram buckets

Date: 2026-04-16
Scope: Single commit dc419c3688, consensus/src/quorum_store/counters.rs, line 169
Reviewer: Automated differential review

Executive Summary

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| HIGH | 0 |
| MEDIUM | 0 |
| LOW | 0 |
| Informational | 2 |

Overall risk: Low. This is a metrics-only change that corrects histogram bucket semantics with no effect on consensus correctness, validation logic, or security properties.

Key metrics: 1 file changed, 1 Rust line changed, 2 findings (both informational)

Recommendation: APPROVE WITH NOTES


What Changed

Commits: c1dfd4f..dc419c3 (1 commit)
Files changed: 1 (1 Rust, 0 Move, 0 config) | Lines: +1 / -1

| Module | Files Changed | Risk Level |
| --- | --- | --- |
| consensus/src/quorum_store/ | 1 | LOW |

Phase 0: Orientation

Change scope: 1 file changed (1 Rust). Strategy: Full trace (tiny diff).
Critical files: None.
Risk assessment: Metrics/observability change only. Histogram bucket boundaries have zero impact on consensus protocol, execution, validation, or security-critical paths.

File Risk Classification

| File | Complexity | Risk | Rationale |
| --- | --- | --- | --- |
| consensus/src/quorum_store/counters.rs | Low | LOW | Prometheus static Lazy<HistogramVec> bucket swap; no state, no logic |

Phase 1: Pattern Detection

BEFORE:

TRANSACTION_COUNT_BUCKETS.clone(),  // exponential_buckets(1.5, 1.5, 25) → ~1.5 to ~25,251

AFTER:

PROOF_COUNT_BUCKETS.clone(),  // [1, 3, 5, 7, 10, 12, 15, 20, 25, 30, 40, 50, 60, 75, 100, 125, 150, 200, 250, 300, 500]

CHANGE: Histogram bucket layout. TRANSACTION_COUNT_BUCKETS uses exponential_buckets(1.5, 1.5, 25), producing 25 upper bounds from ~1.5 to ~25,251. PROOF_COUNT_BUCKETS is a hand-tuned set of 21 bounds from 1 to 500, giving dense granularity in the 1–100 range. The metric BATCH_NUM_PER_BLOCK observes the count of proof-type batches, inline batches, and opt-batches per committed block.
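
The difference between the two layouts can be checked with a short sketch. The `exponential_buckets` function below is a std-only stand-in written for this illustration (it takes the same three arguments as the call quoted above but is not the library function), and `PROOF_COUNT_BOUNDS` and `bounds_up_to` are hypothetical names; the bound values are copied from the AFTER line.

```rust
/// Std-only stand-in for an exponential bucket generator: `count` upper
/// bounds, starting at `start`, each `factor` times the previous one.
fn exponential_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
    std::iter::successors(Some(start), |b| Some(b * factor))
        .take(count)
        .collect()
}

/// The hand-tuned bounds quoted in the review for PROOF_COUNT_BUCKETS.
const PROOF_COUNT_BOUNDS: [f64; 21] = [
    1.0, 3.0, 5.0, 7.0, 10.0, 12.0, 15.0, 20.0, 25.0, 30.0, 40.0, 50.0, 60.0,
    75.0, 100.0, 125.0, 150.0, 200.0, 250.0, 300.0, 500.0,
];

/// Number of bucket upper bounds at or below `limit`.
fn bounds_up_to(bounds: &[f64], limit: f64) -> usize {
    bounds.iter().filter(|b| **b <= limit).count()
}

fn main() {
    let tx = exponential_buckets(1.5, 1.5, 25);
    println!("exponential: {} bounds, {:.1} to {:.1}", tx.len(), tx[0], tx[24]);
    println!(
        "bounds <= 100: exponential {}, hand-tuned {}",
        bounds_up_to(&tx, 100.0),
        bounds_up_to(&PROOF_COUNT_BOUNDS, 100.0)
    );
}
```

Under these definitions the exponential layout places 11 bounds at or below 100, against 15 for the hand-tuned list, and most of its remaining bounds lie above 500 where batch counts are not expected.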

SECURITY: None — histogram bucket changes are pure observability. No validation, no branching, no state mutation.

No Rust red flags triggered (no new .unwrap(), no unsafe, no removed checks, no visibility changes, no arithmetic).


Phase 2: Semantic Analysis

Semantic correctness of the bucket choice:

BATCH_NUM_PER_BLOCK observes results of block.proof_stats(), block.inline_batch_stats(), and block.opt_batch_stats(). These return batch/proof counts per committed block, not transaction counts. In normal Aptos mainnet operation, blocks carry tens to low hundreds of batches — not the thousands that TRANSACTION_COUNT_BUCKETS was designed for. PROOF_COUNT_BUCKETS (1–500) provides better resolution in that range and is already used by related proof-count histograms in the same file (PROOFS_WITHOUT_BATCH_SUMMARY, PROOFS_WITH_BATCH_SUMMARY, NUM_PROOFS_IN_PROOF_QUEUE_AFTER_PULL). The fix aligns naming and semantics consistently.

Blast radius: BATCH_NUM_PER_BLOCK is only written inside update_batch_stats(), which is called from one site in consensus/src/counters.rs:update_counters_for_block. No external code references the static directly. No consensus decision depends on histogram values.


Findings

[INFORMATIONAL] Finding 1 — PROOF_SIZE_WHEN_PULL and NUM_BATCHES_WITHOUT_PROOF_OF_STORE remain on TRANSACTION_COUNT_BUCKETS

File: consensus/src/quorum_store/counters.rs:407 and :378
Blast radius: Observability only
Test coverage: N/A (metrics registration)

Description: Two other histograms that count proofs or batches (not transactions) still use TRANSACTION_COUNT_BUCKETS:

  • quorum_store_proof_size_when_pull — observes proof_of_stores.len() on each block pull (call site: batch_proof_queue.rs:577)
  • num_batches_without_proof_of_store — observes num_batches_without_proof() in proof manager

These are semantically identical to BATCH_NUM_PER_BLOCK (small-integer batch/proof counts), yet they use the coarse exponential transaction buckets that start at 1.5. This inconsistency means dashboards and alerts get fine resolution for batch_num_per_block but not for these two related metrics.

Concrete impact: No current exploit, no correctness issue. Dashboard/alerting resolution mismatch only.

Note: Not a blocker for this PR; suitable as a follow-up.

[INFORMATIONAL] Finding 2 — Metrics series continuity on rollout

File: consensus/src/quorum_store/counters.rs:164

Description: Changing histogram bucket definitions causes Prometheus _bucket time-series label sets to shift on process restart. Any dashboards or alerts keyed to specific bucket boundaries (e.g. le="25.62890625" from the old exponential set) will stop matching. This is operational, not a security concern.

Concrete impact: No current exploit. Operator action may be needed to update dashboards after rollout.
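
A quick way to see which boundaries disappear is to compare the two bound lists directly. The sketch below rebuilds the old exponential bounds with a std-only loop and uses a hypothetical helper, `dropped_bounds`; it compares numeric values only and makes no claim about how an exporter renders the `le` label text.

```rust
/// Old upper bounds that have no exact counterpart in the new layout.
fn dropped_bounds(old: &[f64], new: &[f64]) -> Vec<f64> {
    old.iter().copied().filter(|o| !new.contains(o)).collect()
}

fn main() {
    // Old layout: 25 bounds, starting at 1.5 and multiplying by 1.5 each step.
    let old: Vec<f64> = std::iter::successors(Some(1.5_f64), |b| Some(b * 1.5))
        .take(25)
        .collect();
    // New layout: the hand-tuned bounds quoted earlier in this review.
    let new = [
        1.0, 3.0, 5.0, 7.0, 10.0, 12.0, 15.0, 20.0, 25.0, 30.0, 40.0, 50.0,
        60.0, 75.0, 100.0, 125.0, 150.0, 200.0, 250.0, 300.0, 500.0,
    ];

    let dropped = dropped_bounds(&old, &new);
    println!("{} of {} old bounds have no match", dropped.len(), old.len());
    // The boundary used as an example above is one of them.
    assert!(dropped.contains(&25.62890625));
}
```

Every power of 1.5 is a non-integer, so none of the old bounds coincides with the integer-valued new ones; any query pinned to an old boundary needs to be rewritten rather than adjusted.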


Test Coverage Analysis

| Changed Element | Coverage | Risk Elevation | Recommendation |
| --- | --- | --- | --- |
| BATCH_NUM_PER_BLOCK bucket definition | N/A (metrics registration) | None | No action needed; bucket changes are not unit-testable at the Rust level in a meaningful security sense |

Blast Radius

| Changed Symbol | Non-Test Callers | Classification | Critical Callers |
| --- | --- | --- | --- |
| BATCH_NUM_PER_BLOCK (indirect via update_batch_stats) | 1 (consensus/src/counters.rs) | LOW | None |

Highest-risk dependency chains: None — update_batch_stats is a fire-and-forget metrics observer.


Historical Context

The original introduction of BATCH_NUM_PER_BLOCK in commit f37103cef0 ([counters] adding/fixing batch counters) used TRANSACTION_COUNT_BUCKETS without documented rationale. There is no indication that bucket choice was tied to alerting thresholds or security invariants. This PR's fix is not a regression of a security measure.


Recommendations

Before Production

  • (Optional, non-blocking) Follow-up PR to align PROOF_SIZE_WHEN_PULL and NUM_BATCHES_WITHOUT_PROOF_OF_STORE with PROOF_COUNT_BUCKETS for consistency

Operational

  • Update any dashboards or alerts that query quorum_store_batch_num_per_block with specific le= bucket boundaries after rollout
Sent by Cursor Automation: Security Review Bot

  "Histogram for the number of batches per (committed) blocks.",
  &["type"],
- TRANSACTION_COUNT_BUCKETS.clone(),
+ PROOF_COUNT_BUCKETS.clone(),

[INFORMATIONAL] The bucket swap from TRANSACTION_COUNT_BUCKETS (exponential, ~1.5 to ~25,251) to PROOF_COUNT_BUCKETS (hand-tuned, 1–500) is semantically correct: this metric counts batches/proofs per committed block, which in practice stays in the tens-to-low-hundreds range. The fix also makes this metric consistent with PROOFS_WITHOUT_BATCH_SUMMARY, PROOFS_WITH_BATCH_SUMMARY, and NUM_PROOFS_IN_PROOF_QUEUE_AFTER_PULL, which already use PROOF_COUNT_BUCKETS.

Note for follow-up: PROOF_SIZE_WHEN_PULL (line 407) and NUM_BATCHES_WITHOUT_PROOF_OF_STORE (line 378) are semantically equivalent (proof/batch counts, not transaction counts) but still use TRANSACTION_COUNT_BUCKETS. Aligning those in a separate PR would complete the consistency story.

Address review comment: NUM_BATCHES_WITHOUT_PROOF_OF_STORE and
PROOF_SIZE_WHEN_PULL measure proof/batch counts (small integers),
so they should use PROOF_COUNT_BUCKETS rather than the
transaction-scaled TRANSACTION_COUNT_BUCKETS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@danielxiangzl danielxiangzl enabled auto-merge (squash) April 20, 2026 20:13

@github-actions

✅ Forge suite compat success on ca049383dd80675149ef2d0042668964f9f9107a ==> 9c623999dd9d9cda3e397850d170957db7493387

Compatibility test results for ca049383dd80675149ef2d0042668964f9f9107a ==> 9c623999dd9d9cda3e397850d170957db7493387 (PR)
1. Check liveness of validators at old version: ca049383dd80675149ef2d0042668964f9f9107a
compatibility::simple-validator-upgrade::liveness-check : committed: 13721.66 txn/s, latency: 2529.76 ms, (p50: 2400 ms, p70: 2700, p90: 3400 ms, p99: 4600 ms), latency samples: 447700
2. Upgrading first Validator to new version: 9c623999dd9d9cda3e397850d170957db7493387
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6327.72 txn/s, latency: 5312.44 ms, (p50: 5700 ms, p70: 5900, p90: 6000 ms, p99: 6100 ms), latency samples: 223720
3. Upgrading rest of first batch to new version: 9c623999dd9d9cda3e397850d170957db7493387
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6356.18 txn/s, latency: 5310.87 ms, (p50: 5900 ms, p70: 6000, p90: 6200 ms, p99: 6300 ms), latency samples: 217700
4. upgrading second batch to new version: 9c623999dd9d9cda3e397850d170957db7493387
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 11095.80 txn/s, latency: 2957.08 ms, (p50: 3100 ms, p70: 3300, p90: 3500 ms, p99: 3800 ms), latency samples: 364720
5. check swarm health
Compatibility test for ca049383dd80675149ef2d0042668964f9f9107a ==> 9c623999dd9d9cda3e397850d170957db7493387 passed
Test Ok

@github-actions

✅ Forge suite framework_upgrade success on ca049383dd80675149ef2d0042668964f9f9107a ==> 9c623999dd9d9cda3e397850d170957db7493387

Compatibility test results for ca049383dd80675149ef2d0042668964f9f9107a ==> 9c623999dd9d9cda3e397850d170957db7493387 (PR)
Upgrade the nodes to version: 9c623999dd9d9cda3e397850d170957db7493387
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2226.84 txn/s, submitted: 2233.42 txn/s, failed submission: 6.59 txn/s, expired: 6.59 txn/s, latency: 1291.49 ms, (p50: 1200 ms, p70: 1500, p90: 1800 ms, p99: 2700 ms), latency samples: 202801
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2432.26 txn/s, submitted: 2439.77 txn/s, failed submission: 7.52 txn/s, expired: 7.52 txn/s, latency: 1197.05 ms, (p50: 1100 ms, p70: 1200, p90: 1800 ms, p99: 2700 ms), latency samples: 219962
5. check swarm health
Compatibility test for ca049383dd80675149ef2d0042668964f9f9107a ==> 9c623999dd9d9cda3e397850d170957db7493387 passed
Upgrade the remaining nodes to version: 9c623999dd9d9cda3e397850d170957db7493387
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1469.40 txn/s, submitted: 1473.70 txn/s, failed submission: 4.30 txn/s, expired: 4.30 txn/s, latency: 2386.54 ms, (p50: 1200 ms, p70: 1500, p90: 6000 ms, p99: 16900 ms), latency samples: 129820
Test Ok

@github-actions

✅ Forge suite realistic_env_max_load success on 9c623999dd9d9cda3e397850d170957db7493387

two traffics test: inner traffic : committed: 13212.33 txn/s, latency: 1379.23 ms, (p50: 1500 ms, p70: 1500, p90: 1800 ms, p99: 2300 ms), latency samples: 4934200
two traffics test : committed: 99.99 txn/s, latency: 624.66 ms, (p50: 500 ms, p70: 600, p90: 900 ms, p99: 1000 ms), latency samples: 1660
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 0.736, avg: 0.710", "ConsensusProposalToOrdered: max: 0.125, avg: 0.116", "ConsensusOrderedToCommit: max: 0.162, avg: 0.153", "ConsensusProposalToCommit: max: 0.285, avg: 0.269"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.50s no progress at version 4905479 (avg 0.06s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.35s no progress at version 2204541 (avg 0.35s) [limit 16].
Test Ok

@danielxiangzl danielxiangzl merged commit 81b3d57 into main Apr 21, 2026
78 of 83 checks passed
@danielxiangzl danielxiangzl deleted the daniel/fix-batch-num-buckets branch April 21, 2026 00:24