Skip to content

[cp][aptos-release-v1.44] [consensus] Add pull loop duration and empty retry metrics#19464

Open
aptos-bot wants to merge 1 commit intoaptos-release-v1.44from
backport/aptos-release-v1.44/pr-19449
Open

[cp][aptos-release-v1.44] [consensus] Add pull loop duration and empty retry metrics#19464
aptos-bot wants to merge 1 commit intoaptos-release-v1.44from
backport/aptos-release-v1.44/pr-19449

Conversation

@aptos-bot
Copy link
Copy Markdown
Contributor

@aptos-bot aptos-bot commented Apr 15, 2026

Backport

This will backport the following commits from main to aptos-release-v1.44:

Questions ?

Please refer to the Backport tool documentation


Note

Low Risk
Low risk: adds consensus/QuorumStore observability only, with minimal logic change (tracking elapsed time and empty poll retries). Main risk is minor metric-cardinality/bucketing overhead or misleading interpretation if the loop semantics change.

Overview
Adds two new consensus metrics: aptos_consensus_pull_loop_duration_seconds (custom-bucketed histogram for end-to-end pull-loop duration including retries) and aptos_consensus_pull_loop_empty_retries (histogram for how many empty polls occurred).

In QuorumStoreClient::pull, tracks empty_retries, records both histograms after the loop completes, and enriches the debug log with empty_retries while reusing the computed pull_duration.

Reviewed by Cursor Bugbot for commit 83d1dce. Bugbot is set up for automated code reviews on this repo. Configure here.

* [consensus] Add pull loop duration and empty retry metrics

The payload pull loop in proposal generation can silently add
30-300ms of latency when QS has no batches ready, but there are
no metrics to confirm this. Add two new metrics to instrument
the outer pull loop:

- aptos_consensus_pull_loop_duration_seconds: full loop duration
- aptos_consensus_pull_loop_empty_retries: count of 30ms empty
  retries per proposal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [consensus] Fix histogram buckets for pull loop metrics

- PULL_LOOP_DURATION: add custom buckets at 30ms steps (matching
  NO_TXN_DELAY) up to 330ms, then coarser to 1s. Avoids the default
  Prometheus 250-500ms gap that would hide the 300ms poll ceiling.
- PULL_LOOP_EMPTY_RETRIES: extend buckets to 35 retries to cover
  non-default configs (1000ms poll time = ~34 retries).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [consensus] Fix formatting for histogram bucket vecs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
(cherry picked from commit b35cd71)
@danielxiangzl danielxiangzl enabled auto-merge (squash) April 15, 2026 20:40
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Forge suite framework_upgrade success on e0f33de9783f2ceaa30e5f2b004d3b39812c4f06 ==> 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed

Forge report malformed: Expecting property name enclosed in double quotes: line 11 column 1 (char 290)
'{\n  "metrics": [\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "submitted_txn",\n      "value": 222580.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "expired_txn",\n[2026-04-15T21:18:45Z INFO  aptos_forge::report] Test Ok\n      "value": 760.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "avg_tps",\n      "value": 2461.1038526823468\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "avg_latency",\n      "value": 1179.3033089892706\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "p50_latency",\n      "value": 1200.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "p90_latency",\n      "value": 1500.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "p99_latency",\n      "value": 2100.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "submitted_txn",\n      "value": 217601.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "expired_txn",\n      "value": 600.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "avg_tps",\n      "value": 2391.7090815262072\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "avg_latency",\n      "value": 1220.7552684089014\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "p50_latency",\n      "value": 1200.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "p90_latency",\n      "value": 1500.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "p99_latency",\n      "value": 2100.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "submitted_txn",\n      "value": 118500.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "expired_txn",\n      "value": 500.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "avg_tps",\n      "value": 1342.2600284045318\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "avg_latency",\n      "value": 2274.547127118644\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "p50_latency",\n      "value": 1400.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "p90_latency",\n      "value": 5600.0\n    },\n    {\n      "test_name": "framework_upgrade::framework-upgrade::full-framework-upgrade",\n      "metric": "p99_latency",\n      "value": 11200.0\n    }\n  ],\n  "text": "Compatibility test results for e0f33de9783f2ceaa30e5f2b004d3b39812c4f06 ==> 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed (PR)\\nUpgrade the nodes to version: 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed\\nframework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2461.10 txn/s, submitted: 2469.54 txn/s, failed submission: 8.43 txn/s, expired: 8.43 txn/s, latency: 1179.30 ms, (p50: 1200 ms, p70: 1200, p90: 1500 ms, p99: 2100 ms), latency samples: 221820\\nframework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2391.71 txn/s, submitted: 2398.32 txn/s, failed submission: 6.61 txn/s, expired: 6.61 txn/s, latency: 1220.76 ms, (p50: 1200 ms, p70: 1200, p90: 1500 ms, p99: 2100 ms), latency samples: 217001\\n5. check swarm health\\nCompatibility test for e0f33de9783f2ceaa30e5f2b004d3b39812c4f06 ==> 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed passed\\nUpgrade the remaining nodes to version: 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed\\nframework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1342.26 txn/s, submitted: 1347.95 txn/s, failed submission: 5.69 txn/s, expired: 5.69 txn/s, latency: 2274.55 ms, (p50: 1400 ms, p70: 2400, p90: 5600 ms, p99: 11200 ms), latency samples: 118000\\nTest Ok"\n}'
Trailing Log Lines:
Compatibility test results for e0f33de9783f2ceaa30e5f2b004d3b39812c4f06 ==> 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed (PR)
Upgrade the nodes to version: 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2461.10 txn/s, submitted: 2469.54 txn/s, failed submission: 8.43 txn/s, expired: 8.43 txn/s, latency: 1179.30 ms, (p50: 1200 ms, p70: 1200, p90: 1500 ms, p99: 2100 ms), latency samples: 221820
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2391.71 txn/s, submitted: 2398.32 txn/s, failed submission: 6.61 txn/s, expired: 6.61 txn/s, latency: 1220.76 ms, (p50: 1200 ms, p70: 1200, p90: 1500 ms, p99: 2100 ms), latency samples: 217001
5. check swarm health
Compatibility test for e0f33de9783f2ceaa30e5f2b004d3b39812c4f06 ==> 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed passed
Upgrade the remaining nodes to version: 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1342.26 txn/s, submitted: 1347.95 txn/s, failed submission: 5.69 txn/s, expired: 5.69 txn/s, latency: 2274.55 ms, (p50: 1400 ms, p70: 2400, p90: 5600 ms, p99: 11200 ms), latency samples: 118000
Test Ok

=== BEGIN JUNIT ===
<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="forge" tests="1" failures="0" errors="0" uuid="54f451da-676f-4df8-8e90-9a594b47e237">
    <testsuite name="local" tests="1" disabled="0" errors="0" failures="0">
        <testcase name="framework_upgrade::framework-upgrade">
        </testcase>
    </testsuite>
</testsuites>
=== END JUNIT ===
[2026-04-15T21:18:45Z INFO  aptos_forge::backend::k8s::cluster_helper] Deleting namespace forge-framework-upgrade-pr-19464: Some(NamespaceStatus { conditions: None, phase: Some("Terminating") })
[2026-04-15T21:18:45Z INFO  aptos_forge::backend::k8s::cluster_helper] aptos-node resources for Forge removed in namespace: forge-framework-upgrade-pr-19464
[2026-04-15T21:18:45Z INFO  ureq::unit] sending request POST http://vmagent-victoria-metrics-agent.victoria-metrics.svc:8429/api/v1/import/prometheus

test result: ok. 1 passed; 0 soft failed; 0 hard failed; 0 filtered out

Debugging output:
NAME                                         READY   STATUS      RESTARTS   AGE
aptos-node-0-validator-0                     1/1     Running     0          11m
aptos-node-1-validator-0                     1/1     Running     0          10m
aptos-node-2-validator-0                     1/1     Running     0          2m52s
aptos-node-3-validator-0                     1/1     Running     0          2m24s
forge-testnet-deployer-lpbtz                 0/1     Completed   0          15m
genesis-aptos-genesis-eforge7ee68f48-mksgn   0/1     Completed   0          14m

@github-actions
Copy link
Copy Markdown
Contributor

✅ Forge suite realistic_env_max_load success on 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed

two traffics test: inner traffic : committed: 14263.49 txn/s, latency: 2495.28 ms, (p50: 2300 ms, p70: 2600, p90: 3400 ms, p99: 4200 ms), latency samples: 5310200
two traffics test : committed: 100.02 txn/s, latency: 1169.91 ms, (p50: 1000 ms, p70: 1400, p90: 1700 ms, p99: 2200 ms), latency samples: 1720
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.255, avg: 1.152", "ConsensusProposalToOrdered: max: 0.173, avg: 0.170", "ConsensusOrderedToCommit: max: 0.164, avg: 0.144", "ConsensusProposalToCommit: max: 0.337, avg: 0.314"]
Max non-epoch-change gap was: 2 rounds at version 36192 (avg 0.00) [limit 4], 2.27s no progress at version 36192 (avg 0.08s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.41s no progress at version 2603424 (avg 0.41s) [limit 16].
Test Ok

@danielxiangzl danielxiangzl disabled auto-merge April 15, 2026 22:08
@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Forge suite compat success on e0f33de9783f2ceaa30e5f2b004d3b39812c4f06 ==> 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed

Compatibility test results for e0f33de9783f2ceaa30e5f2b004d3b39812c4f06 ==> 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed (PR)
1. Check liveness of validators at old version: e0f33de9783f2ceaa30e5f2b004d3b39812c4f06
compatibility::simple-validator-upgrade::liveness-check : committed: 13798.37 txn/s, latency: 2510.63 ms, (p50: 2500 ms, p70: 2800, p90: 3300 ms, p99: 3700 ms), latency samples: 457520
2. Upgrading first Validator to new version: 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 5858.80 txn/s, latency: 5792.22 ms, (p50: 6500 ms, p70: 6600, p90: 6700 ms, p99: 6800 ms), latency samples: 200080
3. Upgrading rest of first batch to new version: 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 5863.55 txn/s, latency: 5795.95 ms, (p50: 6400 ms, p70: 6500, p90: 6600 ms, p99: 6600 ms), latency samples: 203520
4. upgrading second batch to new version: 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9845.26 txn/s, latency: 3374.23 ms, (p50: 3300 ms, p70: 4100, p90: 4600 ms, p99: 4900 ms), latency samples: 324700
5. check swarm health
Compatibility test for e0f33de9783f2ceaa30e5f2b004d3b39812c4f06 ==> 83d1dce2fc67c6f9a43959cd33d98bfd27a5e7ed passed
Test Ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants