[inspection-service] opens immediately #19488
Merged
ibalajiarun
approved these changes
Apr 23, 2026
/// Holds the components that are injected into the inspection service after it starts.
/// Uses `RwLock<Option<T>>` so the service can start before these are available.
pub struct InspectionServiceComponents {
    pub data_client: RwLock<Option<AptosDataClient>>,
Contributor
nit: this can be a OnceCell?
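For readers weighing this suggestion, here is a minimal sketch of what a write-once field might look like. It assumes the standard library's thread-safe `std::sync::OnceLock` in place of `OnceCell`. `DataClient` and `OnceComponents` are hypothetical stand-ins, not types from this PR.

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for the real `AptosDataClient` type.
#[derive(Clone, Debug, PartialEq)]
pub struct DataClient(pub String);

/// Hypothetical write-once variant of the components struct.
#[derive(Default)]
pub struct OnceComponents {
    pub data_client: OnceLock<DataClient>,
}

impl OnceComponents {
    /// Stores the client on the first call. If a value is already present,
    /// the rejected value is handed back in `Err` and nothing is overwritten.
    pub fn set_data_client(&self, client: DataClient) -> Result<(), DataClient> {
        self.data_client.set(client)
    }

    /// Returns a clone of the client, or `None` while the node is still initializing.
    pub fn data_client(&self) -> Option<DataClient> {
        self.data_client.get().cloned()
    }
}
```

Unlike `RwLock<Option<T>>`, this shape rejects a second `set` instead of overwriting the first value, and reads need no lock guard.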
395dc7e to
7f4ee1b
7f4ee1b to
7490bff
JoshLind
approved these changes
Apr 23, 2026
Contributor
✅ Forge suite
Description
The node inspection service (port 9101, `/metrics`) previously started after RocksDB initialization, which can take 1–2+ minutes on validators. This meant Prometheus scrapers received `connection refused` for the entire RocksDB open phase, delaying metric visibility by 1–2 minutes plus up to one additional scrape interval (30s).
The root cause was that `start_node_inspection_service` was called at the end of `setup_environment_and_start_node`, after `initialize_database_and_checkpoints`, `setup_networks_and_get_interfaces`, and `start_state_sync_and_get_notification_handles` had all completed.
The `/metrics` endpoint has no dependency on any of those subsystems: it only reads from the global Prometheus registry. Only `/peer_information` actually needs `AptosDataClient` and `PeersAndMetadata`.
Changes:
- Adds `InspectionServiceComponents` to `aptos-inspection-service`: an `Arc`-shareable struct holding `RwLock<Option<AptosDataClient>>` and `RwLock<Option<Arc<PeersAndMetadata>>>`. Both fields start as `None` and are filled in via `components.set(...)` once the rest of the node finishes initializing.
- `start_inspection_service` now takes `Arc<InspectionServiceComponents>` instead of the live values directly.
- `handle_peer_information_request` accepts `Option<...>` types and returns 503 Service Unavailable ("Node is still initializing") when either value is `None`, so clients know to retry rather than treating it as a hard error.
- In `aptos-node/src/lib.rs`, `start_node_inspection_service` is called immediately after `start_admin_service`, before RocksDB opens. After state sync completes and `aptos_data_client` is available, `inspection_components.set(...)` is called to unlock full endpoint functionality.
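The late-binding pattern described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: `DataClient` and `PeersAndMetadata` are simplified stand-ins for the real types, `serve_peer_information` is a hypothetical helper, and the handler returns a plain `(status, body)` tuple instead of an HTTP response.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-ins for the real `AptosDataClient` and `PeersAndMetadata` types.
#[derive(Clone, Debug, PartialEq)]
pub struct DataClient(pub String);
#[derive(Debug, PartialEq)]
pub struct PeersAndMetadata(pub Vec<String>);

/// Late-bound dependencies; both fields start as `None`.
#[derive(Default)]
pub struct InspectionServiceComponents {
    pub data_client: RwLock<Option<DataClient>>,
    pub peers_and_metadata: RwLock<Option<Arc<PeersAndMetadata>>>,
}

impl InspectionServiceComponents {
    /// Creates an empty, shareable container before the node has initialized.
    pub fn new() -> Arc<Self> {
        Arc::new(Self::default())
    }

    /// Fills in both fields. A second call overwrites the first.
    pub fn set(&self, data_client: DataClient, peers: Arc<PeersAndMetadata>) {
        *self.data_client.write().unwrap() = Some(data_client);
        *self.peers_and_metadata.write().unwrap() = Some(peers);
    }
}

/// Returns `(status, body)`: 503 until both values are present, 200 afterwards.
pub fn handle_peer_information_request(
    data_client: Option<DataClient>,
    peers: Option<Arc<PeersAndMetadata>>,
) -> (u16, String) {
    match (data_client, peers) {
        (Some(client), Some(peers)) => {
            (200, format!("client={}, peers={}", client.0, peers.0.len()))
        }
        _ => (503, "Node is still initializing".to_string()),
    }
}

/// Hypothetical glue: snapshots the components, then calls the handler.
pub fn serve_peer_information(components: &InspectionServiceComponents) -> (u16, String) {
    // Each read guard is a temporary that is dropped at the end of its statement.
    let data_client = components.data_client.read().unwrap().clone();
    let peers = components.peers_and_metadata.read().unwrap().clone();
    handle_peer_information_request(data_client, peers)
}
```

In this sketch, a call to `serve_peer_information` made before `set` yields the 503 tuple, and the same call made afterwards yields 200. This mirrors the startup ordering described above, where the service is running before its dependencies exist.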
Result: Port 9101 opens seconds into node startup. `aptos_dkg_public_params_source` and all other boot-time metrics are scrapeable within the first vmagent poll after the process starts, instead of after RocksDB finishes.
How Has This Been Tested?
- `cargo check -p aptos-inspection-service` and `cargo check -p aptos-node`: clean, no warnings introduced by this change.
- `cargo test -p aptos-inspection-service`: all 13 tests pass, including `test_inspect_peer_information`, which exercises the fully-initialized and disabled-endpoint paths.
Key Areas to Review
- `InspectionServiceComponents::set` is called exactly once, after state sync returns. There is no guard against calling it twice; a second call would silently overwrite the first. This is intentional, since the call site is a single sequential code path, but worth noting.
- `RwLock` choice: `std::sync::RwLock` (not `tokio::sync::RwLock`) is used because the read path runs inside an async handler but only holds the lock for a clone, making it safe to use the sync variant without risking executor starvation.
- `/peer_information` during init: 503 was chosen (over alternatives such as 425) because it is the conventional "retry later" signal for infrastructure scrapers and health checks.
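The first two review points can be illustrated with a small sketch. It assumes a simplified `Components` struct with a single field and a hypothetical `snapshot_peers` helper, neither of which is taken from this PR. The read guard lives only for the statement that clones the `Arc`, and a second `set` replaces the earlier value.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for the real `PeersAndMetadata` type.
#[derive(Debug, PartialEq)]
pub struct PeersAndMetadata(pub Vec<String>);

/// Simplified, single-field version of the components container.
#[derive(Default)]
pub struct Components {
    pub peers_and_metadata: RwLock<Option<Arc<PeersAndMetadata>>>,
}

impl Components {
    /// No guard against repeated calls: the latest value wins.
    pub fn set(&self, peers: Arc<PeersAndMetadata>) {
        *self.peers_and_metadata.write().unwrap() = Some(peers);
    }
}

/// Clones the `Arc` under the read lock. The guard is a temporary that is
/// dropped at the end of the `let` statement, so the caller never holds the
/// lock across later work (for example, an `.await` in an async handler).
pub fn snapshot_peers(components: &Components) -> Option<Arc<PeersAndMetadata>> {
    let peers = components.peers_and_metadata.read().unwrap().clone();
    peers
}
```

Because the clone is only a reference-count increment on the `Arc`, the time spent holding the synchronous lock stays short.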
Type of Change
Which Components or Systems Does This Change Impact?
Note
Medium Risk
Changes node startup ordering and inspection-service initialization by starting the HTTP server before storage/state sync are ready and injecting dependencies later; mistakes could break early boot observability or cause panics if components are set incorrectly.
Overview
Starts the node inspection service earlier in `aptos-node` (immediately after the admin service, before RocksDB/state sync) so `/metrics` is scrapeable from the first moments of startup.
Refactors `aptos-inspection-service` to accept a shared `InspectionServiceComponents` container that is populated later via `set(...)`; endpoints that require late-bound values (notably `/peer_information`) now take `Option` inputs and return 503 Service Unavailable until the components are injected, while tests are updated to construct and pre-populate the new components.
Reviewed by Cursor Bugbot for commit 7490bff. Bugbot is set up for automated code reviews on this repo.