Skip to content

chore: support allocating testnets to the local DC#10122

Open
basvandijk wants to merge 20 commits into
masterfrom
basvandijk/allocate_testnet_to_local_dc
Open

chore: support allocating testnets to the local DC#10122
basvandijk wants to merge 20 commits into
masterfrom
basvandijk/allocate_testnet_to_local_dc

Conversation

@basvandijk
Copy link
Copy Markdown
Collaborator

@basvandijk basvandijk commented May 7, 2026

What

Support optionally allocating Farm testnets to the same DC as were the GitHub runner is running.

Why

The nested system-tests are quite flaky:

$ bazel run //ci/githubstats:query -- top 100 flaky --gt 0 --columns=label,total,flaky,flaky% --include=%nested%
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┑
│    │ label                                                       │   total │   flaky │   flaky% │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┥
│  0 │ //rs/tests/nested/nns_recovery:nr_all_broken_seq_np_actions │      37 │       3 │      8.1 │
│  1 │ //rs/tests/nested:hostos_upgrade_smoke_test                 │      45 │       2 │      4.4 │
│  2 │ //rs/tests/nested/nns_recovery:nr_broken_dfinity_node       │      37 │       1 │      2.7 │
│  3 │ //rs/tests/nested:hostos_upgrade_smoke_test_head_nns        │      37 │       1 │      2.7 │
│  4 │ //rs/tests/nested:registration                              │      45 │       1 │      2.2 │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┙

It's because these are the only tests that use the SetupOS disk images which are very large (2.6G). Downloading these images on Farm hosts often times out. Especially if the transfer has to cross the Atlantic, i.e. when the Farm host is in zh1 and the image was built in dm1 or vice versa.

We should avoid these cross-DC transfers. One way of doing that is forcing a testnet to be allocated to the same DC as were the image was created which is the DC where the the GitHub runner is running.

How

  • This introduces the new function on SystemTestGroup: .allocate_testnet_to_local_dc(). When set the Farm group is created with required_host_features set to the DC of the GitHub runner.
  • The DC of the GitHub runner is determined via the NODE_NAME environment variable. This will have a value on CI like dm1-spm34 where the part before the - denotes the DC.
  • The DC is outputted as a bazel volatile status variable in bazel/workspace_status.sh. It has to be volatile because we don't want a different node to invalidate a previously cached test.
  • A new //bazel:volatile-status.txt target is introduced which declares the volatile status file as the rule's default output.
  • system_tests will use //bazel:volatile-status.txt as a runtime dependency.

Farm metadata

The FARM_METADATA used to be a stable variable meaning that a change in user or CI job would invalidate a previously cached system-test. We should work towards having the property that when CI has ran a system-test running the same test locally on the same commit should cause the test to be fetched from the cache instead of being run again. For this reason the FARM_METADATA has also been made volatile.

@github-actions github-actions Bot added the chore label May 7, 2026
@basvandijk
Copy link
Copy Markdown
Collaborator Author

basvandijk commented May 7, 2026

Before merging this we need to add a new feature to Farm. The problem is that when the bazel-test-all job is run from dm1 (which is often the case since we have more runners there than in zh1) we quickly allocate all available resources on the dm1 Farm hosts. This is because these Farm hosts were originally meant for performance tests and hence only allow a maximum of 64 vCPUs per host while regular hosts allow 256 vCPUs.

We can't simply increase the 64 to 256 since that will cause performance tests to allocate multiple VMs per host which makes their measurements / benchmarks unreliable.

What I think we need is a new VmAllocationMode that is specifically designed for performance tests. Let's call it PerformanceAllocation for now. Then we would increase the max vCPUs to 256 but then the semantics of PerformanceAllocation is that it would behave like the default MinIntraDistanceLoadBalanceAllocation but remove its first ordering property:

  • the number of VMs of the group that the host has already allocated in
    descending order (VMs of a testnet are grouped together on a host as
    much as possible).

This will cause VMs of performance tests to be load-balanced over the dm1 hosts instead of being colocated together on a single host before spilling over to other hosts.

@basvandijk basvandijk marked this pull request as ready for review May 7, 2026 18:09
@basvandijk basvandijk requested review from a team as code owners May 7, 2026 18:09
@basvandijk basvandijk added the CI_ALL_BAZEL_TARGETS Runs all bazel targets label May 8, 2026
Comment thread bazel/defs.bzl Outdated
)

def _write_version_file_var_impl(ctx):
"""Helper rule that creates a file with the content of the provided var from the version file."""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a sentence explaining what we empirically discovered about when those targets will get rebuilt? Namely (IIRC) if the target (modulo volatile status content) does not exist then it will be built; otherwise the existing one will be used.

Writing this now actually I wonder if there are security implications (since you'll be blindly accepting cached artifacts from a different build) but I guess we anyway assume the cache is not malicious.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread bazel/workspace_status.sh Outdated
echo "STABLE_FARM_METADATA $STABLE_FARM_METADATA"

# Used for allocating a Farm testnet to the local DC in CI (Search for allocate_testnet_to_local_dc)
NODE_NAME="${NODE_NAME:-}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
NODE_NAME="${NODE_NAME:-}"
NODE_NAME="${NODE_NAME:-unknown}"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread rs/tests/system_tests.bzl Outdated
icos_images |= icos_config.icos_images

env_var_files["FARM_METADATA"] = "//rs/tests:farm_metadata.txt"
env_var_files["DC"] = "//rs/tests:DC.txt"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this DC or NODE_NAME? I thought it'd read what's set by workspace_status

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is outdated.

…stem_tests.bzl and not in bazel/BUILD.bazel anymore"

This reverts commit 57abc75.
This reverts commit a896b76.
This reverts commit 4e15a93.
Comment thread bazel/BUILD.bazel
visibility = ["//visibility:public"],
)

# This is the contents of the ctx.version_file, i.e. all non-STABLE_,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be compacted a bit, it's a lot to read. I think it's enough to point out that volatile-status.txt has to be a direct dependency, and when it is so then it behaves somewhat magically.

Comment thread rs/tests/system_tests.bzl
# VOLATILE_STATUS_FILE is read during group creation time to determine the required host features and Farm metadata.
# In colocated tests the group is created in the non-colocated wrapper driver, we shouldn't use runtime_deps
# to depend on the VOLATILE_STATUS_FILE since that will cause the variable to be repointed to a path on the colocated VM
# which won't exist during group creaton time on the non-colocated wrapper. Instead we pass it as a regular environment variable
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you prefix it with RUN_SCRIPT_ and give it the same treatment as the other files read by the wrapper?

Comment thread rs/tests/run_systest.sh

exec \
env -C "$TEST_TMPDIR" \
VOLATILE_STATUS_FILE="$(realpath "$VOLATILE_STATUS_FILE")" \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point could we already read out the farm metadata and only set that, instead of silently passing the whole volatile status? Same with DC. It keeps everything tight and the inputs declarative.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I guess we could do the parsing in bash rather than in Rust.

Comment thread bazel/BUILD.bazel
# BUILD_TIMESTAMP 1778503130
# DC zh1
# FORMATTED_DATE 2026 May 11 12 38 50 Mon
volatile_status(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep write_info_file_var and this consistent: either we use volatile/non-volatile for both, or we use info_file/version_file for both

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean here. WDYM with "both"? Note that we use the volatile status for both the DC and the FARM_METADATA.

Comment thread bazel/workspace_status.sh
Comment on lines +31 to +33
# Used for allocating a Farm testnet to the local DC in CI (Search for allocate_testnet_to_local_dc)
NODE_NAME="${NODE_NAME:-unknown}"
echo "DC ${NODE_NAME%%-*}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that the node name is always prefixed with <DC>-, correct? worth adding a note. Even better: I'd export NODE_NAME and do the transformation for the one file that needs the DC

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the assumption breaks for unknown though I think then DC will happen to be unknown

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I can add a node that the DC will be everything up to the first -.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants