chore: support allocating testnets to the local DC #10122
|
Before merging this we need to add a new feature to Farm. The problem is that when the We can't simply increase the 64 to 256 since that will cause performance tests to allocate multiple VMs per host which makes their measurements / benchmarks unreliable. What I think we need is a new
This will cause VMs of performance tests to be load-balanced over the |
| ) | ||
|
|
||
| def _write_version_file_var_impl(ctx): | ||
| """Helper rule that creates a file with the content of the provided var from the version file.""" |
Can you add a sentence explaining what we empirically discovered about when those targets will get rebuilt? Namely (IIRC) if the target (modulo volatile status content) does not exist then it will be built; otherwise the existing one will be used.
Writing this now actually I wonder if there are security implications (since you'll be blindly accepting cached artifacts from a different build) but I guess we anyway assume the cache is not malicious.
| echo "STABLE_FARM_METADATA $STABLE_FARM_METADATA" | ||
|
|
||
| # Used for allocating a Farm testnet to the local DC in CI (Search for allocate_testnet_to_local_dc) | ||
| NODE_NAME="${NODE_NAME:-}" |
| NODE_NAME="${NODE_NAME:-}" | |
| NODE_NAME="${NODE_NAME:-unknown}" |
| icos_images |= icos_config.icos_images | ||
|
|
||
| env_var_files["FARM_METADATA"] = "//rs/tests:farm_metadata.txt" | ||
| env_var_files["DC"] = "//rs/tests:DC.txt" |
is this DC or NODE_NAME? I thought it'd read what's set by workspace_status
This is outdated.
| visibility = ["//visibility:public"], | ||
| ) | ||
|
|
||
| # This is the contents of the ctx.version_file, i.e. all non-STABLE_, |
I think this should be compacted a bit, it's a lot to read. I think it's enough to point out that volatile-status.txt has to be a direct dependency, and when it is so then it behaves somewhat magically.
| # VOLATILE_STATUS_FILE is read during group creation time to determine the required host features and Farm metadata. | ||
| # In colocated tests the group is created in the non-colocated wrapper driver, we shouldn't use runtime_deps | ||
| # to depend on the VOLATILE_STATUS_FILE since that will cause the variable to be repointed to a path on the colocated VM | ||
| # which won't exist during group creation time on the non-colocated wrapper. Instead we pass it as a regular environment variable |
Can you prefix it with RUN_SCRIPT_ and give it the same treatment as the other files read by the wrapper?
|
|
||
| exec \ | ||
| env -C "$TEST_TMPDIR" \ | ||
| VOLATILE_STATUS_FILE="$(realpath "$VOLATILE_STATUS_FILE")" \ |
At this point could we already read out the farm metadata and only set that, instead of silently passing the whole volatile status? Same with DC. It keeps everything tight and the inputs declarative.
Sure, I guess we could do the parsing in bash rather than in Rust.
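A minimal sketch of what that bash-side parsing could look like, assuming the volatile status file keeps Bazel's usual `KEY value` line format. The helper name `status_var` and the temporary sample file are hypothetical and not part of this PR; the sample lines are taken from the example in this PR's diff.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the value stored under key $2 in status file $1.
# Each line has the form "KEY value"; the value may itself contain spaces.
# Returns 1 if the key is not present.
status_var() {
    local file="$1" key="$2" k v
    while read -r k v; do
        if [[ "$k" == "$key" ]]; then
            printf '%s\n' "$v"
            return 0
        fi
    done < "$file"
    return 1
}

# Stand-in for the real volatile status file (assumption: same line format).
status_file="$(mktemp)"
cat > "$status_file" <<'EOF'
BUILD_TIMESTAMP 1778503130
DC zh1
FORMATTED_DATE 2026 May 11 12 38 50 Mon
EOF

DC="$(status_var "$status_file" DC)"
echo "DC=$DC"   # prints DC=zh1
rm -f "$status_file"
```

With something like this, the wrapper would only need to export the extracted values (for example the DC and the Farm metadata) rather than the path to the whole status file.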
| # BUILD_TIMESTAMP 1778503130 | ||
| # DC zh1 | ||
| # FORMATTED_DATE 2026 May 11 12 38 50 Mon | ||
| volatile_status( |
I think we should keep write_info_file_var and this consistent: either we use volatile/non-volatile for both, or we use info_file/version_file for both
I'm not sure what you mean here. WDYM with "both"? Note that we use the volatile status for both the DC and the FARM_METADATA.
| # Used for allocating a Farm testnet to the local DC in CI (Search for allocate_testnet_to_local_dc) | ||
| NODE_NAME="${NODE_NAME:-unknown}" | ||
| echo "DC ${NODE_NAME%%-*}" |
This assumes that the node name is always prefixed with <DC>-, correct? worth adding a note. Even better: I'd export NODE_NAME and do the transformation for the one file that needs the DC
Actually, the assumption breaks for unknown though I think then DC will happen to be unknown
Yeah I can add a note that the DC will be everything up to the first -.
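To make the behavior discussed in this thread concrete, here is a small sketch of the `%%-*` expansion. The function name `dc_from_node_name` is hypothetical and only wraps the expansion used in the diff, with the `unknown` default from the suggestion above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: derive the DC from a node name by stripping the
# longest suffix that starts at the first "-".
# An unset or empty argument falls back to "unknown".
dc_from_node_name() {
    local node_name="${1:-unknown}"
    echo "${node_name%%-*}"
}

dc_from_node_name "dm1-spm34"   # prints dm1
dc_from_node_name "unknown"     # contains no "-", prints unknown
dc_from_node_name ""            # empty input uses the default, prints unknown
```

Because `unknown` contains no `-`, the expansion leaves it unchanged, which matches the observation above that the DC then happens to be `unknown`.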
What
Support optionally allocating Farm testnets to the same DC where the GitHub runner is running.
Why
The nested system-tests are quite flaky:
It's because these are the only tests that use the SetupOS disk images, which are very large (2.6G). Downloading these images on Farm hosts often times out, especially if the transfer has to cross the Atlantic, i.e. when the Farm host is in zh1 and the image was built in dm1 or vice versa.
We should avoid these cross-DC transfers. One way of doing that is forcing a testnet to be allocated to the same DC where the image was created, which is the DC where the GitHub runner is running.
How
- SystemTestGroup: .allocate_testnet_to_local_dc(). When set the Farm group is created with required_host_features set to the DC of the GitHub runner.
- NODE_NAME environment variable. This will have a value on CI like dm1-spm34 where the part before the - denotes the DC.
- DC is outputted as a bazel volatile status variable in bazel/workspace_status.sh. It has to be volatile because we don't want a different node to invalidate a previously cached test.
- //bazel:volatile-status.txt target is introduced which declares the volatile status file as the rule's default output.
- system_tests will use //bazel:volatile-status.txt as a runtime dependency.

Farm metadata
The FARM_METADATA used to be a stable variable, meaning that a change in user or CI job would invalidate a previously cached system-test. We should work towards the property that once CI has run a system-test, running the same test locally on the same commit fetches the result from the cache instead of running the test again. For this reason the FARM_METADATA has also been made volatile.