Skip to content

Fix ClientServer barrier scoping#23

Open
McLavish wants to merge 1 commit into
spcl:mainfrom
McLavish:fix/barrier-comm-scope
Open

Fix ClientServer barrier scoping#23
McLavish wants to merge 1 commit into
spcl:mainfrom
McLavish:fix/barrier-comm-scope

Conversation

@McLavish

Copy link
Copy Markdown

This fixes ClientServer::barrier() incorrectly counting barrier objects across the entire shared backend namespace.

Issue

Previously, each peer uploaded a fully scoped barrier marker using comm_name + peer_id + "_barrier_N", but the wait loop listed all backend objects and counted any key ending in "_barrier_N". With Redis or S3 shared by multiple communicators, stale runs, or other namespaces, unrelated barrier markers could satisfy the count and let a barrier return before this communicator's own peers had arrived.

Solution

Change the barrier to poll the exact expected per-peer marker names, matching the addressing style used by the other collectives.

Changes

  • Replace global object listing in ClientServer::barrier() with exact-name checks for comm_name + peer_id + "_barrier_N".
  • Remove the now-unused get_object_names() interface from ClientServer.
  • Remove Redis KEYS * usage.
  • Remove S3 bucket-wide ListObjects usage.

Performance notes

The old implementation performed a global namespace listing on every poll: Redis used KEYS *, and S3 listed objects in the bucket. That made each poll scale with O(total number of keys/objects) in the shared backend, including unrelated objects.
The new implementation performs exact point reads for the expected barrier markers, so each poll scales with O(num_peers) instead of the whole backend namespace. Barrier markers are one byte, so the read payload is minimal.

A future backend-level exists() API could use Redis EXISTS or S3 HeadObject to avoid reading even that one-byte marker, but this PR keeps the fix small and avoids adding a new backend interface.

Copilot AI review requested due to automatic review settings June 16, 2026 09:37

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR removes the get_object_names() channel API (and its S3/Redis implementations) and updates the ClientServer::barrier() logic to detect peer arrival by probing for each expected barrier object individually rather than listing all objects.

Changes:

  • Removed get_object_names() from the ClientServer interface and from S3/Redis implementations.
  • Reworked ClientServer::barrier() to loop over expected peer-specific barrier object names and attempt downloads to confirm arrival.
  • Cleaned up S3 include usage by removing ListObjectsRequest.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/comm/S3.cpp Removes S3 object listing implementation and associated include.
src/comm/Redis.cpp Removes Redis key listing implementation.
src/comm/ClientServer.cpp Changes barrier implementation to probe per-peer barrier objects via download_object.
include/comm/S3.h Removes get_object_names() from S3 channel interface.
include/comm/Redis.h Removes get_object_names() from Redis channel interface.
include/comm/ClientServer.h Removes get_object_names() from the base channel API.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/comm/ClientServer.cpp
Comment on lines 50 to 65
upload({&b, sizeof(b)}, file_name);
unsigned int elapsed_time = 0;
while (elapsed_time < max_timeout) {
auto objects = get_object_names();
auto has_barrier_suffix = [barrier_suffix] (const std::string& s){return s.size() > barrier_suffix.size() &&
s.compare(s.size() - barrier_suffix.size(), barrier_suffix.size(), barrier_suffix) == 0 ;};
auto num_arrived = std::count_if(objects.begin(), objects.end(), has_barrier_suffix);
if (num_arrived >= num_peers) {
bool all_arrived = true;
for (int i = 0; i < num_peers; i++) {
char arrived = 0;
std::string expected_name = comm_name + std::to_string(i) + barrier_suffix;
if (!download_object({&arrived, sizeof(arrived)}, expected_name)) {
all_arrived = false;
break;
}
}
if (all_arrived) {
return;
} else {
elapsed_time += timeout;

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot, elapsed_time was not part of the PR 💯

Comment thread src/comm/ClientServer.cpp
Comment on lines +53 to +61
bool all_arrived = true;
for (int i = 0; i < num_peers; i++) {
char arrived = 0;
std::string expected_name = comm_name + std::to_string(i) + barrier_suffix;
if (!download_object({&arrived, sizeof(arrived)}, expected_name)) {
all_arrived = false;
break;
}
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably the previous solution had worse scaling performance.

Comment on lines 44 to 49
//! Uploads objects and keeps track of them.
virtual void upload(channel_data buf, std::string name);

//! List all the currently existing objects, needs to be implemented by channels. Needed by some collectives that check for the existence of files, but do not care about their content.
virtual std::vector<std::string> get_object_names() = 0;

//! Delete the object with the given name, needs to be implemented by channels.
virtual void delete_object(std::string name) = 0;

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that an object_exists() backend hook would be cleaner and could map to Redis EXISTS / S3 HeadObject. However, I wanted to keep this PR small by reusing the existing API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants