Fix ClientServer barrier scoping#23
Conversation
…d now unused get_object_names.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR removes the get_object_names() channel API (and its S3/Redis implementations) and updates the ClientServer::barrier() logic to detect peer arrival by probing for each expected barrier object individually rather than listing all objects.
Changes:
- Removed
get_object_names()from theClientServerinterface and from S3/Redis implementations. - Reworked
ClientServer::barrier()to loop over expected peer-specific barrier object names and attempt downloads to confirm arrival. - Cleaned up S3 include usage by removing
ListObjectsRequest.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/comm/S3.cpp | Removes S3 object listing implementation and associated include. |
| src/comm/Redis.cpp | Removes Redis key listing implementation. |
| src/comm/ClientServer.cpp | Changes barrier implementation to probe per-peer barrier objects via download_object. |
| include/comm/S3.h | Removes get_object_names() from S3 channel interface. |
| include/comm/Redis.h | Removes get_object_names() from Redis channel interface. |
| include/comm/ClientServer.h | Removes get_object_names() from the base channel API. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| upload({&b, sizeof(b)}, file_name); | ||
| unsigned int elapsed_time = 0; | ||
| while (elapsed_time < max_timeout) { | ||
| auto objects = get_object_names(); | ||
| auto has_barrier_suffix = [barrier_suffix] (const std::string& s){return s.size() > barrier_suffix.size() && | ||
| s.compare(s.size() - barrier_suffix.size(), barrier_suffix.size(), barrier_suffix) == 0 ;}; | ||
| auto num_arrived = std::count_if(objects.begin(), objects.end(), has_barrier_suffix); | ||
| if (num_arrived >= num_peers) { | ||
| bool all_arrived = true; | ||
| for (int i = 0; i < num_peers; i++) { | ||
| char arrived = 0; | ||
| std::string expected_name = comm_name + std::to_string(i) + barrier_suffix; | ||
| if (!download_object({&arrived, sizeof(arrived)}, expected_name)) { | ||
| all_arrived = false; | ||
| break; | ||
| } | ||
| } | ||
| if (all_arrived) { | ||
| return; | ||
| } else { | ||
| elapsed_time += timeout; |
There was a problem hiding this comment.
Copilot, elapsed_time was not part of the PR 💯
| bool all_arrived = true; | ||
| for (int i = 0; i < num_peers; i++) { | ||
| char arrived = 0; | ||
| std::string expected_name = comm_name + std::to_string(i) + barrier_suffix; | ||
| if (!download_object({&arrived, sizeof(arrived)}, expected_name)) { | ||
| all_arrived = false; | ||
| break; | ||
| } | ||
| } |
There was a problem hiding this comment.
Arguably the previous solution had worse scaling performance.
| //! Uploads objects and keeps track of them. | ||
| virtual void upload(channel_data buf, std::string name); | ||
|
|
||
| //! List all the currently existing objects, needs to be implemented by channels. Needed by some collectives that check for the existence of files, but do not care about their content. | ||
| virtual std::vector<std::string> get_object_names() = 0; | ||
|
|
||
| //! Delete the object with the given name, needs to be implemented by channels. | ||
| virtual void delete_object(std::string name) = 0; | ||
|
|
There was a problem hiding this comment.
Agreed that an object_exists() backend hook would be cleaner and could map to Redis EXISTS / S3 HeadObject. However, I wanted to keep this PR small by reusing the existing API.
This fixes
ClientServer::barrier()incorrectly counting barrier objects across the entire shared backend namespace.Issue
Previously, each peer uploaded a fully scoped barrier marker using
comm_name + peer_id + "_barrier_N", but the wait loop listed all backend objects and counted any key ending in"_barrier_N". With Redis or S3 shared by multiple communicators, stale runs, or other namespaces, unrelated barrier markers could satisfy the count and let a barrier return before this communicator's own peers had arrived.Solution
Change the barrier to poll the exact expected per-peer marker names, matching the addressing style used by the other collectives.
Changes
ClientServer::barrier()with exact-name checks forcomm_name + peer_id + "_barrier_N".get_object_names()interface fromClientServer.KEYS *usage.ListObjectsusage.Performance notes
The old implementation performed a global namespace listing on every poll: Redis used
KEYS *, and S3 listed objects in the bucket. That made each poll scale with O(total number of keys/objects) in the shared backend, including unrelated objects.The new implementation performs exact point reads for the expected barrier markers, so each poll scales with O(
num_peers) instead of the whole backend namespace. Barrier markers are one byte, so the read payload is minimal.A future backend-level
exists()API could use RedisEXISTSor S3HeadObjectto avoid reading even that one-byte marker, but this PR keeps the fix small and avoids adding a new backend interface.