[CELEBORN-2319] Standalone LifecycleManager && rust sdk by gavin9402 · Pull Request #3677 · apache/celeborn

gavin9402 · 2026-05-07T12:14:54Z

What changes were proposed in this pull request?

This PR introduces two major features to support non-JVM (C++/Rust) clients using Apache Celeborn for shuffle:

1. Standalone LifecycleManager Daemon (Scala/JVM)

Added LifecycleManagerDaemon — a standalone JVM process that hosts a LifecycleManager independently from any compute engine (Spark/Flink) Driver.
Added LifecycleManagerDaemonArguments for CLI argument parsing (--app-id, --master-endpoints, --port, --host, --properties-file).
Added sbin/start-lifecycle-manager.sh launch script with classpath assembly, environment loading, and required-argument validation.
Updated CelebornBuild.scala and service/pom.xml to add celeborn-client as a dependency of the service module (needed because the Daemon instantiates LifecycleManager from the client module).

2. Rust SDK via C++ FFI (rust/ directory)

celeborn-client-sys: Low-level FFI crate using cxx to bridge Rust ↔ C++ Celeborn client. Includes:
- wrapper.h / wrapper.cc: C++ shim exposing 7 functions (create_client, setup_lifecycle_manager, shutdown, push_data, mapper_end, update_reducer_file_group, read_partition_full).
- build.rs: Build script linking Celeborn C++ static libs and system dependencies (folly, protobuf, abseil, boost, etc.) for macOS and Linux.
celeborn-client: Safe, ergonomic Rust wrapper providing ShuffleClient with:
- Input validation (app_id non-empty, port > 0, codec ∈ {NONE, LZ4, ZSTD}).
- Drop-safe shutdown (prevents double ffi::shutdown via UniquePtr::null() swap).
- Convenience method read_partition_all.
Two example programs (data_sum_writer.rs, data_sum_reader.rs) mirroring the existing C++ DataSumWithWriterClient / DataSumWithReaderClient test programs.
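
The input-validation rules listed above (app_id non-empty, port > 0, codec ∈ {NONE, LZ4, ZSTD}) could be sketched roughly as follows. This is an illustration only, not the crate's actual code: `ClientConfig`, `ConfigError`, `Codec`, `parse_codec`, and `validate` are hypothetical names, the `u16` port type is an assumption, and exact-case codec matching is also an assumption.

```rust
use std::fmt;

/// Codecs named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    None,
    Lz4,
    Zstd,
}

/// Hypothetical error type: one variant per validation rule.
#[derive(Debug, PartialEq, Eq)]
pub enum ConfigError {
    EmptyAppId,
    ZeroPort,
    UnknownCodec(String),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            ConfigError::EmptyAppId => write!(f, "app_id must not be empty"),
            ConfigError::ZeroPort => write!(f, "port must be greater than 0"),
            ConfigError::UnknownCodec(c) => write!(f, "unsupported codec: {}", c),
        }
    }
}

impl std::error::Error for ConfigError {}

/// Hypothetical validated configuration (not the crate's real struct).
#[derive(Debug, PartialEq, Eq)]
pub struct ClientConfig {
    pub app_id: String,
    pub port: u16,
    pub codec: Codec,
}

/// Maps a codec name to the enum. Exact-case matching is an assumption.
pub fn parse_codec(name: &str) -> Result<Codec, ConfigError> {
    match name {
        "NONE" => Ok(Codec::None),
        "LZ4" => Ok(Codec::Lz4),
        "ZSTD" => Ok(Codec::Zstd),
        other => Err(ConfigError::UnknownCodec(other.to_string())),
    }
}

/// Checks all three rules before any FFI call would be made.
pub fn validate(app_id: &str, port: u16, codec: &str) -> Result<ClientConfig, ConfigError> {
    if app_id.is_empty() {
        return Err(ConfigError::EmptyAppId);
    }
    if port == 0 {
        return Err(ConfigError::ZeroPort);
    }
    let codec = parse_codec(codec)?;
    Ok(ClientConfig {
        app_id: app_id.to_string(),
        port,
        codec,
    })
}
```

Rejecting bad input on the Rust side keeps malformed values from ever crossing the FFI boundary, where an error would surface as a C++ exception instead of a typed `Result`.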

Why are the changes needed?

Currently, LifecycleManager can only run embedded inside a JVM-based compute engine Driver (e.g., Spark Driver). This makes it impossible for non-JVM applications (Daft engine, etc.) to use Celeborn as their shuffle service, because:

The C++ client requires a running LifecycleManager to coordinate shuffle metadata (register shuffles, allocate slots, manage partition locations) with Celeborn Masters and Workers.
Without a standalone LifecycleManager, non-JVM applications have no way to bootstrap this coordination layer.

By decoupling the LifecycleManager into a standalone daemon process, any client — regardless of language runtime — can connect to it via RPC. The Rust SDK then leverages this architecture to provide first-class Rust support by bridging to the existing, battle-tested C++ client implementation via FFI.

Does this PR resolve a correctness bug?

No

Does this PR introduce any user-facing change?

Yes.

New component: Users can now start a standalone LifecycleManager daemon via sbin/start-lifecycle-manager.sh --app-id <id> --master-endpoints <eps> --port <port>.
New SDK: Rust applications can now use the celeborn-client crate to perform shuffle read/write operations against a Celeborn cluster.
Limitation: The standalone LifecycleManager does not support auth (celeborn.auth.enabled must be false), as the C++/Rust clients lack SASL support.

How was this patch tested?

The Rust SDK was validated using the data_sum_writer and data_sum_reader example programs, which are Rust ports of the existing C++ integration tests (DataSumWithWriterClient.cpp / DataSumWithReaderClient.cpp). These write random numeric data across partitions and verify correctness by comparing partition sums between writer and reader.
The LifecycleManagerDaemon was tested by starting it against a local Celeborn cluster (Master + Workers) and verifying that the Rust examples can successfully connect, push data, and read data through the daemon.
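
The sum-comparison idea behind the two examples can be illustrated without a cluster. The sketch below is not the code of data_sum_writer.rs or data_sum_reader.rs: the record encoding (8-byte little-endian integers), the small pseudo-random generator, and the names `write_partitions` and `sum_partition` are all assumptions made for illustration, and an in-memory `Vec` stands in for the push/read path through Celeborn.

```rust
/// Small deterministic generator so the sketch needs no external crate.
fn next(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

/// "Writer" side: spreads pseudo-random values across partitions and
/// records the expected sum of each partition.
pub fn write_partitions(
    num_partitions: usize,
    records: usize,
    seed: u64,
) -> (Vec<Vec<u8>>, Vec<u64>) {
    assert!(num_partitions > 0, "at least one partition is required");
    let mut state = seed;
    let mut payloads: Vec<Vec<u8>> = vec![Vec::new(); num_partitions];
    let mut sums = vec![0u64; num_partitions];
    for _ in 0..records {
        let value = next(&mut state) % 1000;
        let partition = (next(&mut state) as usize) % num_partitions;
        // In the real examples this is where data would be pushed to Celeborn.
        payloads[partition].extend_from_slice(&value.to_le_bytes());
        sums[partition] += value;
    }
    (payloads, sums)
}

/// "Reader" side: decodes one partition's bytes and sums the values.
pub fn sum_partition(payload: &[u8]) -> u64 {
    payload
        .chunks_exact(8)
        .map(|chunk| {
            let mut raw = [0u8; 8];
            raw.copy_from_slice(chunk);
            u64::from_le_bytes(raw)
        })
        .sum()
}
```

A run passes when `sum_partition` applied to each partition's bytes equals the sum the writer recorded for that partition.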

FMX · 2026-05-08T02:18:42Z

This is amazing. What compute engine do you use?

gavin9402 · 2026-05-08T02:40:13Z

> This is amazing. What compute engine do you use?

@FMX We plan to integrate it into the Daft engine.

afterincomparableyum · 2026-05-09T20:45:55Z

This is really good @gavin9402 , thanks for contributing. I will drop some review comments soon if I have any.

afterincomparableyum

I added some small comments; overall, this looks good to me.

I am fine with the service --> client dependency leaking into master/worker, but others may feel the Daemon doesn't need to live in service. Consider putting it in a new module that depends on both service and client, which would avoid this dependency leak.

afterincomparableyum · 2026-05-09T21:00:04Z

+    shutdownLatch.await()
+  }
+
+  private[lifecyclemanager] def applyArgsToConf(


The --host argument is effectively ignored. LifecycleManagerDaemonArguments parses --host into host: Option[String], but applyArgsToConf only writes MASTER_ENDPOINTS and CLIENT_SHUFFLE_MANAGER_PORT, so the host is never propagated. LifecycleManager binds to lifecycleHost = Utils.localHostName(conf) (LifecycleManager.scala:81), and Utils.localHostName only looks at the CELEBORN_LOCAL_HOSTNAME environment variable or the auto-resolved hostname. There is no conf key path for it.

So someone running start-lifecycle-manager.sh --host 10.0.0.5 gets the default hostname with no warning. Either drop the option, set CELEBORN_LOCAL_HOSTNAME from the script, or call Utils.setCustomHostname(host) before constructing LifecycleManager.

afterincomparableyum · 2026-05-09T21:09:06Z

+    /// already be torn down by `IOThreadPoolExecutor::join()`.
+    pub fn shutdown(mut self) -> Result<()> {
+        if let Some(pinned) = self.inner.as_mut() {
+            ffi::shutdown(pinned)?;


If ffi::shutdown returns Err, the ? propagates and self is dropped normally. Drop::drop then calls ffi::shutdown(pinned) a second time.

The comment on shutdown says the handle is intentionally leaked to avoid a folly EventBase use-after-free during destruction, so calling shutdown twice could re-trigger that teardown path.

I propose either ensuring the handle is leaked even when ffi::shutdown returns an error (so Drop cannot call ffi::shutdown a second time), or adding a “shutdown attempted” flag so Drop skips the shutdown call and only leaks the handle.
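
A minimal sketch of the first option, taking the handle out before calling shutdown so that the error path cannot reach Drop with a live handle. Everything here is a stand-in: `RawHandle` and `ffi_shutdown` mock the real `UniquePtr` and `ffi::shutdown`, the call counter exists only to make the behavior observable, and the `String` error type is an assumption.

```rust
use std::cell::Cell;
use std::mem;
use std::rc::Rc;

/// Mock of the C++ client handle (a cxx UniquePtr in the PR).
pub struct RawHandle {
    shutdown_calls: Rc<Cell<u32>>,
    fail: bool,
}

/// Mock of ffi::shutdown: counts calls and optionally fails.
fn ffi_shutdown(handle: &mut RawHandle) -> Result<(), String> {
    handle.shutdown_calls.set(handle.shutdown_calls.get() + 1);
    if handle.fail {
        Err("shutdown failed".to_string())
    } else {
        Ok(())
    }
}

pub struct ShuffleClient {
    inner: Option<Box<RawHandle>>,
}

impl ShuffleClient {
    /// Hypothetical constructor used only by this sketch.
    pub fn new(shutdown_calls: Rc<Cell<u32>>, fail: bool) -> Self {
        ShuffleClient {
            inner: Some(Box::new(RawHandle { shutdown_calls, fail })),
        }
    }

    /// Takes the handle first, so a second call (for example from Drop)
    /// finds `None` and does nothing.
    fn shutdown_once(&mut self) -> Result<(), String> {
        match self.inner.take() {
            Some(mut handle) => {
                let result = ffi_shutdown(&mut handle);
                // Leak on success AND on failure, so the C++ destructor
                // never runs after a shutdown attempt.
                mem::forget(handle);
                result
            }
            None => Ok(()),
        }
    }

    pub fn shutdown(mut self) -> Result<(), String> {
        self.shutdown_once()
    }
}

impl Drop for ShuffleClient {
    fn drop(&mut self) {
        // No-op if shutdown() already ran, even if it returned Err.
        let _ = self.shutdown_once();
    }
}
```

With this shape, an `Err` from the first shutdown attempt is still returned to the caller, but the subsequent Drop sees an empty slot and skips the FFI call.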

afterincomparableyum · 2026-05-09T21:14:58Z

+    rust::Vec<uint8_t> out;
+    out.reserve(64 * 1024);
+    std::vector<uint8_t> buf(64 * 1024);
+
+    while (true) {
+      int n = stream->read(buf.data(), 0, buf.size());
+      if (n == -1) {
+        break;
+      }
+      if (n <= 0) {
+        throw std::runtime_error(
+            "celeborn-ffi: CelebornInputStream::read returned unexpected non-positive " +
+            std::to_string(n));
+      }
+      for (int i = 0; i < n; ++i) {
+        out.push_back(buf[i]);


Per-byte push_back into rust::Vec<uint8_t> can introduce noticeable overhead for large partition reads, and the initial reserve(64 * 1024) only avoids reallocations for the first buffer chunk.

Consider accumulating into a std::vector<uint8_t> (or appending in larger chunks) and copying once at the end, or resizing the destination per chunk and using memcpy instead of pushing one byte at a time.

This doesn’t need to be addressed in the current PR, but it may be worth leaving a TODO here to revisit for performance improvements.
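
For comparison, here is the chunk-at-a-time pattern written as a Rust-side analogue. This is not the C++ shim from the PR: `std::io::Read` stands in for `CelebornInputStream`, `read_all_chunked` is a hypothetical name, and end of stream is signalled by a read of 0 bytes (as in Rust's `Read` contract) instead of -1.

```rust
use std::io::{self, Read};

/// Same chunk size as the buffer in the reviewed code.
const CHUNK: usize = 64 * 1024;

/// Drains `src` into a Vec, appending each chunk with one bulk copy
/// instead of pushing bytes one at a time.
pub fn read_all_chunked<R: Read>(mut src: R) -> io::Result<Vec<u8>> {
    let mut out = Vec::with_capacity(CHUNK);
    let mut buf = vec![0u8; CHUNK];
    loop {
        match src.read(&mut buf) {
            // A read of zero bytes means end of stream.
            Ok(0) => break,
            // One slice copy per chunk; Vec grows geometrically as needed.
            Ok(n) => out.extend_from_slice(&buf[..n]),
            // Retry reads interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(out)
}
```

The per-chunk cost is a single slice copy plus an occasional reallocation, which is the behavior the suggested memcpy-based C++ change would aim for.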

afterincomparableyum · 2026-05-09T21:27:42Z

+      return
+    }
+
+    if (daemonArgs.port < 1024) {


LifecycleManagerDaemonArguments already calls sys.exit(1) on this condition, so this check is dead code.

gavin9402 · 2026-05-11T06:04:33Z

@afterincomparableyum Thank you for your thorough review. I have made the requested changes per your suggestions above.

夷羿 added 3 commits May 7, 2026 11:47

Support LifecycleManager daemon && rust sdk

7c32952

fix SIGSEGV

c463c1f

start on random port

a194bab

github-actions Bot added kind:build kind:deploy module:service labels May 7, 2026

afterincomparableyum suggested changes May 9, 2026


fix

a3c9f70

github-actions Bot removed the module:service label May 11, 2026

夷羿 added 4 commits May 11, 2026 11:22

fix

8426749

fix

bc19af4

fix

34d6490

fix

6f22ec7

gavin9402 requested a review from afterincomparableyum May 11, 2026 07:17

夷羿 added 2 commits May 11, 2026 15:58

fix

ba37d87

auto make

c08f576

gavin9402 force-pushed the standalone_and_rust branch from 391b486 to c08f576 May 13, 2026 06:39

gavin9402 wants to merge 10 commits into apache:main from gavin9402:standalone_and_rust


3 participants
