feat(spu): in-process EmbedderClient backend (PR-1.F) #294
Phase 1 of the embedder cutover per docs/specs/weaver-spu-Spec.md §10: introduce the in-process candle-backed embedder backend. No consumer is wired up yet — that's the follow-up, split into:

- PR-A (this): land the `EmbedderClient` + supporting plumbing
- PR-B: backend selection in serve.rs (boot-time pin probe path)
- PR-C: agent-side runtime cutover (herobench/benchmark.rs)
- PR-D: retire grpc_client_legacy + the Python embedder service

## What's in this PR

- `crates/weaver-spu/src/encoder/client.rs` (NEW): `EmbedderClient` struct holding `JinaV4Embedder` behind `Arc<parking_lot::Mutex<…>>`. Implements `weaver_core::embedder::Embedder`. Forward passes run under `tokio::task::spawn_blocking` (the GPU forward is sync and seconds-long; blocking on the runtime would starve other tasks). `parking_lot::Mutex` rather than `std::sync::Mutex` so a panic in `encode_text` doesn't poison the slot for the daemon's lifetime.
- `embed_late_chunked` returns `EmbeddingError::NotAvailable` for now. The token-level forward surface needed to drive `late_chunk.rs` from the in-process path hasn't landed yet; failing loud is preferable to silently degrading to single-vector pooling. Consumers needing late chunking keep `backend = "python"` until the surface lands.
- `encoder/mod.rs`: wire the new `client` module under `cfg(feature = "candle")`, consistent with the rest of the candle encoder modules.
- `encoder/jina_v4.rs`: add a `pub fn max_seq_len()` getter so `EmbedderClient` can populate `EmbedderInfo::max_seq_length` for the cohort-pin guard.
- `encoder/qwen25vl_loraed.rs`: a pre-existing clippy lint (`run_attention_post_proj` reaches 9 args under `feature = "debug-layers"`) passed on default features but blocked `cargo clippy --features flash-attn --lib -- -D warnings`. Added `#[allow(clippy::too_many_arguments)]` — splitting the function for a cfg-gated, diagnostic-only param would be worse.
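The shape of the `client.rs` wrapper can be sketched std-only — this is a rough illustration, not the real implementation: `std::sync::Mutex` and a plain thread stand in for `parking_lot::Mutex` and `tokio::task::spawn_blocking`, and `MockModel` is a hypothetical stand-in for `JinaV4Embedder`:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for JinaV4Embedder: a sync, seconds-long GPU forward
// that needs `&mut self` (KV-cache state in the real model).
struct MockModel;

impl MockModel {
    fn encode_text(&mut self, _text: &str) -> Vec<f32> {
        vec![0.0; 2048] // pretend full-dim embedding
    }
}

// Sketch of the EmbedderClient shape: a shared, mutex-serialized model handle.
// The Embedder trait hands us `&self`, so interior mutability bridges to the
// model's `&mut` forward pass.
#[derive(Clone)]
struct EmbedderClient {
    model: Arc<Mutex<MockModel>>,
}

impl EmbedderClient {
    fn embed(&self, text: String) -> Vec<f32> {
        let model = Arc::clone(&self.model);
        // Stand-in for tokio::task::spawn_blocking: run the sync forward off
        // the caller's thread so an async runtime wouldn't be starved.
        thread::spawn(move || {
            let mut guard = model.lock().expect("model mutex poisoned");
            guard.encode_text(&text)
        })
        .join()
        .expect("forward pass panicked")
    }
}
```

One note on the std stand-in: with `std::sync::Mutex`, a panic inside `encode_text` would poison the lock for every later caller — which is exactly the failure mode the real code avoids by using `parking_lot::Mutex`.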
## Why in-process

Per the project mantra "latency is the enemy of agency": every cross-process hop on every memory read is dead cost. The SPU is one unit (encoder + decoder co-resident in `weaver-infer`); if one half crashes, the whole stack should die. That's the desired failure mode — no silent fallback to a wedged embedder.

## Validation

- `cargo check -p weaver-spu --features flash-attn --lib` — clean
- `cargo clippy -p weaver-spu --features flash-attn --lib -- -D warnings` — clean
- `cargo check -p weaver-spu --features flash-attn --all-targets` — clean (verifies tests + binaries still compile against the lib changes)

End-to-end smoke validation is deferred to PR-B (boot-time pin probe path), where the daemon constructs an `EmbedderClient::from_snapshot` and the cohort-pin guard reads its `info()`.

## Design notes

- `weight_revision` is derived from the snapshot dir basename (the HF cache layout puts the SHA in the `snapshots/<sha>/` directory). Empty string when the path has no usable basename.
- `dimension` is hardcoded to `matryoshka::FULL_DIM` (2048). Matryoshka truncation isn't surfaced through the trait yet; it would become a per-call parameter when callers need it.
- `health_check()` overrides the trait default — it always returns `true` once construction succeeds, since the model is mmapped for the daemon's lifetime and there's no runtime path that flips `model_loaded` back to false.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Actionable comments posted: 1
Inline comments:
In `crates/weaver-spu/src/encoder/client.rs`:
- Around lines 165-167: the current `.await.map_err(...)` chain around the `spawn_blocking` result converts panics into `EmbeddingError::Transport` (wrapping `IoError`) and keeps the daemon running. Change the handling of the `JoinError` returned from awaiting the `spawn_blocking` `JoinHandle`: when `JoinError::is_panic()` is true, call `std::panic::resume_unwind(join_err.into_panic())` to re-propagate the panic; otherwise, map non-panic join errors into `EmbeddingError::Transport` (wrapping `IoError`) as before.
📒 Files selected for processing (4)

- crates/weaver-spu/src/encoder/client.rs
- crates/weaver-spu/src/encoder/jina_v4.rs
- crates/weaver-spu/src/encoder/mod.rs
- crates/weaver-spu/src/encoder/qwen25vl_loraed.rs
…review) CR finding: turning a panic (`JoinError::is_panic()`) from `tokio::task::spawn_blocking` into an `EmbeddingError::Transport` masks bugs as transient transport failures. Per the canonical Tokio pattern (see the `tokio::task::JoinError` docs), `resume_unwind(join_error.into_panic())` is the right propagation when `is_panic()` is true, so downstream error handling sees a real panic vs a real error. Cancellation isn't a path we exercise (we don't cancel the spawn_blocking handle), but if it ever happens, treat it as a transport error so the caller sees something rather than nothing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
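The propagation pattern this fix adopts can be sketched with std threads — `join_or_resume` is a hypothetical helper, and std's `thread::Result` stands in for tokio's `JoinError`:

```rust
use std::panic;
use std::thread;

// Fail-loud join sketch: a panic inside the blocking task is re-raised via
// resume_unwind instead of being wrapped as an ordinary error value, so
// callers see a real panic rather than a fake transport failure. The real
// code branches on tokio's JoinError::is_panic() / into_panic(); std's
// thread::Result plays that role here.
fn join_or_resume<T>(res: thread::Result<T>) -> T {
    match res {
        Ok(value) => value,
        Err(payload) => panic::resume_unwind(payload),
    }
}
```

In the tokio version the non-panic branch is where the `EmbeddingError::Transport` mapping stays; only the panic branch changes.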
…capped value (PR #297 review) The getter's rustdoc said it returned the architectural ceiling read from `config.json::max_position_embeddings`. That was correct when the getter was added in PR #294 but became stale after the cap landed earlier in this PR. Now states explicitly that it returns `min(JINA_V4_TRAINED_CONTEXT, raw.max_position_embeddings)` and references both terms so a reader can see the relationship between the cap, the raw config field, and the value `EmbedderInfo` exposes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
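The relationship the updated rustdoc spells out is just a `min` over the two sources. A sketch with placeholder numbers — the constant's value here is illustrative only, not the real `JINA_V4_TRAINED_CONTEXT`, and `Config` is a hypothetical stand-in for the parsed config:

```rust
// Illustrative placeholder value — NOT the real trained-context constant.
const JINA_V4_TRAINED_CONTEXT: usize = 32_768;

struct Config {
    // Architectural ceiling as read from config.json::max_position_embeddings.
    max_position_embeddings: usize,
}

// What the updated rustdoc documents: the exposed max_seq_len is
// min(JINA_V4_TRAINED_CONTEXT, raw.max_position_embeddings).
fn max_seq_len(cfg: &Config) -> usize {
    JINA_V4_TRAINED_CONTEXT.min(cfg.max_position_embeddings)
}
```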
Design notes
Concurrency model: `encode_text` needs `&mut self` (KV cache state); the trait surfaces `&self`. Wrapped behind `Arc<parking_lot::Mutex>`. Forward passes run under `tokio::task::spawn_blocking` since they're sync + GPU-bound (seconds-long; would starve the runtime if blocking inline). `parking_lot::Mutex` rather than `std::sync::Mutex` so a panic in `encode_text` doesn't poison the slot for the daemon's lifetime.
Late-chunked: returns `EmbeddingError::NotAvailable`. The token-level forward surface needed to drive `late_chunk.rs` from the in-process path hasn't landed yet; failing loud is preferable to silently degrading to single-vector pooling. Consumers needing late chunking keep `backend = "python"` until the surface lands (tracked separately).
`weight_revision`: derived from snapshot dir basename (HF cache layout puts the SHA in `snapshots/<sha>/`). Empty string when the path has no usable basename. Read by the cohort-pin guard for identity-drift detection.
`health_check()` override: always `true` once construction succeeds. The model is mmapped for the daemon's lifetime; no runtime path can flip `model_loaded` back to false. Avoids the round-trip through `info()` of the trait default.
Clippy fix on qwen25vl_loraed.rs: `run_attention_post_proj` reaches 9 args under `feature = "debug-layers"`. Lint passed on default-features but blocked `cargo clippy --features flash-attn --lib -- -D warnings`. Splitting the function for a cfg-gated diagnostic-only param would be worse than the suppression.
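The `weight_revision` derivation above is small enough to sketch directly — a hypothetical free function for illustration (in the real code this logic runs during `EmbedderClient` construction):

```rust
use std::path::Path;

// Derive weight_revision from the HF cache layout: the snapshot SHA is the
// directory basename under snapshots/. Falls back to "" when the path
// yields no usable basename.
fn weight_revision(snapshot_dir: &Path) -> String {
    snapshot_dir
        .file_name()
        .and_then(|name| name.to_str())
        .unwrap_or("")
        .to_string()
}
```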