
feat(spu): in-process EmbedderClient backend (PR-1.F)#294

Merged
toddwbucy merged 2 commits into main from feat/spu-in-process-embedder-client on May 8, 2026

Conversation

@toddwbucy
Owner

@toddwbucy toddwbucy commented May 8, 2026

Summary

  • Introduce the in-process candle-backed embedder backend per `docs/specs/weaver-spu-Spec.md` §10 Phase 1 (PR-1.F).
  • No consumer wired yet — follow-ups: backend selection in `serve.rs` (#116), agent-side runtime cutover in `herobench/benchmark.rs` (#118), Python embedder retirement (#TBD).
  • Implements `weaver_core::embedder::Embedder` against `JinaV4Embedder`. `embed_late_chunked` deliberately fails loud (`EmbeddingError::NotAvailable`) until the token-level forward surface lands.

Why in-process

Per the project mantra "latency is the enemy of agency": every cross-process hop on every memory read is dead cost. The SPU is one unit (encoder + decoder co-resident in `weaver-infer`); if one half crashes, the whole stack should die. That's the desired failure mode — no silent fallback to a wedged embedder.

What's in this PR

| File | Change | Why |
|---|---|---|
| `crates/weaver-spu/src/encoder/client.rs` | NEW (~190 LOC + tests) | `EmbedderClient` struct + `Embedder` trait impl |
| `crates/weaver-spu/src/encoder/mod.rs` | +5 LOC | Wire `client` module under `cfg(feature = "candle")` |
| `crates/weaver-spu/src/encoder/jina_v4.rs` | +6 LOC | Pub `max_seq_len()` getter for `EmbedderInfo::max_seq_length` |
| `crates/weaver-spu/src/encoder/qwen25vl_loraed.rs` | +1 LOC | `#[allow(clippy::too_many_arguments)]` on cfg-gated debug fn |

Design notes

  • Concurrency model: `encode_text` needs `&mut self` (KV cache state) while the trait surfaces `&self`, so the embedder is wrapped in `Arc<parking_lot::Mutex<…>>`. Forward passes run under `tokio::task::spawn_blocking`, since they're sync and GPU-bound (seconds-long; they would starve the runtime if run inline). `parking_lot::Mutex` is used rather than `std::sync::Mutex` so a panic in `encode_text` doesn't poison the slot for the daemon's lifetime.

  • Late-chunked: returns `EmbeddingError::NotAvailable`. The token-level forward surface needed to drive `late_chunk.rs` from the in-process path hasn't landed yet; failing loud is preferable to silently degrading to single-vector pooling. Consumers needing late chunking keep `backend = "python"` until the surface lands (tracked separately).

  • `weight_revision`: derived from the snapshot dir basename (the HF cache layout puts the SHA in `snapshots/<sha>/`). Empty string when the path has no usable basename. Read by the cohort-pin guard for identity-drift detection.

  • `health_check()` override: always `true` once construction succeeds. The model is mmapped for the daemon's lifetime; no runtime path can flip `model_loaded` back to false. Avoids the round-trip through `info()` of the trait default.

  • Clippy fix on qwen25vl_loraed.rs: `run_attention_post_proj` reaches 9 args under `feature = "debug-layers"`. Lint passed on default-features but blocked `cargo clippy --features flash-attn --lib -- -D warnings`. Splitting the function for a cfg-gated diagnostic-only param would be worse than the suppression.

Test plan

  • CodeRabbit review clean
  • `cargo check -p weaver-spu --features flash-attn --lib` — clean
  • `cargo clippy -p weaver-spu --features flash-attn --lib -- -D warnings` — clean
  • `cargo check -p weaver-spu --features flash-attn --all-targets` — clean (verifies tests + binaries still compile against lib changes)
  • End-to-end smoke (deferred to PR-B): daemon boot constructs `EmbedderClient::from_snapshot`; cohort-pin guard reads its `info()`

Follow-ups (existing tasks)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • In-process local embedder added for faster, on-device embedding generation.
    • Exposes model max sequence length so longer inputs are handled correctly.
    • Note: token-level embedding output is not available from the in-process backend.
  • Chores

    • Module export updates and small code quality/style tweaks.

Phase 1 of the embedder cutover per docs/specs/weaver-spu-Spec.md §10:
introduce the in-process candle-backed embedder backend. No consumer is
wired up yet — that's the follow-up split into:

  - PR-A (this): land the EmbedderClient + supporting plumbing
  - PR-B: backend selection in serve.rs (boot-time pin probe path)
  - PR-C: agent-side runtime cutover (herobench/benchmark.rs)
  - PR-D: retire grpc_client_legacy + Python embedder service

## What's in this PR

- `crates/weaver-spu/src/encoder/client.rs` (NEW): `EmbedderClient`
  struct holding `JinaV4Embedder` behind `Arc<parking_lot::Mutex<…>>`.
  Implements `weaver_core::embedder::Embedder`. Forward passes run under
  `tokio::task::spawn_blocking` (the GPU forward is sync and seconds-long;
  blocking on the runtime would starve other tasks). `parking_lot::Mutex`
  rather than `std::sync::Mutex` so a panic in `encode_text` doesn't
  poison the slot for the daemon's lifetime.

- `embed_late_chunked` returns `EmbeddingError::NotAvailable` for now.
  The token-level forward surface needed to drive `late_chunk.rs` from
  the in-process path hasn't landed yet; failing loud is preferable to
  silently degrading to single-vector pooling. Consumers needing late
  chunking keep `backend = "python"` until the surface lands.

- `encoder/mod.rs`: wire the new `client` module under
  `cfg(feature = "candle")` (consistent with the rest of the candle
  encoder modules).

- `encoder/jina_v4.rs`: add `pub fn max_seq_len()` getter so
  `EmbedderClient` can populate `EmbedderInfo::max_seq_length` for the
  cohort-pin guard.

- `encoder/qwen25vl_loraed.rs`: pre-existing clippy lint
  (`run_attention_post_proj` reaches 9 args under
  `feature = "debug-layers"`) was passing on default-features but blocked
  `cargo clippy --features flash-attn --lib -- -D warnings`. Added
  `#[allow(clippy::too_many_arguments)]` — splitting the function for a
  cfg-gated diagnostic-only param would be worse.
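The fail-loud contract for late chunking described above can be sketched as follows; the error and return types here are simplified stand-ins, not the crate's actual definitions.

```rust
// Simplified stand-ins for the crate's error and result types.
#[derive(Debug)]
enum EmbeddingError {
    NotAvailable(String),
}

// Late-chunked embedding fails loud until the token-level forward surface
// lands, rather than silently degrading to single-vector pooling.
fn embed_late_chunked(_text: &str) -> Result<Vec<Vec<f32>>, EmbeddingError> {
    Err(EmbeddingError::NotAvailable(
        "token-level forward surface not available in-process".to_string(),
    ))
}

fn main() {
    let res = embed_late_chunked("some document");
    assert!(matches!(res, Err(EmbeddingError::NotAvailable(_))));
    println!("late-chunked request failed loud as expected");
}
```

A caller that needs late chunking therefore sees an explicit error and keeps `backend = "python"`, instead of receiving a single pooled vector that looks like success.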

## Why in-process

Per the project mantra "latency is the enemy of agency": every cross-process hop on every memory read is dead cost. The SPU is one unit (encoder + decoder co-resident in `weaver-infer`); if one half crashes, the whole stack should die. That's the desired failure mode — no silent fallback to a wedged embedder.

## Validation

- `cargo check -p weaver-spu --features flash-attn --lib` — clean
- `cargo clippy -p weaver-spu --features flash-attn --lib -- -D warnings` — clean
- `cargo check -p weaver-spu --features flash-attn --all-targets` — clean
  (verifies tests + binaries still compile against the lib changes)

End-to-end smoke validation deferred to PR-B (boot-time pin probe path),
where the daemon constructs an `EmbedderClient::from_snapshot` and the
cohort-pin guard reads its `info()`.

## Design notes

- `weight_revision` derived from snapshot dir basename (HF cache layout
  puts the SHA in the `snapshots/<sha>/` directory). Empty string when
  the path has no usable basename.
- `dimension` hardcoded to `matryoshka::FULL_DIM` (2048). Matryoshka
  truncation isn't surfaced through the trait yet; would be a per-call
  parameter when callers need it.
- `health_check()` overrides the trait default — always returns `true`
  once construction succeeds, since the model is mmapped for the
  daemon's lifetime and there's no runtime path that flips
  `model_loaded` back to false.
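The `weight_revision` derivation above fits in a few lines; this sketch uses a hypothetical helper name, assuming only the HF cache layout described (commit SHA as the snapshot directory's basename).

```rust
use std::path::Path;

// Hypothetical helper: derive a weight revision from an HF cache snapshot
// directory, e.g. .../models--jinaai--jina-embeddings-v4/snapshots/<sha>/.
// Returns an empty string when the path has no usable basename, matching
// the behavior described in the design notes.
fn weight_revision(snapshot_dir: &Path) -> String {
    snapshot_dir
        .file_name()
        .and_then(|s| s.to_str())
        .map(str::to_owned)
        .unwrap_or_default()
}

fn main() {
    let dir = Path::new("/cache/hub/snapshots/8826da3");
    assert_eq!(weight_revision(dir), "8826da3");
    // A root path has no basename, so the revision is empty.
    assert_eq!(weight_revision(Path::new("/")), "");
    println!("revision: {}", weight_revision(dir));
}
```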

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 8, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4793ca59-171c-402e-ab63-fcf21a397810

📥 Commits

Reviewing files that changed from the base of the PR and between 98122b5 and 8826da3.

📒 Files selected for processing (1)
  • crates/weaver-spu/src/encoder/client.rs

📝 Walkthrough

Walkthrough

Adds an in-process, candle-backed EmbedderClient using Jina V4 snapshots; caches EmbedderInfo (including max sequence length), serializes GPU access via a mutex, validates embedding shapes, implements Embedder trait (embed, info, health_check) and returns NotAvailable for late-chunked requests; exposes client module under the candle feature.

Changes

In-Process Embedder Client Implementation

| Layer | File(s) | Summary |
|---|---|---|
| JinaV4 API Extension | `crates/weaver-spu/src/encoder/jina_v4.rs` | Public `max_seq_len()` accessor exposes the embedder's stored max sequence length. |
| EmbedderClient Constructor | `crates/weaver-spu/src/encoder/client.rs` | `from_snapshot` loads `JinaV4Embedder`, derives the weight revision, caches `EmbedderInfo` (dimension, max_seq_length), and wraps the embedder in `Arc<parking_lot::Mutex<_>>`. |
| Core Embedder Trait Methods | `crates/weaver-spu/src/encoder/client.rs` | `embed` uses `tokio::spawn_blocking`, locks the embedder for GPU serialization, runs `encode_text`, and validates batch size and per-vector dimensionality; `embed_late_chunked` returns `NotAvailable`; `info` returns cached metadata; `health_check` returns `true`. |
| Error Handling | `crates/weaver-spu/src/encoder/client.rs` | Private `IoError` wrapper maps blocking/join/anyhow failures into `EmbeddingError::Transport`. |
| Module Export | `crates/weaver-spu/src/encoder/mod.rs` | Publicly export the `client` submodule behind the `candle` feature for daemon wiring. |
| Linting | `crates/weaver-spu/src/encoder/qwen25vl_loraed.rs` | Added `#[allow(clippy::too_many_arguments)]` to `run_attention_post_proj`; no logic changes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

🐰 In candlelight I learned to hum,
I load the weights and beat the drum,
A mutex keeps the GPU tame,
Vectors march with proper frame,
Late chunks—no flight, I whisper "shh."

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(spu): in-process EmbedderClient backend (PR-1.F)' accurately summarizes the main change: a new in-process embedder client backed by candle/`JinaV4Embedder`. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, above the 80.00% threshold. |
| Linked Issues check | ✅ Passed | Check skipped: no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped: no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `crates/weaver-spu/src/encoder/client.rs`, around lines 165-167: the `.await.map_err(...)` chain around the `spawn_blocking` result converts panics into `EmbeddingError::Transport` (`IoError`) and keeps the daemon running. Instead, check `JoinError::is_panic()` on the `JoinError` from the awaited `JoinHandle` and call `std::panic::resume_unwind(join_err.into_panic())` to re-propagate panics; map non-panic join errors into `EmbeddingError::Transport` (wrapping `IoError`) as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 354d8066-91ec-4518-b4b5-da8a185e123d

📥 Commits

Reviewing files that changed from the base of the PR and between c2ceb87 and 98122b5.

📒 Files selected for processing (4)
  • crates/weaver-spu/src/encoder/client.rs
  • crates/weaver-spu/src/encoder/jina_v4.rs
  • crates/weaver-spu/src/encoder/mod.rs
  • crates/weaver-spu/src/encoder/qwen25vl_loraed.rs

Comment thread: `crates/weaver-spu/src/encoder/client.rs`
…review)

CR finding: turning a `JoinError::is_panic()` from `tokio::task::spawn_blocking` into an `EmbeddingError::Transport` masks bugs as transient transport failures. Per the canonical Tokio pattern (see the `tokio::task::JoinError` docs), `resume_unwind(join_error.into_panic())` is the right propagation when `is_panic()` is true, so error handling downstream sees a real panic vs a real error.

Cancellation isn't a path we exercise (we don't cancel the spawn_blocking
handle), but if it ever happens, treat it as a transport error so the
caller sees something rather than nothing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
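The propagation pattern this finding asks for can be sketched with `std::thread` as a stdlib stand-in for `tokio::task::spawn_blocking` (the `run_blocking` name is hypothetical; tokio's `JoinError` additionally distinguishes cancellation, which would map to the transport error as noted above).

```rust
use std::panic;
use std::thread;

// Run a closure on a worker thread; re-raise its panic at the call site
// instead of flattening it into an error value.
fn run_blocking<T: Send + 'static>(f: impl FnOnce() -> T + Send + 'static) -> T {
    match thread::spawn(f).join() {
        Ok(v) => v,
        // std's join() only fails when the closure panicked, so this arm is
        // the analogue of tokio's JoinError::is_panic() == true.
        Err(payload) => panic::resume_unwind(payload),
    }
}

fn main() {
    assert_eq!(run_blocking(|| 21 * 2), 42);
    // A worker panic surfaces as a panic, not a transport error;
    // catch_unwind observes it here purely for demonstration.
    let caught = panic::catch_unwind(|| run_blocking(|| -> u32 { panic!("bug") }));
    assert!(caught.is_err());
    println!("panic re-propagated: {}", caught.is_err());
}
```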
@toddwbucy toddwbucy merged commit d6ca4e8 into main May 8, 2026
1 check passed
@toddwbucy toddwbucy deleted the feat/spu-in-process-embedder-client branch May 8, 2026 02:26
toddwbucy pushed a commit that referenced this pull request May 8, 2026
…capped value (PR #297 review)

The getter's rustdoc said it returned the architectural ceiling read
from `config.json::max_position_embeddings`. That was correct when
the getter was added in PR #294 but became stale after the cap landed
earlier in this PR. Now states explicitly that it returns
`min(JINA_V4_TRAINED_CONTEXT, raw.max_position_embeddings)` and
references both terms so a reader can see the relationship between
the cap, the raw config field, and the value `EmbedderInfo` exposes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
